Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Robert Wilton <rwilton@cisco.com> Wed, 30 August 2017 16:44 UTC

To: Andy Bierman <andy@yumaworks.com>, Juergen Schoenwaelder <j.schoenwaelder@jacobs-university.de>, Xufeng Liu <Xufeng_Liu@jabil.com>, "netmod@ietf.org" <netmod@ietf.org>
References: <599F0991.7020900@tail-f.com> <BN3PR0201MB0867A248887538077CD5D49FF19B0@BN3PR0201MB0867.namprd02.prod.outlook.com> <20170825125254.6nhnzkrar6fhu7zr@elstar.local> <BN3PR0201MB086796F09BFD77FCD718C21BF19E0@BN3PR0201MB0867.namprd02.prod.outlook.com> <20170828154640.pzg7jfy5uepkb22q@elstar.local> <c8de6140-af50-0a4b-a479-b011a8dfbbe7@cisco.com> <CABCOCHRNt3Tkxy8Ffz3JGgPe-rQYwZ3MTLmD43OQi4P6tZQJmg@mail.gmail.com> <f7151a6b-9deb-52ad-62a9-78b29a552540@cisco.com> <20170830102902.2n5q6rgq2x2dxfq2@elstar.local> <e8482a9c-cba3-28e2-9ffa-ec5eb5c1c0a4@cisco.com> <20170830123156.cssrg5kklpo67fie@elstar.local> <CABCOCHTtN611FO2ov2kTLtZx-Q3=tzgH7Xk9uGvFUD1WuyMZyw@mail.gmail.com>
From: Robert Wilton <rwilton@cisco.com>
Message-ID: <b13c5e9a-e9f9-96e9-8823-0402fb74af09@cisco.com>
Date: Wed, 30 Aug 2017 17:44:01 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
MIME-Version: 1.0
In-Reply-To: <CABCOCHTtN611FO2ov2kTLtZx-Q3=tzgH7Xk9uGvFUD1WuyMZyw@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------398571CD940287B8D048BEF3"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/netmod/CDC6qkKXlCIjUcVLmENbBTA_U2g>
Subject: Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Hi,

On 30/08/2017 15:52, Andy Bierman wrote:
>
>
> On Wed, Aug 30, 2017 at 5:31 AM, Juergen Schoenwaelder
> <j.schoenwaelder@jacobs-university.de> wrote:
>
>     On Wed, Aug 30, 2017 at 12:48:19PM +0100, Robert Wilton wrote:
>     >
>     >
>     > On 30/08/2017 11:29, Juergen Schoenwaelder wrote:
>     > > On Wed, Aug 30, 2017 at 10:16:30AM +0100, Robert Wilton wrote:
>     > > > Hi Andy,
>     > > >
>     > > > What I am suggesting makes it easier for readers, because I am a
>     > > > proponent of simpler regular expressions that are easy to read
>     > > > and understand.
>     > > >
>     > > > For example, I wonder how many YANG model readers would
>     > > > immediately comprehend what this pattern statement means:
>     > > >
>     > > > pattern "\p{Sc}\p{Zs}?\p{Nd}+\.\p{Nd}{2}"?
>     > > >
>     > > > Does allowing such patterns really make it easier for model
>     > > > readers?
>     > > This is always difficult to judge but to be fair you have to show
>     > > how you express _the same_ (and not a subset) with some other kind
>     > > of regular expressions. (My understanding is that \p{Sc} is a
>     > > currency symbol.)
>     > Yes, the expression would cover a currency amount, along with
>     > associated symbol (e.g. "$200.00").
>     >
>     > If I was writing a module, I would probably use the following
>     > pattern statement instead, which I think a lot more people would
>     > likely be able to comprehend:
>     >
>     > pattern "[A-Z]{3}\s?\d+\.\d{2}", using the 3 letter, ISO 4217,
>     > currency codes.  e.g. ("USD 200.00")
>
>     But that is not the same. Apples versus oranges. (I expect people to
>     tell me that (i) currency is irrelevant and (ii) that three ASCII
>     letter currency acronyms are better than currency symbols anyway but
>     this is a separate discussion I am not interested in.)
>
>     > >
>     > > > The proposed guidelines obviously make it easier (or at least no
>     > > > harder) for tool makers.
>     > > >
>     > > > I agree that there is a minor impact to model writers, but really
>     > > > only in the sense that the guidelines would be telling them not to
>     > > > use the esoteric options of the XML regex syntax that they
>     > > > probably don't know about anyway.
>     > > What is 'esoteric' largely depends on your language environment.
>     > > What you are saying by 'do not use \p{}' is essentially 'do not use
>     > > any unicode long live ASCII'.
>     > No, that is not my intention, i.e. I'm not suggesting banning all use
>     > of \p{}, but instead limiting it to the character classes that seem
>     > like they may plausibly be used in standardized YANG modules.
>
>     This is entirely subjective. And if you still allow some \p{}, what is
>     the point of the exercise?
>
>     > I'm not trying to change what 6020/7950 defines the pattern statement
>     > as, just give what I perceive as some pragmatic guidance as to what
>     > parts of XML RE it makes sense to use in standardized YANG modules,
>     > making it easier for readers and implementations.
>     >
>     > I think that it is fine for companies, vendors, etc to use the full
>     > breadth of XML RE if they wish.
>
>     Implementations have to be prepared to handle XSD pattern if they
>     claim compliance to YANG 1.0 and 1.1. So all this only helps
>     non-compliant implementations. This may indeed be a goal - but then we
>     should spell this out as such - this helps non-compliant
>     implementations (and they may still fail on the first \p{} that
>     you still allow).
>
>     If implementations do not implement the YANG pattern statement but
>     something else, then they should ignore patterns they can't
>     understand and treat the pattern as if it would have been in a
>     description clause - i.e., leave it to humans to write the code that
>     implements the pattern correctly. Note that YANG does not say anything
>     how stuff is implemented.
>
>
>
> This does not work.
> There are 3 outcomes from the regex compiler
>
> 1) proper syntax was used and accepted; pattern matches correctly
> 2) improper syntax was used and accepted; pattern matches incorrectly
> 3) improper syntax was used and rejected; compiler error generated
>
> Case (2) is the really bad one and we have seen it in bug reports.
>
> This issue was discussed in detail for almost 2 years and the
> conclusion was that a YANG extension would be used to specify other
> pattern types than the XSD pattern mandated by the standard.
I actually think that XML RE is a good choice for YANG pattern
statements (because it is one of the simpler RE languages); I just
don't think that we need all of it.


First question: How many pattern statements in draft and standard IETF
YANG modules actually use Unicode properties (e.g. \p{})?
Answer: Just 2, each adding a zone to the end of an IPv4/IPv6 address.

E.g.
       pattern
           '(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.){3}'
         + '([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'
         + '(%[\p{N}\p{L}]+)?';

This could quite possibly have been written just as
"(\d{1,3}\.){3}\d{1,3}(%\w+)?" and not use Unicode properties at all.

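As a rough check of a simplified pattern of this kind (with the dot
inside the repeated group), here is a minimal sketch using Python's re
module. It assumes re is an acceptable stand-in for this particular
pattern, since it uses no XSD-only syntax; the name SIMPLE_IPV4, the
helper matches() and the sample strings are made up for illustration.
Note that the short form is looser than the original: it no longer
limits each octet to 0-255, and \w in Python also admits "_".

```python
import re

# Simplified IPv4-with-zone pattern: three "digits + dot" groups, a final
# group of digits, then an optional "%zone" suffix.
SIMPLE_IPV4 = re.compile(r"(\d{1,3}\.){3}\d{1,3}(%\w+)?")


def matches(value: str) -> bool:
    # YANG/XSD patterns must match the whole value, so use fullmatch().
    return SIMPLE_IPV4.fullmatch(value) is not None


# "999.0.2.1" is included to show that octet ranges are no longer checked.
for sample in ["192.0.2.1", "192.0.2.1%eth0", "192.0.2", "999.0.2.1"]:
    print(sample, matches(sample))
```
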
There are a couple more occurrences of Unicode character classes in the
vendor models on GitHub, but only to restrict them to the ASCII
character set (oh the irony), which I believe can be accomplished
without resorting to Unicode properties.


Another question: How often is character class subtraction (e.g.
[A-Z-[PQ]]) used in the standard and GitHub YANG modules?
Answer: 0.  AFAICT, it isn't used at all, anywhere ...



Now, I'm not proposing using a different regex syntax for pattern
statements, just a sensible subset of XSD RE, such that it is easier for
folks to read/review pattern statements, and it is easier for client and
server implementations to translate into other common regex
implementations if they so wish.

Of course, as part of that translation, I would expect a translation 
function to check and generate an error if the translation cannot handle 
the input regex (e.g. if it uses an obscure unmatched unicode property 
or a unicode block, or character class subtraction syntax).  This really 
doesn't seem hard to me.
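
A minimal sketch of such a checking translation, assuming Python's re
module as the target engine. The names UnsupportedPattern,
translate_xsd_pattern and yang_match are hypothetical, and the set of
refused constructs is only the ones mentioned above plus the XML name
escapes (\i, \c). It also assumes that "^" and "$" are ordinary
characters in XSD and so need escaping for Python. It does not attempt
to reconcile the finer differences in what \d, \w, \s and "." match.

```python
import re


class UnsupportedPattern(ValueError):
    """The pattern uses XSD syntax that this sketch refuses to translate."""


def translate_xsd_pattern(pattern: str) -> str:
    """Return a Python `re` pattern, or raise UnsupportedPattern."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 == len(pattern):
                raise UnsupportedPattern("dangling backslash")
            nxt = pattern[i + 1]
            if nxt in "pP":
                raise UnsupportedPattern("Unicode property or block: \\" + nxt)
            if nxt in "iIcC":
                raise UnsupportedPattern("XML name escape: \\" + nxt)
            out.append(ch + nxt)  # other escapes are copied through unchanged
            i += 2
            continue
        if in_class:
            if ch == "[":
                # A bare "[" inside a class only occurs in subtraction syntax.
                raise UnsupportedPattern("character class subtraction")
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch in "^$":
            # Assumption: ordinary characters in XSD, anchors in Python.
            ch = "\\" + ch
        out.append(ch)
        i += 1
    return "".join(out)


def yang_match(pattern: str, value: str) -> bool:
    # XSD patterns match the whole value, hence fullmatch().
    return re.fullmatch(translate_xsd_pattern(pattern), value) is not None


print(yang_match(r"[A-Z]{3}\s?\d+\.\d{2}", "USD 200.00"))
```

Refusing the pattern outright, rather than passing it through, keeps the
failure visible at load time instead of producing a silently wrong match.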

But the XML RE language has stuff in it that I don't think anyone is
ever going to use in a standardized network management YANG model.
Forcing everyone to implement support for this stuff just seems like a
complete waste of time and effort.  Looking at the regex info website it
looks like there are about 143 unicode properties and blocks defined (it
may be incomplete), of which I think that 135+ probably have no
relevance in network management YANG modules, and the benefit of the
remaining ones is pretty suspect.

I mean, how many network management YANG modules really need a pattern 
statement that only matches Runic characters?  Perhaps someone out there 
is busy defining "middle-earth.yang" ;-)

If I am the only person opposed to making life unnecessarily difficult
for readers of YANG models, and for client/server tool implementors
interacting with YANG, then it is probably time to give up this
discussion. ;-)

Python, quite likely a common tool for client side network management,
also doesn't seem to have any support for unicode properties or blocks
in its native regex engine.  Perhaps implementations will hook it up to
libxml2 instead, or write a full XML RE to Python RE conversion tool.
But probably most people will just feed the pattern statement into the
native Python regex engine, and my guess is that this will probably work
95% of the time.  The other 5% ... who knows what will happen ... oh
well, better to try and fail than to not try at all.
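
To make "who knows what will happen" a little more concrete, here is a
small sketch that feeds two of the patterns discussed in this thread
straight into Python's re module, untranslated. The helper name
classify is made up for illustration, and the behaviour described is
what I would expect from recent Python 3 releases; other versions and
other engines may differ. The \p{} pattern is rejected with a compile
error (Andy's case 3), while the class subtraction pattern is accepted
but parsed as an ordinary class followed by a literal "]", so it
matches the wrong strings (case 2).

```python
import re
import warnings


def classify(pattern: str, value: str) -> str:
    """Report what Python's re does with an XSD-style pattern, untranslated."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # hide any FutureWarning noise
            compiled = re.compile(pattern)
    except re.error as exc:
        return f"rejected: {exc}"
    return "match" if compiled.fullmatch(value) else "no match"


# Unicode properties: the unknown escape \p is a compile-time error.
print(classify(r"\p{Sc}\p{Zs}?\p{Nd}+\.\p{Nd}{2}", "$200.00"))

# Class subtraction: compiles, but not with the XSD meaning.
# Under XSD rules "A" should match and "P]" should not.
print(classify("[A-Z-[PQ]]", "A"))
print(classify("[A-Z-[PQ]]", "P]"))
```
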

Apologies if this email comes across as a rant.

Rob


>
>
>     /js
>
>
> Andy
>
>     --
>     Juergen Schoenwaelder           Jacobs University Bremen gGmbH
>     Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
>     Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>
>
>