Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Robert Wilton <rwilton@cisco.com> Thu, 31 August 2017 13:53 UTC

To: Lou Berger <lberger@labn.net>, Andy Bierman <andy@yumaworks.com>, Juergen Schoenwaelder <j.schoenwaelder@jacobs-university.de>, Xufeng Liu <Xufeng_Liu@jabil.com>, netmod@ietf.org
From: Robert Wilton <rwilton@cisco.com>
Message-ID: <a4def542-f161-dc65-d23c-039d2cc1811c@cisco.com>
Date: Thu, 31 Aug 2017 14:53:35 +0100
Archived-At: <https://mailarchive.ietf.org/arch/msg/netmod/Fq33UHyIYB_MItQJxaZ6bWTk2xc>

Hi Lou, all

Proposed 6087bis text inline below.

On 30/08/2017 21:28, Lou Berger wrote:
>
> Rob
>
> Speaking as a contributor.
>
>
> On August 30, 2017 12:44:42 PM Robert Wilton <rwilton@cisco.com> wrote:
>
> > Hi,
> >
> > On 30/08/2017 15:52, Andy Bierman wrote:
> >>
> >>
> >> On Wed, Aug 30, 2017 at 5:31 AM, Juergen Schoenwaelder
> >> <j.schoenwaelder@jacobs-university.de> wrote:
> >>
> >>     On Wed, Aug 30, 2017 at 12:48:19PM +0100, Robert Wilton wrote:
> >>     >
> >>     >
> >>     > On 30/08/2017 11:29, Juergen Schoenwaelder wrote:
> >>     > > On Wed, Aug 30, 2017 at 10:16:30AM +0100, Robert Wilton wrote:
> >>     > > > Hi Andy,
> >>     > > >
> >>     > > > What I am suggesting makes it easier for readers, because I am a
> >>     > > > proponent of simpler regular expressions that are easy to read
> >>     > > > and understand.
> >>     > > >
> >>     > > > For example, I wonder how many YANG model readers would
> >>     > > > immediately comprehend what this pattern statement means:
> >>     > > >
> >>     > > > pattern "\p{Sc}\p{Zs}?\p{Nd}+\.\p{Nd}{2}"?
> >>     > > >
> >>     > > > Does allowing such patterns really make it easier for model
> >>     > > > readers?
> >>     > > This is always difficult to judge but to be fair you have to show
> >>     > > how you express _the same_ (and not a subset) with some other kind
> >>     > > of regular expressions. (My understanding is that \p{Sc} is a
> >>     > > currency symbol.)
> >>     > Yes, the expression would cover a currency amount, along with
> >>     > associated symbol (e.g. "$200.00").
> >>     >
> >>     > If I was writing a module, I would probably use the following
> >>     > pattern statement instead, which I think a lot more people would
> >>     > likely be able to comprehend:
> >>     >
> >>     > pattern "[A-Z]{3}\s?\d+\.\d{2}", using the 3 letter, ISO 4217,
> >>     > currency codes.  e.g. ("USD 200.00")
> >>
> >>     But that is not the same. Apples versus oranges. (I expect people to
> >>     tell me that (i) currency is irrelevant and (ii) that three ASCII
> >>     letter currency acronyms are better than currency symbols anyway but
> >>     this is a separate discussion I am not interested in.)
> >>
> >>     > >
> >>     > > > The proposed guidelines obviously make it easier (or at least no
> >>     > > > harder) for tool makers.
> >>     > > >
> >>     > > > I agree that there is a minor impact to model writers, but really
> >>     > > > only in the sense that the guidelines would be telling them not
> >>     > > > to use the esoteric options of the XML regex syntax that they
> >>     > > > probably don't know about anyway.
> >>     > > What is 'esoteric' largely depends on your language environment.
> >>     > > What you are saying by 'do not use \p{}' is essentially 'do not use
> >>     > > any unicode long live ASCII'.
> >>     > No, that is not my intention, i.e. I'm not suggesting banning all
> >>     > use of \p{}, but instead limiting it to the character classes that
> >>     > seem like they may plausibly be used in standardized YANG modules.
> >>
> >>     This is entirely subjective. And if you still allow some \p{}, what is
> >>     the point of the exercise?
> >>
> >>     > I'm not trying to change what 6020/7950 defines the pattern
> >>     > statement as, just give what I perceive as some pragmatic guidance
> >>     > as to what parts of XML RE it makes sense to use in standardized
> >>     > YANG modules, making it easier for readers and implementations.
> >>     >
> >>     > I think that it is fine for companies, vendors, etc to use the full
> >>     > breadth of XML RE if they wish.
> >>
> >>     Implementations have to be prepared to handle XSD pattern if they
> >>     claim compliance to YANG 1.0 and 1.1. So all this only helps
> >>     non-compliant implementations. This may indeed be a goal - but then we
> >>     should spell this out as such - this helps non-compliant
> >>     implementations (and they may still fail on the first \p{} that
> >>     you still allow).
> >>
> >>     If implementations do not implement the YANG pattern statement but
> >>     something else, then they should ignore patterns they can't
> >>     understand and treat the pattern as if it would have been in a
> >>     description clause - i.e., leave it to humans to write the code that
> >>     implements the pattern correctly. Note that YANG does not say anything
> >>     about how stuff is implemented.
> >>
> >>
> >>
> >> This does not work.
> >> There are 3 outcomes from the regex compiler
> >>
> >> 1) proper syntax was used and accepted; pattern matches correctly
> >> 2) improper syntax was used and accepted; pattern matches incorrectly
> >> 3) improper syntax was used and rejected; compiler error generated
> >>
> >> Case (2) is the really bad one and we have seen it in bug reports.
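
The three outcomes listed above can be reproduced by handing XSD
patterns, unchanged, to a non-XSD engine. The sketch below uses
Python's built-in "re" module as that engine. The helper name
classify() and the sample strings are made up for illustration, and
the "should match" lists encode one reading of the XSD semantics.

```python
import re
import warnings


def classify(xsd_pattern, should_match, should_not_match):
    """Feed an XSD pattern unchanged to Python's re and report the outcome.

    Returns 1 (accepted, matches as XSD intends), 2 (accepted, but
    matches incorrectly) or 3 (rejected by the regex compiler).
    """
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # hide any parser hints
            compiled = re.compile(xsd_pattern)
    except re.error:
        return 3
    ok = all(compiled.fullmatch(s) for s in should_match) and not any(
        compiled.fullmatch(s) for s in should_not_match
    )
    return 1 if ok else 2


# Common-subset syntax: accepted and matches as intended.
print(classify(r"[A-Z]{3}\s?\d+\.\d{2}", ["USD 200.00"], ["$200.00"]))
# Character class subtraction: accepted, but read as a plain class
# followed by a literal "]", so "b" no longer matches.
print(classify(r"[a-z-[aeiou]]", ["b"], ["a"]))
# Unicode property escape: rejected as a bad escape.
print(classify(r"\p{Sc}\p{Nd}+", ["$200"], ["USD"]))
```

With a current CPython the three calls land in cases 1, 2 and 3
respectively; other engines may sort the same patterns differently.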
> >>
> >> This issue was discussed in detail for almost 2 years and the
> >> conclusion was
> >> that a YANG extension would be used to specify other pattern types than
> >> the XSD pattern mandated by the standard.
> > I actually think that XML RE is a good choice for YANG pattern
> > statements (because it is one of the simpler RE languages); I just
> > don't think that we need all of it.
> >
> >
> > First question: How many pattern statements in draft and standard IETF
> > YANG modules actually use Unicode properties (e.g. \p{})?
> > Answer: Just 2, both to add a zone at the end of an IPv4/IPv6 address.
> >
> > E.g.       pattern
> > '(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.){3}'
> >        + '([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'
> >        + '(%[\p{N}\p{L}]+)?';
> >
> > This could quite possibly have been written just as
> > "(\d{1,3}\.){3}\d{1,3}(%\w+)?" and not use Unicode properties at all.
> >
> > There are a couple more occurrences of Unicode character classes in the
> > vendor models on github, but only to restrict them to the ASCII
> > character set (oh the irony), which I believe can be accomplished
> > without resorting to Unicode properties.
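
To make the trade-off between the strict and the simplified IPv4
pattern concrete, here is a small comparison in Python. Two
assumptions: the simplified pattern is written with the dot grouped
with the digits, so that the group repeats three times (which is taken
to be the intent), and "[^\W_]" is used as a rough stand-in for
[\p{N}\p{L}], since Python's "re" module has no \p{} escapes.

```python
import re

OCTET = r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"

# Strict form: every octet is limited to 0..255.
STRICT = re.compile(r"(" + OCTET + r"\.){3}" + OCTET + r"(%[^\W_]+)?")

# Simplified form: any one to three digits per octet.
LOOSE = re.compile(r"(\d{1,3}\.){3}\d{1,3}(%\w+)?")

for addr in ["192.0.2.1", "192.0.2.1%eth0", "999.0.2.1", "192.0.2"]:
    print(addr, bool(STRICT.fullmatch(addr)), bool(LOOSE.fullmatch(addr)))
```

Both forms accept the first two strings and reject the last one. Only
the simplified form accepts "999.0.2.1", which is the price of the
shorter pattern.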
> >
> >
> > Another question: How often is character class subtraction (e.g.
> > [A-Z-[PQ]]) used in standard & the github YANG modules?
> > Answer: 0.  AFAICT, it isn't used at all, anywhere ...
> >
> >
> >
> > Now, I'm not proposing using a different regex syntax for pattern
> > statements, just a sensible subset of XSD RE, such that it is easier for
> > folks to read/review pattern statements, and it is easier for client and
> > server implementations to translate into other common regex
> > implementations if they so wish.
> >
> > Of course, as part of that translation, I would expect a translation
> > function to check and generate an error if the translation cannot handle
> > the input regex (e.g. if it uses an obscure unmatched unicode property
> > or a unicode block, or character class subtraction syntax).  This really
> > doesn't seem hard to me.
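
A minimal sketch of such a checking translation step, in Python. The
names xsd_to_python() and UntranslatablePattern are hypothetical, and
the scan is a deliberately conservative heuristic rather than a real
XSD regex parser: it may reject some patterns that are in fact
harmless, and a real converter would need to handle more than this.

```python
import re


class UntranslatablePattern(ValueError):
    """Raised when an XSD pattern uses a construct this sketch won't convert."""


# (probe, description) pairs for constructs Python's re does not share
# with XSD. Heuristic text scan only.
_UNSUPPORTED = [
    (re.compile(r"\\[pP]\{"), "Unicode property or block escape"),
    (re.compile(r"\\[iIcC]"), "XML name-character shorthand"),
    (re.compile(r"-\["), "character class subtraction"),
]


def xsd_to_python(xsd_pattern):
    """Return a compiled Python regex, or raise if the pattern looks unsafe."""
    # Drop escaped backslashes first, so a literal backslash followed by
    # "p{" is not mistaken for a property escape.
    scrubbed = xsd_pattern.replace("\\\\", "")
    for probe, what in _UNSUPPORTED:
        if probe.search(scrubbed):
            raise UntranslatablePattern(f"{what} in {xsd_pattern!r}")
    return re.compile(xsd_pattern)


print(bool(xsd_to_python(r"[A-Z]{3}\s?\d+\.\d{2}").fullmatch("USD 200.00")))
for bad in (r"\p{Sc}\p{Nd}+", r"[a-z-[aeiou]]"):
    try:
        xsd_to_python(bad)
    except UntranslatablePattern as exc:
        print("rejected:", exc)
```

Callers would use fullmatch() on the result, because XSD patterns
always have to match the whole value.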
> >
> > But the XML RE language has stuff in it that I don't think anyone is
> > ever going to use in a standardized network management YANG model.
> > Forcing everyone to implement support for this stuff just seems like a
> > complete waste of time and effort.  Looking at the regex info website it
> > looks like there are about 143 unicode properties and blocks defined (it
> > may be incomplete), of which I think that 135+ probably have no
> > relevance in network management YANG modules, and the benefit of the
> > remaining ones is pretty suspect.
> >
> > I mean, how many network management YANG modules really need a pattern
> > statement that only matches Runic characters?  Perhaps someone out there
> > is busy defining "middle-earth.yang" ;-)
> >
> > If I am the only person opposed to making life unnecessarily difficult
> > for readers of YANG models, and client/server tool implementors
> > interacting with YANG, then it is probably time to give up this
> > discussion. ;-)
> >
>
> I agree with you 100%
>
> And I see Xufeng's proposal for 6087bis as an attempt at putting some
> language together to support this desire. Perhaps you can suggest
> alternate language.
>
I propose adding the following 2 paragraphs to the 6087bis section on
patterns and ranges:

NEW:
To ensure patterns are easy to read and implement, authors SHOULD
restrict themselves to the parts of the XML schema regular expression
language that are common across most regular expression languages.  In
particular, pattern statements SHOULD avoid using 'character class
subtraction' (e.g. '[a-z-[aeiou]]').  They SHOULD avoid matching
Unicode properties and blocks (e.g. '\p{L}' or '\p{IsBasicLatin}').
They MAY use the '\d', '\w', '\s' character class shorthands and their
negated versions, where appropriate, but SHOULD avoid other character
class shorthands.  To match only the ASCII digits 0-9, the character
class '[0-9]' MUST be used instead of the '\d' character class
shorthand, which also matches Unicode digits in all scripts.

Pattern statements do not have to strictly restrict numerical values,
and a simple, less specific pattern may be preferable to a more
complex and precise one, e.g. as illustrated in the
'ipv4-address-no-zone' example pattern below.
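
As an aside (not part of the proposed text), the '\d' versus '[0-9]'
distinction is easy to demonstrate with Python's "re" module, whose
'\d' on text strings also matches decimal digits from any script. The
sample strings are chosen purely for illustration.

```python
import re

ascii_digits = "2017"
# The same number written with Arabic-Indic digits (U+0660..U+0669).
arabic_indic_digits = "\u0662\u0660\u0661\u0667"

for text in (ascii_digits, arabic_indic_digits):
    print(
        ascii(text),
        bool(re.fullmatch(r"\d+", text)),
        bool(re.fullmatch(r"[0-9]+", text)),
    )
```

Both strings satisfy '\d+', but only the ASCII one satisfies '[0-9]+'.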


Or, in the context of the existing 6087bis text:

*** Patterns and Ranges

For string data types, if a machine-readable pattern
can be defined for the desired semantics, then
one or more pattern statements SHOULD be present.
A single quoted string SHOULD be used to specify the pattern,
since a double-quoted string can modify the content.

To ensure patterns are easy to read and implement, authors SHOULD
restrict themselves to the parts of the XML schema regular expression
language that are common across most regular expression languages.  In
particular, pattern statements SHOULD avoid using 'character class
subtraction' (e.g. '[a-z-[aeiou]]').  They SHOULD avoid matching
Unicode properties and blocks (e.g. '\p{L}' or '\p{IsBasicLatin}').
They MAY use the '\d', '\w', '\s' character class shorthands and their
negated versions, where appropriate, but SHOULD avoid other character
class shorthands.  To match only the ASCII digits 0-9, the character
class '[0-9]' MUST be used instead of the '\d' character class
shorthand, which also matches Unicode digits in all scripts.

Pattern statements do not have to strictly restrict numerical values,
and a simple, less specific pattern may be preferable to a more
complex and precise one, e.g. as illustrated in the
'ipv4-address-no-zone' example pattern below.

The following typedef from ^RFC6991^ demonstrates the proper
use of the "pattern" statement:

     typedef ipv4-address-no-zone {
       type inet:ipv4-address {
         pattern '[0-9\.]*';
       }
       ...
     }

For string data types, if the length of the string
is required to be bounded in all implementations,
then a length statement MUST be present.

The following typedef from ^RFC6991^ demonstrates the proper
use of the "length" statement:

     typedef yang-identifier {
       type string {
         length "1..max";
         pattern '[a-zA-Z_][a-zA-Z0-9\-_.]*';
         pattern '.|..|[^xX].*|.[^mM].*|..[^lL].*';
       }
       ...
     }

For numeric data types, if the values allowed
by the intended semantics are different than
those allowed by the unbounded intrinsic data
type (e.g., 'int32'), then a range statement SHOULD be present.

The following typedef from ^RFC6991^ demonstrates the proper
use of the "range" statement:

     typedef dscp {
       type uint8 {
          range "0..63";
       }
       ...
     }
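
As an aside on the yang-identifier typedef above (not 6087bis text):
it stays within the common regex subset, so its two patterns can be
checked directly with Python's "re" module. YANG ANDs multiple pattern
statements together, and each must match the whole value, hence
fullmatch(). The function name and the test strings are illustrative
only.

```python
import re

# The two pattern statements from the yang-identifier typedef.
PATTERNS = [
    re.compile(r"[a-zA-Z_][a-zA-Z0-9\-_.]*"),
    re.compile(r".|..|[^xX].*|.[^mM].*|..[^lL].*"),
]


def is_yang_identifier(value):
    # length "1..max", and every pattern must match the entire string.
    return len(value) >= 1 and all(p.fullmatch(value) for p in PATTERNS)


for candidate in ["interface-name", "xml-data", "XmLfoo", "xm", "9lives", ""]:
    print(repr(candidate), is_yang_identifier(candidate))
```

The second pattern is what rejects names starting with "xml" in any
letter case, while still allowing short names such as "xm".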

Thanks,
Rob


> Lou
>
>
> > Python, quite likely a common tool for client side network management,
> > also doesn't seem to have any support for unicode properties or blocks.
> > Perhaps implementations will hook it up to libxml2 instead, or write a
> > full XML RE to Python RE conversion tool. But probably most people
> > will just feed the pattern statement into the native Python regex
> > engine, and my guess is that this will probably work 95% of the time.
> > The other 5% ... who knows what will happen ... oh well, better to try
> > and fail than to not try at all.
> >
> > Apologies if this email comes across as a rant.
> >
> > Rob
> >
> >
> >>
> >>
> >>     /js
> >>
> >>
> >> Andy
> >>
> >>     --
> >>     Juergen Schoenwaelder           Jacobs University Bremen gGmbH
> >>     Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
> >>     Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>
> >>