Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Ladislav Lhotka <lhotka@nic.cz> Wed, 06 September 2017 08:33 UTC

From: Ladislav Lhotka <lhotka@nic.cz>
To: Robert Wilton <rwilton@cisco.com>, Lou Berger <lberger@labn.net>, netmod@ietf.org
Date: Wed, 06 Sep 2017 10:33:42 +0200
Archived-At: <https://mailarchive.ietf.org/arch/msg/netmod/VZH8Rg5lQZCIIovVdVib7VZ9UWQ>

Robert Wilton wrote on Wed, 06 Sep 2017 at 08:52 +0100:
> Hi Lou,
> 
> This is the addition to 6087bis that I propose. Note that this is the
> same text as in my email of 31 August.
> 
> I propose adding the following 2 paragraphs to the 6087bis section on
> patterns and ranges:
> 
> NEW:
> To ensure patterns are easy to read and implement, authors SHOULD
> restrict themselves to the parts of the XML schema regular expression
> language that are common across most regular expression languages.  In
> particular, pattern statements SHOULD avoid using 'character class
> subtraction' (e.g. '[a-z-[aeiou]]').  They SHOULD avoid matching
> Unicode properties and blocks (e.g. '\p{L}' or '\p{IsBasic_Latin}').
> They MAY use the '\d', '\w', '\s' character class shorthands and their
> negated versions, where appropriate, but SHOULD avoid other character
> class shorthands.  To match ASCII digits 0-9 the character class

I don't agree: things like \p{L} may be useful, at least in this part of the
world.
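
As a rough illustration of that point, here is a Python sketch. Python's
standard re module has no \p{L}, so the hypothetical helper all_letters()
approximates the XSD pattern \p{L}+ by checking Unicode general categories;
the helper name and the sample name are assumptions made for this example
only.

```python
import re
import unicodedata

# ASCII-only letter pattern, the kind one ends up with when \p{L} is avoided.
ascii_only = re.compile(r"[a-zA-Z]+")

def all_letters(s: str) -> bool:
    # Hypothetical stand-in for the XSD pattern \p{L}+ :
    # every character must belong to a Unicode letter category (Lu, Ll, ...).
    return len(s) > 0 and all(unicodedata.category(c).startswith("L") for c in s)

# "Dvorak" with r-caron (U+0159) and a-acute (U+00E1), written as escapes
# so that the example does not depend on the file encoding.
name = "Dvo\u0159\u00e1k"

print(ascii_only.fullmatch(name) is not None)  # ASCII-only pattern rejects the name
print(all_letters(name))                       # letter-category check accepts it
```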

Moreover, in YANG \w means "any Unicode character not defined as punctuation,
separator, or other", but it may mean something else in a programming language,
possibly also depending on the locale setting. This is a slippery slope:
developers should not assume that they can take a regex from YANG, enclose it
in ^..$, and then feed it into an RE-matching function.
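
A small Python sketch of how this can go wrong, assuming Python 3 and its
standard re module (the function names naive_match and grouped_match are
made up for this illustration and do not come from any YANG tool):

```python
import re

def naive_match(yang_pattern: str, value: str) -> bool:
    # The naive port: enclose the YANG pattern in ^...$ and search.
    return re.search("^" + yang_pattern + "$", value) is not None

def grouped_match(yang_pattern: str, value: str) -> bool:
    # Somewhat safer: group the whole pattern and require a full match.
    # This still uses Python's meaning of \w, \d, etc., not the XSD one.
    return re.fullmatch("(?:" + yang_pattern + ")", value) is not None

# 1. Alternation: '^a|b$' means "starts with a" OR "ends with b".
print(naive_match("a|b", "ab"))        # True, although XSD 'a|b' rejects "ab"
print(grouped_match("a|b", "ab"))      # False

# 2. '$' in Python also matches just before a trailing newline.
print(naive_match("[0-9]+", "42\n"))   # True
print(grouped_match("[0-9]+", "42\n")) # False

# 3. \w differs: Python accepts '_' (category Pc, excluded by XSD \w)
#    and rejects '+' (category Sm, accepted by XSD \w).
print(grouped_match(r"\w", "_"))       # True in Python
print(grouped_match(r"\w", "+"))       # False in Python

# 4. \d on Python 3 str patterns matches digits in all scripts,
#    e.g. ARABIC-INDIC DIGIT THREE (U+0663).
print(grouped_match(r"\d", "\u0663"))  # True
```

Grouping plus a full match takes care of the anchoring problems in cases 1
and 2, but cases 3 and 4 show that the character class shorthands still need
to be translated explicitly.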

Lada

> '[0-9]' MUST be used instead of the '\d' character class shorthand
> that matches Unicode digits in all scripts.
> 
> Pattern statements do not have to strictly restrict numerical values,
> and a simple less specific pattern may be preferable over a more
> complex and precise pattern, e.g. as illustrated in the
> 'ipv4-address-no-zone' example pattern below.
> 
> 
> Or, put in the context of the existing 6087bis text:
> 
> *** Patterns and Ranges
> 
> For string data types, if a machine-readable pattern
> can be defined for the desired semantics, then
> one or more pattern statements SHOULD be present.
> A single quoted string SHOULD be used to specify the pattern,
> since a double-quoted string can modify the content.
> 
> To ensure patterns are easy to read and implement, authors SHOULD
> restrict themselves to the parts of the XML schema regular expression
> language that are common across most regular expression languages.  In
> particular, pattern statements SHOULD avoid using 'character class
> subtraction' (e.g. '[a-z-[aeiou]]').  They SHOULD avoid matching
> Unicode properties and blocks (e.g. '\p{L}' or '\p{IsBasic_Latin}').
> They MAY use the '\d', '\w', '\s' character class shorthands and their
> negated versions, where appropriate, but SHOULD avoid other character
> class shorthands.  To match ASCII digits 0-9 the character class
> '[0-9]' MUST be used instead of the '\d' character class shorthand
> that also matches Unicode digits in all scripts.
> 
> Pattern statements do not have to strictly restrict numerical values,
> and a simple less specific pattern may be preferable over a more
> complex and precise pattern, e.g. as illustrated in the
> 'ipv4-address-no-zone' example pattern below.
> 
> The following typedef from ^RFC6991^ demonstrates the proper
> use of the "pattern" statement:
> 
>      typedef ipv4-address-no-zone {
>        type inet:ipv4-address {
>          pattern '[0-9\.]*';
>        }
>        ...
>      }
> 
> For string data types, if the length of the string
> is required to be bounded in all implementations,
> then a length statement MUST be present.
> 
> The following typedef from ^RFC6991^ demonstrates the proper
> use of the "length" statement:
> 
>      typedef yang-identifier {
>        type string {
>          length "1..max";
>          pattern '[a-zA-Z_][a-zA-Z0-9\-_.]*';
>          pattern '.|..|[^xX].*|.[^mM].*|..[^lL].*';
>        }
>        ...
>      }
> 
> For numeric data types, if the values allowed
> by the intended semantics are different than
> those allowed by the unbounded intrinsic data
> type (e.g., 'int32'), then a range statement SHOULD be present.
> 
> The following typedef from ^RFC6991^ demonstrates the proper
> use of the "range" statement:
> 
>      typedef dscp {
>        type uint8 {
>           range "0..63";
>        }
>        ...
>      }
> 
> Thanks,
> Rob
> 
> 
> On 05/09/2017 22:37, Lou Berger wrote:
> > Rob,
> > 
> > (as chair)
> > On 9/5/2017 1:17 PM, Robert Wilton wrote:
> > > However, I have thrown in the towel on my regex crusade.
> > 
> > I'm sorry, I've lost the thread here a bit. In order to gauge consensus
> > on this topic, it would be helpful to send the latest text that you are
> > proposing for inclusion in the bis.  If you are willing to do this,
> > we can poll to see if there is/is not support for inclusion of this
> > text.  Are you willing, i.e., can you send the current proposed text change?
> > 
> > Thank you,
> > Lou
> > 
> > .
> > 
> 
> 
-- 
Ladislav Lhotka
Head, CZ.NIC Labs
PGP Key ID: 0xB8F92B08A9F76C67