Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Robert Wilton <rwilton@cisco.com> Wed, 30 August 2017 09:16 UTC

To: Andy Bierman <andy@yumaworks.com>
Cc: Xufeng Liu <Xufeng_Liu@jabil.com>, Per Hedeland <per@tail-f.com>, Ladislav Lhotka <lhotka@nic.cz>, "netmod@ietf.org" <netmod@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/netmod/cbpbSvPgvFX73OPDwcSgwDxIyoA>

Hi Andy,

What I am suggesting makes it easier for readers, because I am a 
proponent of simpler regular expressions that are easy to read and 
understand.

For example, I wonder how many YANG model readers would immediately 
comprehend what this pattern statement means:

    pattern "\p{Sc}\p{Zs}?\p{Nd}+\.\p{Nd}{2}"

Does allowing such patterns really make it easier for model readers?
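
For anyone who does not have the Unicode category names memorised, the
pattern reads as: one currency symbol (Sc), an optional space separator
(Zs), one or more decimal digits (Nd), a literal dot, then exactly two
decimal digits. Below is a rough hand-written Python equivalent, only to
spell that out. The function name is made up for illustration, and the
sketch assumes the whole string must match, as it must for a YANG pattern.

```python
import unicodedata


def matches_currency_amount(text: str) -> bool:
    r"""Hand-written equivalent of \p{Sc}\p{Zs}?\p{Nd}+\.\p{Nd}{2}."""
    cat = unicodedata.category
    i = 0

    # \p{Sc}: exactly one currency symbol, e.g. "$" or U+20AC.
    if i >= len(text) or cat(text[i]) != "Sc":
        return False
    i += 1

    # \p{Zs}?: an optional space separator.
    if i < len(text) and cat(text[i]) == "Zs":
        i += 1

    # \p{Nd}+: one or more decimal digits, in any script.
    start = i
    while i < len(text) and cat(text[i]) == "Nd":
        i += 1
    if i == start:
        return False

    # \.: a literal dot.
    if i >= len(text) or text[i] != ".":
        return False
    i += 1

    # \p{Nd}{2}: exactly two decimal digits, then end of string.
    rest = text[i:]
    return len(rest) == 2 and all(cat(c) == "Nd" for c in rest)


print(matches_currency_amount("$ 12.50"))  # True
print(matches_currency_amount("$12.5"))    # False
```

Note that the digits need not be ASCII: Arabic-Indic digits, for
instance, are also in category Nd and so are accepted.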

The proposed guidelines obviously make it easier (or at least no harder) 
for tool makers.

I agree that there is a minor impact on model writers, but really only 
in the sense that the guidelines would be telling them not to use the 
esoteric options of the XML regex syntax that they probably don't know 
about anyway.

If putting this explicitly in the YANG author guidelines is not liked, 
then another possible option could be a softer recommendation in the 
guidelines RFC, with some more explicit examples of things to avoid on a 
YANG FAQ wiki page.

Thanks,
Rob


On 29/08/2017 17:15, Andy Bierman wrote:
> Hi,
>
> I agree with Juergen that these proposed guidelines are not a good idea.
> The priority order for YANG is (1) readers (2) writers and (3) toolmakers.
> It seems trivial for group (3) to convert the XSD pattern to some 
> other format.
> It seems difficult to train all the people in groups (1) and (2) that 
> there are lots of
> special new rules to learn.
>
>
> Andy
>
>
> On Tue, Aug 29, 2017 at 7:27 AM, Robert Wilton <rwilton@cisco.com> wrote:
>
>
>     On 28/08/2017 16:46, Juergen Schoenwaelder wrote:
>>     On Mon, Aug 28, 2017 at 12:58:59PM +0000, Xufeng Liu wrote:
>>>     [Xufeng] [0..9] is still compliant with the XSD pattern specified by
>>>     YANG 1.0 and 1.1. Using [0..9] instead of [\d] will make the
>>>     implementations with native POSIX RegEx easier without the need for
>>>     a tool to inspect every element of the RegEx pattern.
>>     Yes, but then \d is legal in YANG (and it is used in a couple of
>>     published RFCs).
>     I entirely agree that YANG regular expressions must be legal XML
>     Schema regular expressions.
>
>     However, I don't think that the majority of YANG implementations
>     are going to want to either use libxml or write their own
>     implementation of the XML RE language.  Instead it is desirable
>     that they can use whatever standard regex implementation comes
>     with their language, or is readily available in a library.
>
>     Most of the pattern statements I see in YANG modules use a basic
>     subset of regular expressions, and hence it looks like they can
>     often be used by most RE engines, perhaps with some trivial tweaks
>     or conversions.  However, there is no formal guidance recommending
>     that pattern statements in standard modules are restricted to a
>     subset of XML RE.
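
As an illustration of the "trivial tweaks or conversions" mentioned
above, here is a minimal sketch of mapping a basic XSD pattern onto
Python's standard re module. It is only a sketch under assumptions: the
function name is hypothetical, and it simply refuses the constructs that
re has no direct equivalent for (\i, \c, \p{...} and character class
subtraction). The two tweaks it does make are that XSD patterns are
implicitly anchored (hence fullmatch) and that "^" and "$" are ordinary
characters outside a character class in XSD.

```python
import re


def xsd_subset_to_python(pattern: str) -> "re.Pattern[str]":
    """Translate a basic XSD pattern to a compiled Python regex (sketch)."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # No direct equivalent in the stdlib re module.
            if nxt in "icICpP":
                raise ValueError(f"unsupported XSD escape \\{nxt}")
            out.append(ch + nxt)
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            elif ch == "[":
                # An unescaped "[" inside a class is class subtraction.
                raise ValueError("character class subtraction unsupported")
        elif ch == "[":
            in_class = True
        elif ch in "^$":
            # Literal characters in XSD, anchors in Python: escape them.
            out.append("\\" + ch)
            i += 1
            continue
        out.append(ch)
        i += 1
    return re.compile("".join(out))


dotted_quad = xsd_subset_to_python(r"\d{1,3}(\.\d{1,3}){3}")
# XSD patterns must match the whole value, so use fullmatch.
print(bool(dotted_quad.fullmatch("192.0.2.1")))  # True
print(bool(dotted_quad.fullmatch("192.0.2")))    # False
```

One caveat: the shorthand escapes are passed through unchanged, but they
are not guaranteed to mean exactly the same thing. For example, XSD \s
covers only space, tab, newline and carriage return, and XSD \w is
defined differently from Python's \w, so a real converter would need to
decide how to handle those.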
>
>     Hence, ideally I would like 6087bis to state that pattern
>     statements SHOULD also conform to the following additional RE
>     syntax restrictions, which I think should make them easy to
>     convert to most other standard regex implementations (subject to
>     unicode support limitations):
>
>     (1) Allow \d, \D, \s, \S, \w and \W; but not inside character classes.
>     (2) Disallow \i, \c; and their negative equivalents.
>     (3) Disallow character class subtraction (e.g. "[A-Z-[RW]]").
>     (4) Limit the supported unicode categories to only the following
>     8.  Both \p and \P syntax is supported, but not inside character
>     classes:
>       \p{L}: any kind of letter from any language.
>       \p{Ll}: a lowercase letter that has an uppercase variant.
>       \p{Lu}: an uppercase letter that has a lowercase variant.
>       \p{Z}: any kind of whitespace or invisible separator.
>       \p{Zs}: a whitespace character that is invisible, but does take
>     up space.
>       \p{Zl}: the line separator character U+2028.
>       \p{N}: any kind of numeric character in any script.
>       \p{Nd}: a digit zero through nine in any script except ideographic
>     scripts.
>     (5) Disallow matching of unicode blocks.
>
>     Thanks,
>     Rob
>
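
The five restrictions listed above are mechanical enough that a tool
could check them. Below is a rough Python sketch of such a check. The
function name and message texts are invented for illustration, and it is
a simple scanner rather than a full XSD regex parser, so it assumes the
pattern is already a valid XSD regular expression.

```python
from typing import List

# Rule (4): the only Unicode categories the proposal would allow.
ALLOWED_CATEGORIES = {"L", "Ll", "Lu", "Z", "Zs", "Zl", "N", "Nd"}


def check_pattern(pattern: str) -> List[str]:
    """Return a list of violations of the proposed restrictions."""
    problems = []
    depth = 0  # nesting depth of [...] character classes
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            esc = pattern[i + 1]
            i += 2
            if esc in "dDsSwW" and depth:
                problems.append(f"(1) \\{esc} inside a character class")
            elif esc in "iIcC":
                problems.append(f"(2) \\{esc} is disallowed")
            elif esc in "pP":
                end = pattern.find("}", i)
                if pattern[i:i + 1] != "{" or end == -1:
                    problems.append(f"malformed \\{esc} escape")
                    continue
                name = pattern[i + 1:end]
                i = end + 1
                if name.startswith("Is"):
                    problems.append(f"(5) unicode block {name}")
                elif name not in ALLOWED_CATEGORIES:
                    problems.append(f"(4) category {name} not in subset")
                elif depth:
                    problems.append(f"(4) \\{esc} inside a character class")
            continue
        if ch == "[":
            if depth:
                # A nested "[" only occurs in class subtraction.
                problems.append("(3) character class subtraction")
            depth += 1
        elif ch == "]" and depth:
            depth -= 1
        i += 1
    return problems


print(check_pattern(r"\d+(\.\d+)?"))   # []
print(check_pattern("[A-Z-[RW]]"))     # one rule (3) violation
```

Run against the currency pattern from earlier in this mail, it reports
only \p{Sc}, since Zs and Nd are both in the allowed subset.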
>
>>     Educating _all_ module authors to write [0..9] instead of \d will
>>     likely be more expensive than improving the code of implementations
>>     that did not implement YANG entirely to accept \d.
>>
>>     /js
>>
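
A side note on \d versus an explicit digit class, since the two are not
strictly interchangeable. In XSD, \d is any Unicode decimal digit
(equivalent to \p{Nd}), whereas [0-9] is ASCII only. Also, read
literally as a regex, "[0..9]" is a class containing just "0", "." and
"9"; I assume the range [0-9] is what is meant in the thread. The small
Python sketch below shows the difference, relying on the fact that
Python's \d on str patterns also matches any Unicode decimal digit. The
helper name is made up for illustration.

```python
import re


def accepts(pattern: str, value: str) -> bool:
    """Whole-string match, as required for a YANG pattern."""
    return re.fullmatch(pattern, value) is not None


ARABIC_INDIC_THREE = "\u0663"  # a decimal digit (category Nd), not ASCII

print(accepts(r"\d", ARABIC_INDIC_THREE))     # True
print(accepts(r"[0-9]", ARABIC_INDIC_THREE))  # False
print(accepts(r"[0..9]", "5"))                # False
print(accepts(r"[0..9]", "."))                # True
```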