Re: [netmod] Potential additions to rfc6087bis: RegEx guidelines

Robert Wilton <rwilton@cisco.com> Tue, 05 September 2017 17:17 UTC

To: Ladislav Lhotka <lhotka@nic.cz>, netmod@ietf.org
References: <f7151a6b-9deb-52ad-62a9-78b29a552540@cisco.com> <20170830102902.2n5q6rgq2x2dxfq2@elstar.local> <e8482a9c-cba3-28e2-9ffa-ec5eb5c1c0a4@cisco.com> <20170830123156.cssrg5kklpo67fie@elstar.local> <CABCOCHTtN611FO2ov2kTLtZx-Q3=tzgH7Xk9uGvFUD1WuyMZyw@mail.gmail.com> <b13c5e9a-e9f9-96e9-8823-0402fb74af09@cisco.com> <1504223854014.55228@Aviatnet.com> <847e5bf9-7b3d-9ff8-9954-970f32a2094c@cisco.com> <20170902073342.xoziwor4tdr5bipw@elstar.local> <D5D00209.C5C67%acee@cisco.com> <20170902112832.ymorfgdthobeio6q@elstar.local> <CABCOCHTC2MhBu0Zu44Z=f+J04HiENjQR+J0Sxy-arjcDmBHb_A@mail.gmail.com> <1e95ba5d-7aa2-e08f-56f9-27aa70822a11@cisco.com> <1504537140.5874.38.camel@nic.cz> <f0ddf7bd-c249-389f-e34b-0b901697307e@cisco.com> <1504629352.7175.40.camel@nic.cz>
From: Robert Wilton <rwilton@cisco.com>
Message-ID: <8af6041d-7cd5-9608-70b4-7cffc4f884f8@cisco.com>
Date: Tue, 05 Sep 2017 18:17:09 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
MIME-Version: 1.0
In-Reply-To: <1504629352.7175.40.camel@nic.cz>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/netmod/fesAjADeysgI1bRiULKv4289Mfg>


On 05/09/2017 17:35, Ladislav Lhotka wrote:
> Robert Wilton wrote on Mon 04. 09. 2017 at 17:07 +0100:
>> Hi Lada,
>>
>> On 04/09/2017 15:59, Ladislav Lhotka wrote:
>>> Robert Wilton wrote on Mon 04. 09. 2017 at 15:05 +0100:
>>>> Hi Andy,
>>>>
>>>> On 02/09/2017 17:46, Andy Bierman wrote:
>>>>> On Sat, Sep 2, 2017 at 4:28 AM, Juergen Schoenwaelder
>>>>> <j.schoenwaelder@jacobs-university.de> wrote:
>>>>>> On Sat, Sep 02, 2017 at 10:39:57AM +0000, Acee Lindem (acee) wrote:
>>>>>>> This is not an effort to change or bifurcate the YANG 1.1. It is
>>>>>>> simply to
>>>>>>> RECOMMEND a proper subset of XSD pattern that is more portable.
>>>>>>>
>>>>>> If you implement YANG as it is defined, patterns are portable. Given
>>>>>> this, I do not understand the notion of 'more portable'.
>>>>>>
>>>>>> Anyway, it seems that those who want a more portable subset do not
>>>>>> even agree on what that subset is. Perhaps people pushing for this
>>>>>> should go and write an I-D that explains why a 'more portable' subset
>>>>>> is needed (which problems are we fixing), that defines such a 'more
>>>>>> portable subset', and which includes the reasoning how the subset has
>>>>>> been determined.
>>>>>>
>>>>>>
>>>>> I do not agree that the YANG pattern contains a string that is both a
>>>>> POSIX and XSD regular expression.
>>>>> The RFC is very clear it contains an XSD expression. Pretending it is
>>>>> both is a hack that does not even seem
>>>>> to work 100%, so it is not reliable.
>>>>    I am not suggesting that the YANG pattern is both a POSIX and XSD
>>>> regular expression.
>>>>
>>>> I am only suggesting that the guidelines recommend that authors use a
>>>> subset of XSD, to make it easier to programmatically *convert* the 'XSD
>>>> subset compliant regular expression' into a functionally equivalent
>>>> regular expression for whatever regular expression engine the tooling
>>>> decides to use.
>>> And that's the point, I think: each developer needs to get a library
>>> function so as to translate the XSD pattern into a native regex of
>>> whatever programming language he/she is currently using. So I guess what
>>> we really need is to identify libraries for common languages that do it
>>> correctly - or write simple translators ourselves if none is available.
>> Yes, exactly.
>>
>> Looking at http://www.regular-expressions.info/ then XML RE does look
>> like a good standard choice of RE language for YANG pattern statements
>> because it is generally one of the most basic RE languages, and hence it
>> should be feasible to convert an XML RE into a form usable by most RE
>> languages.
> Yes, and the XSD RE language was also designed for pretty much the same purpose
> (data type system).
>
>> But converting some parts of the XML RE syntax would probably be laborious:
> Unicode support is of course hairy but since YANG permits it in the string type
> it makes sense that the pattern language follow suit.
>
> RE flavours used in modern programming languages support Unicode, so the
> translation should be doable (if it hasn't been done yet).
Yes. POSIX extended regex (the one proposed by OpenConfig) is the odd
one out here because it doesn't support Unicode.

Still, I haven't seen any standards-based RFC or IETF draft YANG models
that need to match either Unicode properties or blocks.  The IPv4/v6
address types with a zone index use Unicode properties, but I suspect
that '\w' would have been sufficient.

>
>> 1) E.g. the unicode property '\p{Nd}' that is equivalent to '\d' matches
>> 590 characters
>> (http://www.fileformat.info/info/unicode/category/Nd/list.htm). There
>> are approx 32 unicode properties, presumably these could also be
>> extended over time as well.
>> 2) There are currently 105 unicode blocks, where each block is a
>> discrete range of characters (e.g. \p{IsTibetan}: U+0F00–U+0FFF)
>> 3) Handling the character class subtraction is also possible, but
>> probably tedious to implement, since it requires the translation to
>> fully understand the set of characters in the character class so it can
>> form an equivalent character class without any subtractions.
> But now with the "invert-match" modifier in YANG 1.1, implementations have to be
> able to perform such set differences anyway, right?
No.  Character class subtraction applies to a single character class 
match in the expression.  The "invert-match" applies to the whole regex 
check.  The same regex check can be performed and the boolean result 
reversed.
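
As a rough illustration of that last point (a sketch only, not taken
from any existing tool): the function name check_pattern is made up, and
it assumes the pattern has already been translated into Python "re"
syntax.  The regex itself is untouched; only the boolean is flipped.

```python
import re


def check_pattern(value: str, pattern: str, invert_match: bool = False) -> bool:
    """Return True if value satisfies a YANG-style pattern restriction.

    Hypothetical helper: assumes 'pattern' is already valid Python 're' syntax.
    """
    # XSD patterns are implicitly anchored at both ends, hence fullmatch().
    matched = re.fullmatch(pattern, value) is not None
    # invert-match reverses the overall result; the regex is not rewritten.
    return not matched if invert_match else matched
```

So check_pattern("abc", "[a-z]+") and check_pattern("123", "[a-z]+",
invert_match=True) both accept the value, using the same compiled regex.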

>
>> These were the three parts of the XML RE that I was hoping to discourage
>> in the YANG author guidelines so that performing a translation is much
>> easier.  Spotting these 3 parts in the regex should be simple, so the
>> translation would still be robust, even if not complete.
> I believe that tools intended for general use should follow the YANG spec
> literally.
I don't fully agree.  I think that they only need to cover the parts of
the YANG spec that the models they are using (or might use) depend on.
If nobody uses Unicode blocks then it doesn't really matter whether a
given tool supports them or not.  It is always possible to document the
limitation and add support for the missing bits later.  E.g. if I were
writing a bespoke XPath implementation for YANG then there is probably
quite a lot of the XPath spec that I would leave out, concentrating
instead on the parts that people actually use, or are likely to use.
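
To show what "spotting" the three constructs could look like, here is a
minimal sketch.  The function name find_unsupported and the labels it
returns are invented for illustration, and it assumes the input is a
syntactically valid XSD pattern in which block escapes use the "Is"
prefix.  It does not translate anything; it only reports what it sees.

```python
def find_unsupported(pattern: str) -> set:
    """Report XSD regex constructs that a simple translator might not handle.

    Hypothetical helper; the returned labels are made up for this sketch.
    """
    found = set()
    depth = 0      # nesting depth of character classes
    prev = None    # previous unescaped character, if any
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # \p{...} or \P{...}: Unicode property or block escape.
            if pattern[i + 1:i + 3] in ("p{", "P{"):
                end = pattern.find("}", i)
                name = pattern[i + 3:end] if end != -1 else ""
                if name.startswith("Is"):
                    found.add("unicode-block")
                else:
                    found.add("unicode-property")
            prev = None
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            # An unescaped "-[" inside a class starts a subtraction.
            if depth > 0 and prev == "-":
                found.add("class-subtraction")
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        prev = ch
        i += 1
    return found
```

A tool could call this before translation and either reject the pattern
or log a caveat when the returned set is non-empty.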

>
>> There are other conversions that may also need to be performed
>> (depending on the target RE engine):
>> 1) Character class shorthands (e.g. \d, \w) need to be converted to
>> represent the Unicode set equivalent, since for a lot of engines they
>> only match ASCII characters.  For '\s' it must match ASCII whitespace only.
> I think they should mean exactly what XSD spec says they mean.
Yes.  I agree.  I'm only listing the conversions that are likely to be
necessary to convert an XSD RE into one of the other standard RE implementations.
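
For example, with Python's "re" module as the target, '\s' on a str
pattern matches all Unicode whitespace, so it would need narrowing to
the four characters mentioned above.  The sketch below does only that
one conversion; translate_space_shorthand and XSD_SPACE are names I have
made up, and '\S' inside a character class is deliberately left
unsupported.

```python
import re

# The whitespace characters assumed for '\s' here: space, tab, LF, CR.
XSD_SPACE = r" \t\n\r"


def translate_space_shorthand(pattern: str) -> str:
    """Rewrite '\\s' and '\\S' into explicit sets (hypothetical helper)."""
    out = []
    depth = 0  # non-zero while inside a character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            esc = pattern[i + 1]
            if esc == "s":
                # Inside a class, splice the members in; outside, add brackets.
                out.append(XSD_SPACE if depth else "[" + XSD_SPACE + "]")
            elif esc == "S" and depth == 0:
                out.append("[^" + XSD_SPACE + "]")
            elif esc == "S":
                raise NotImplementedError(r"\S inside a character class")
            else:
                out.append(ch + esc)  # leave every other escape untouched
            i += 2
            continue
        if ch == "[":
            depth += 1
        elif ch == "]" and depth:
            depth -= 1
        out.append(ch)
        i += 1
    return "".join(out)
```

With this, r"a\sb" becomes r"a[ \t\n\r]b", which no longer accepts an
EM SPACE (U+2003) between "a" and "b".  Note that '\d' needs no
conversion for this particular target, since Python 3 already matches
every Unicode decimal digit for str patterns; '\w' is not handled here
at all.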

>
>> 2) If the engine supports greedy alternation (e.g. POSIX basic/extended
>> regex), then alternations need to be converted to an eager form if required.
> Yes, and this is a subtle point that could otherwise be easily overlooked.
>
>> 3) The syntax for escaping characters seems to differ in XML RE from
>> other common languages.
>> 4) Linebreak match handling seems to differ.
> 3 and 4 are IMO not a big deal.
But they do matter if we want to avoid "Tools and libraries would then
differ in the degree of sloppiness and possibly give different results,
which is not good."

>
>> These conversions would need to be done regardless, but would seem to be
>> much quicker/simpler to implement than the ones above.
> Tools and libraries would then differ in the degree of sloppiness and possibly
> give different results, which is not good.
Sorry, I don't get this last point.

However, I have thrown in the towel on my regex crusade.

But I still suspect that most model authors/readers are likely to get 
the usage of '\d' wrong ...  but perhaps it doesn't really matter.

Thanks,
Rob


>
> Lada
>
>> Thanks,
>> Rob
>>
>>
>>>> E.g. this seems to be the approach used by "libyang" that uses libpcre as
>>>> the backend RE library rather than libxml.  Unfortunately, I think that
>>>> the libyang library would currently fail if the pattern statement
>>>> contained "[[A-Z]-[P-R]]" because it looks like the PCRE2 language does
>>>> not support character class subtraction.  AFAICT, no standard YANG modules
>>>> currently use character class subtraction, so the authors of libyang
>>>> have a choice here:
>>> Note that your example is incorrect, it should be [A-Z-[P-R]]. FWIW, Python
>>> module PyXB (that I used in Yangson library) does support this.
>>>
>>> Lada
>>>
>>>>     (i) write a block of code that most likely nobody is going to use, or
>>>>     (ii) document the limitation, spot character class subtraction in the
>>>> regex, and flag that it is not supported (or perhaps just ignore it).
>>>>
>>>>
>>>>> If the community wants to support both XSD and POSIX expressions, then
>>>>> the proper engineering
>>>>> solution is to introduce a new statement that is defined to contain a
>>>>> POSIX expression.
>>>>> This can be done with a YANG extension now and added to YANG 2.0 later.
>>>>    I think that this is an inferior solution:
>>>> - there are many languages that YANG tools could be written in: C/C++,
>>>> Python, Java, Go, Rust, Javascript are all reasonably plausible choices.
>>>> - they all have similar regular expression flavours, but with small
>>>> differences (according to http://www.regular-expressions.info/reference.html).
>>>> - Personally, I see no inherent advantage of the POSIX Extended Regex over
>>>> XML RE.   In fact, given that it doesn't support Unicode at all, it would
>>>> seem to be a somewhat strange choice for a second pattern statement.
>>>> - Nor does it seem pragmatic to introduce lots of different flavors of
>>>> pattern statements into YANG each supporting a different regex syntax.
>>>>
>>>> I also don't like the solution that every YANG tool maker has to either
>>>> link against libxml2,  or write their own efficient regular expression
>>>> engine.  I'm not convinced that what the world needs is yet more regular
>>>> expression implementations :-)
>>>>
>>>> So, I still see the better technical solution as always defining the
>>>> pattern statements only in the XML RE language, but strongly encouraging
>>>> folks to use a subset of that language for standards models (which they
>>>> appear to be doing anyway) to make it easier to convert the regular
>>>> expression into compatible versions for other engines.
>>>>
>>>> Thanks,
>>>> Rob
>>>>
>>>>
>>>>>    
>>>>>> /js
>>>>>>
>>>>> Andy
>>>>>    
>>>>>> --
>>>>>> Juergen Schoenwaelder           Jacobs University Bremen gGmbH
>>>>>> Phone: +49 421 200 3587         Campus Ring 1 | 28759 Bremen | Germany
>>>>>> Fax:   +49 421 200 3103         <http://www.jacobs-university.de/>
>>>>>>
>>>>>> _______________________________________________
>>>>>> netmod mailing list
>>>>>> netmod@ietf.org
>>>>>> https://www.ietf.org/mailman/listinfo/netmod
>>>>>>
>>>>    
>>