Re: [Cbor] CDDL: Solving the regexp issue

Sean Leonard <dev+ietf@seantek.com> Tue, 20 March 2018 13:22 UTC

To: cbor@ietf.org
References: <276C1041-F1C9-426B-BAF6-89D2BFC3A8F8@tzi.org>
From: Sean Leonard <dev+ietf@seantek.com>
Message-ID: <5c3d0c5d-b0cd-4672-f037-fd181c4d9134@seantek.com>
Date: Tue, 20 Mar 2018 06:22:14 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0
MIME-Version: 1.0
In-Reply-To: <276C1041-F1C9-426B-BAF6-89D2BFC3A8F8@tzi.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/QIB-0r4l8Xw2sszWWDIelba5Clo>
Subject: Re: [Cbor] CDDL: Solving the regexp issue

I have been meaning to respond to this; sorry for the long lag time.

I disagree strongly with the dependency on XSD Regular Expressions.
The primary technical argument is that XSD REs incorporate the 
definition of "XmlChar" (see also "normal character"), aka XML character, 
which is extraordinarily convoluted: it includes certain ranges of 
Unicode code points, yet does not actually include all Unicode code 
points. Another such instance of Unicode accretion is "IsBlock", which 
allows regular expressions to refer to the blocks listed in the Unicode 
Blocks.txt file. That file, in combination with other Unicode source 
material, is not a reasonable dependency for smaller devices.
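
To make the "certain ranges, but not all code points" remark concrete, here is a minimal sketch. It assumes that "XML character" means the Char production of XML 1.0; the names XML_CHAR_RANGES and is_xml_char are made up for illustration and are not taken from any specification or library.

```python
# Sketch only: assumes "XML character" is the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# XML_CHAR_RANGES and is_xml_char are hypothetical names for illustration.
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point falls inside one of the Char ranges."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


# Code points that are valid Unicode scalar values or surrogates,
# yet fall outside the Char production.
excluded_samples = [cp for cp in (0x0, 0x8, 0xB, 0xD800, 0xFFFE, 0xFFFF)
                    if not is_xml_char(cp)]
```

Even this small check needs six ranges with gaps between them, before any of the Unicode block or category tables come into play.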

~~Proposal~~
My (tentative) proposal is to focus on POSIX Extended Regular 
Expressions instead:
http://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html
https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04

Technical arguments:
* POSIX EREs appear to be a strict subset of Perl regular expressions and 
PCRE technology. For example, POSIX character classes such as 
[[:digit:]] (equivalent to \d) are supported by PCRE. XSD does not 
support POSIX character classes.
* AFAICT, there are no droppings from Unicode of the kind that exist in 
XSD REs.
* It means that the text can still refer to PCRE as an expectation 
(which aligns with RFC 7049), while defining a normative minimum 
conformance requirement.
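
As a rough illustration of why POSIX character classes need no Unicode tables, the sketch below rewrites a few of them into plain ASCII ranges that Python's re module accepts (Python's re does not understand POSIX class names itself). It assumes the POSIX/C locale; the POSIX_CLASSES table and the posix_to_python helper are hypothetical names, and the textual substitution is deliberately naive rather than a real ERE parser.

```python
import re

# Sketch only: ASCII equivalents of some POSIX classes, assuming the C locale.
# POSIX_CLASSES and posix_to_python are hypothetical names for illustration.
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def posix_to_python(pattern: str) -> str:
    """Replace [:name:] inside bracket expressions with an ASCII range.

    Naive textual substitution: it does not parse the expression and
    raises KeyError for class names missing from POSIX_CLASSES.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda match: POSIX_CLASSES[match.group(1)],
                  pattern)


translated = posix_to_python("^[[:digit:]]+$")
is_number = re.match(translated, "2018") is not None
```

The whole table is a handful of fixed ASCII ranges, which is the kind of footprint a constrained device can afford.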

Non-technical arguments:
* There is a valid normative reference for it.
* POSIX ERE as a normative base doesn't actually change CDDL "v1.0" from 
"v0.x". Changing to XSD would actually change things, which (in my view) 
would then have to be changed back.

Furthermore, there was impetus at the November meeting to move syntax 
elements such as regular expressions out of quoted strings, making them 
first-class parts of the grammar. This promotes readability and lets 
syntax highlighters highlight and validate such parts.

So the text (taken from draft-greevenbosch-appsawg-cbor-cddl-11, prior 
to the XSD change) is proposed to read:

3.8.3.  Control operator .regexp

    A ".regexp" control indicates that the text string given as a target
    needs to match the regular expression given as a value in the
    control type. POSIX Extended Regular Expressions are always supported.
    PCRE and its full syntax are recommended as an implementation choice,
    since [RFC7049] itself specifies PCRE.

                  nai = tstr .regexp /^[[:alnum:]_]+@[[:alnum:]_]+(\.[[:alnum:]_]+)+$/

                    Figure 9: Control with a POSIX ERE / PCRE regexp

    The CDDL tool proposes:

                        "N1@CH57HF.4Znqe0.dYJRN.igjf"
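
A quick way to sanity-check the sample value is to run it against an ASCII rendering of the intended pattern. The sketch below assumes the intent is "one or more word characters, an @, then two or more dot-separated labels", writes the word class out as [A-Za-z0-9_], and escapes the dot; NAI and is_nai are hypothetical names, and Python's re module stands in for a POSIX ERE or PCRE engine.

```python
import re

# Sketch only: an ASCII rendering of the intended NAI pattern, with the
# word class written out and the dot escaped. NAI and is_nai are
# hypothetical names; Python's re stands in for an ERE/PCRE engine.
NAI = re.compile(r"^[A-Za-z0-9_]+@[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)+$")


def is_nai(text: str) -> bool:
    """Return True if the text string matches the NAI pattern above."""
    return NAI.match(text) is not None


sample_ok = is_nai("N1@CH57HF.4Znqe0.dYJRN.igjf")
```

Under these assumptions the generated sample matches, while a string with no "@" or with no dot after the "@" does not.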


Respectfully submitted,

Sean

On 1/24/2018 12:04 PM, Carsten Bormann wrote:
> At IETF 100, we had another discussion about regular expressions.
> I don’t think we went home with a clear idea of how to solve this problem.
> Giving up and just not doing regular expressions in the first version of CDDL certainly was one choice.
> But, since, there has been time to analyze the issue some more.
>
> Here is a proposal for addressing the regexp issue:
>
> https://cbor-wg.github.io/cddl/matching/draft-ietf-cbor-cddl.html#rfc.section.3.8.3
>
> Why use XSD REs?
>
> * There is a valid normative reference for them.
> * There is precedent for using XSD REs in IETF specifications, e.g. in YANG (RFC 7950).
> * They actually have been designed for the kind of use we are employing them for (albeit in an XML context).
> * There are no undesirable UTF-16 droppings in XSD REs, as there would have been in JavaScript REs.
>
> The solution proposed works quite well and allows us to have a reasonable .regexp in the 1.0 CDDL RFC.
> (It is just a tad less straightforward to implement than PCRE was.  Oh well.)
>
> We can always add more controls for other RE types later.  (This also provides a powerful incentive to continue coupling REs with the main extension point for the CDDL language: controls.)
>
> Grüße, Carsten