Re: [Cbor] Regular expressions

Carsten Bormann <cabo@tzi.org> Sun, 28 February 2021 20:29 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <4665BD99-C64E-41B4-9FD0-547175B33D9A@cursive.net>
Date: Sun, 28 Feb 2021 21:29:07 +0100
Cc: cbor@ietf.org
Message-Id: <B79CC250-9E89-41B4-8136-B9AC96422962@tzi.org>
References: <4665BD99-C64E-41B4-9FD0-547175B33D9A@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/pQ6yiCdsJ6GAuoHxEso37kFPxY8>

On 2021-02-28, at 20:03, Joe Hildebrand <hildjj@cursive.net> wrote:
> 
> I know that rfc8949 dropped Regular Expressions as tag 35, but they're still defined and usable from rfc7049.

Right.

> That said, what, if anything, are people doing with the regex flags?  I wonder if all this time I should have been encoding
> 
> /foo/g
> 
> As 
> 
> 35("/foo/g")
> 
> Instead of 
> 
> 35("foo")

The intention certainly was to give the RE, not a RE literal.
So the slashes and the g would become part of the RE, and

35("/foo/g") = RegExp("/foo/g") = /\/foo\/g/
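
A minimal sketch of the same effect, using Python's `re` module purely as a stand-in for a regex constructor (an assumption for illustration; tag 35 itself refers to PCRE or ECMAScript syntax, not Python's). Once the text of the literal is handed over as the RE, the slashes and the `g` are ordinary characters to be matched:

```python
import re

# The text of the JavaScript literal /foo/g, passed as if it were the RE.
# Python's re is only a stand-in here, not what tag 35 specifies.
literal_text = "/foo/g"
as_pattern = re.compile(literal_text)

# The slashes and the trailing "g" are now literal characters.
matches_plain_foo = as_pattern.search("some foo here") is not None
matches_slashed = as_pattern.search("path/foo/g") is not None

# The bare RE is what the tag content was meant to carry.
bare = re.compile("foo")
bare_matches_plain_foo = bare.search("some foo here") is not None

print(matches_plain_foo, matches_slashed, bare_matches_plain_foo)
```

So the pattern built from the literal's text no longer finds a plain "foo" at all, while the bare RE does.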

> (Losing the g flag).  I see from https://mailarchive.ietf.org/arch/msg/cbor/txIKHMXRFzNo7oH-eHZigO1L47w/ that Carsten has previously also assumed that we don't transport the slashes.

Right.

> If including the slashes won't interop, and if I'm not the only one that's implemented tag 35, I'd be happy to whip up a new doc to describe this, and register a higher tag number for it.  If anyone is interested, we can have a discussion about
> 
> 35("/foo/g") vs. 35("foo", "g")

I can’t speak about the g flag (it is not actually an RE modifier), but, e.g. for the literal

/foo/i 

its value in PCRE is 

(?i)foo
or
(?i:foo)

so there is no need to carry the modifiers as flags outside the RE.
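
The conversion from a literal to an RE with inline modifiers could be sketched roughly as follows. `literal_to_inline` and `INLINE_FLAGS` are hypothetical names made up for this example, the flag set is an assumption, and Python's `re` is used only because it accepts the same `(?i)` and `(?i:...)` syntax; it is not PCRE. Flags without an inline form, such as `g`, are rejected rather than guessed at:

```python
import re

# Assumed set of flags that have a same-letter inline modifier.
INLINE_FLAGS = set("imsx")


def literal_to_inline(literal: str) -> str:
    """Turn '/body/flags' into '(?flags)body' (hypothetical helper)."""
    if not literal.startswith("/") or "/" not in literal[1:]:
        raise ValueError("not a /body/flags literal")
    # Split at the last slash: everything after it is the flag string.
    body, _, flags = literal[1:].rpartition("/")
    unsupported = set(flags) - INLINE_FLAGS
    if unsupported:
        raise ValueError(f"no inline form for flags: {sorted(unsupported)}")
    return f"(?{flags}){body}" if flags else body


inline = literal_to_inline("/foo/i")

# Both inline spellings behave like the flag given outside the RE.
candidates = [inline, "(?i:foo)"]
results = [bool(re.search(p, "say FOO")) for p in candidates]
flag_outside = bool(re.search("foo", "say FOO", re.IGNORECASE))

print(inline, results, flag_outside)
```

The sketch does not try to handle every escaping corner case of RE literals; it only shows that the modifier can travel inside the RE string.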

> And maybe even some sort of info about what kind of regex it is (ECMAScript vs. PCRE, for example).

Definitely.  And for non-PCRE, there may be a need to carry flags outside.
ECMAScript (JavaScript) and PCRE are quite close, but there are other families as well:
RFC 8610 uses W3C XML Schema (XSD)-style regexes, as does YANG (at least officially, not so much in practice).  These are anchored (good) and support subtraction (exceedingly good), but are stuck with Unicode complexity in \d and \w, and even in \s, so those escapes are not useful in ASCII protocols.
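
The three properties mentioned above can be illustrated with a small sketch. Python's `re` is again only a stand-in: `fullmatch()` imitates the implicit anchoring, Python's Unicode-aware `\d` is similar in spirit to (but not defined identically with) the XSD one, and the lookahead is merely an emulation of XSD character class subtraction, which Python's `re` does not offer directly.

```python
import re

# Anchored flavours must match the whole string; fullmatch() mimics that,
# while search() shows the unanchored behaviour of most other flavours.
anchored = re.fullmatch(r"[0-9]+", "abc123def") is not None
unanchored = re.search(r"[0-9]+", "abc123def") is not None

# In Python str patterns, \d follows Unicode by default, so it accepts
# digits outside ASCII: ARABIC-INDIC DIGIT THREE and FOUR here.
arabic_indic = "\u0663\u0664"
unicode_d = re.fullmatch(r"\d+", arabic_indic) is not None
ascii_d = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None
explicit_range = re.fullmatch(r"[0-9]+", arabic_indic) is not None

# XSD subtraction such as [a-z-[aeiou]] is emulated with a negative
# lookahead: a lowercase ASCII letter that is not a vowel.
consonants = re.compile(r"(?:(?![aeiou])[a-z])+")
only_consonants = consonants.fullmatch("xyz") is not None
with_vowel = consonants.fullmatch("xaz") is not None

print(anchored, unanchored)
print(unicode_d, ascii_d, explicit_range)
print(only_consonants, with_vowel)
```

For an ASCII protocol, spelling out `[0-9]` avoids the question of what `\d` means in a given flavour.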

> I assume a lot of folks are in the "regexes are too hard to interop" camp, in which case I'll take a nice high tag number and everyone else can ignore it.

They are hard, but not “too hard” for a large number of applications.
Enjoy https://www.regular-expressions.info/refmodifiers.html
just for the modifiers :-)
Flags such as g in JavaScript or u and n in Ruby are much harder to express inside the RE.
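
One way to see why `g` resists being written into the RE: it selects how the matching operation is applied, not what the pattern matches. A rough Python analogue (an illustration only; Python has no `g` flag, and the mapping to `count` and `finditer` is my assumption about the closest equivalent) keeps the pattern identical and varies only the call:

```python
import re

pattern = re.compile("o")
text = "foo boo"

# "Non-global" behaviour: only the first match is replaced.
first_only = pattern.sub("0", text, count=1)

# "Global" behaviour: every match is replaced.
all_matches = pattern.sub("0", text)

# The same choice when iterating: one match versus all of them.
first_position = pattern.search(text).start()
all_positions = [m.start() for m in pattern.finditer(text)]

print(first_only, all_matches, first_position, all_positions)
```

The pattern string is the same in every call, so nothing inside it could express the difference.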

(But I still prefer ABNF :-)

Grüße, Carsten