[Jsonpath] #70 Regexps (was: Re: Draft minutes, consensus points, and actions from IETF 112)

Carsten Bormann <cabo@tzi.org> Sun, 14 November 2021 21:14 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/jsonpath/8-oOOvcyeFfVlapQrAYMqja9YkA>

On 2021-11-14, at 12:35, James <james.ietf@gmail.com> wrote:
> 
> * #70 - Have discussion on list with known options

As a reminder, my slides had these options (renumbered for easier reference, and expanded a bit):

1. Select (define) one regular expression flavor
2. Provide a way to plug in regular expressions (of different flavors)
3. No regexps in base RFC (but keep an extension point)

Option 1 further splits into:

1a. Select *a version of* ECMAScript (parsing/searching RE)
1b. Select W3C XSD RE (matching RE)
1c. Build "modest subset" (e.g., iregexp)

Since we don’t have a consensus or even a majority among the implementations, we are free to do the right thing, if we can pull that off.

As you know, I have been exploring 1c.

I submitted an updated version -01 of draft-bormann-jsonpath-iregexp.
As in -00, I’m using W3C XSD RE as a base, as these are actual regular expressions, amenable to implementation techniques that are less susceptible to DoS problems than the Perl/PCRE/ECMAScript dialect.

Apart from character class subtraction and the exact semantics of Multi-Character escapes (\s \d \w etc., outside and inside of character class expressions), W3C XSD RE are pretty much a consensus subset of the various regular expression dialects, except that they are in the form of matching expressions (no anchors needed) instead of parsing expressions.  
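The difference between matching and parsing expressions can be sketched with Python's re module, used here only as a convenient stand-in (it is not one of the flavors under discussion). The pattern and helper names are hypothetical; fullmatch plays the role of a matching expression, search that of a parsing/searching expression.

```python
import re

PATTERN = r"[a-c]+"  # hypothetical example pattern


def matches_whole(pattern: str, text: str) -> bool:
    """Matching-expression semantics: the entire input must match."""
    return re.fullmatch(pattern, text) is not None


def found_anywhere(pattern: str, text: str) -> bool:
    """Parsing/searching semantics: a match anywhere in the input counts."""
    return re.search(pattern, text) is not None
```

With these helpers, "abc" satisfies both, while "xabcx" is only found by the searching variant; that gap is what the anchors close.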

I believe the spec should have conversion instructions for implementers that just want to use the regexp engine they happen to have handy.  Because of the consensus subset nature of W3C XSD RE, these instructions are relatively straightforward (copy, and surround by the anchors the target flavor happens to use).
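A minimal sketch of what such a conversion could look like, assuming the anchor pairs listed in the table below (they are illustrative and not quoted from the draft). The function name and the flavor keys are hypothetical, and the non-capturing group is an addition of this sketch so that the anchors apply to the whole expression rather than to the first and last alternative only.

```python
import re

# Assumed (prefix, suffix) pairs per target flavor; illustrative only.
ANCHORS = {
    "ecmascript": ("^(?:", ")$"),
    "pcre": (r"\A(?:", r")\z"),
    "python": (r"\A(?:", r")\Z"),
}


def to_searching_regexp(matching_re: str, flavor: str) -> str:
    """Copy the matching expression and surround it with the flavor's anchors."""
    prefix, suffix = ANCHORS[flavor]
    return prefix + matching_re + suffix


# Usage with Python's own engine: search now behaves like a full match.
anchored = to_searching_regexp("a|bc", "python")
whole = re.search(anchored, "bc") is not None
partial = re.search(anchored, "xbc") is not None
```

Without the group, a pattern such as "a|bc" would become "^a|bc$", which finds "a" at the start of any longer input.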

I tried to add conversion instructions for Multi-Character escapes (\s \d \w and \S \D \W, leaving out the \c and \i that nobody except W3C has).
As you can see when looking at the diff, the result is not pretty when it comes to double negation in character classes.  Maybe a pathological case, but a bit of a trap, and maybe not that useful anyway as the W3C semantics are quite different from the PCRE ones.
So I’m leaning towards a -02 that does not have Multi-Character escapes.
(None of the regexps I found in RFCs uses them, and not having them also happens to be pretty much what the json-schema.org drafts seem to converge on.)
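How much the meaning of a multi-character escape can vary is visible even within a single engine. The sketch below uses Python's re module purely as an illustration (it makes no claim about the exact W3C or PCRE definitions); the helper name is hypothetical.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a Unicode decimal digit outside ASCII


def is_single_digit(text: str, flags: int = 0) -> bool:
    r"""True if the whole input is one character matched by \d."""
    return re.fullmatch(r"\d", text, flags) is not None


unicode_mode = is_single_digit(ARABIC_INDIC_THREE)          # Unicode-aware \d
ascii_mode = is_single_digit(ARABIC_INDIC_THREE, re.ASCII)  # ASCII-only \d
explicit = re.fullmatch(r"[0-9]", ARABIC_INDIC_THREE) is not None
```

An explicit class such as [0-9] carries the same meaning everywhere, which is one argument for leaving the escapes out.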

Grüße, Carsten

Status:   https://datatracker.ietf.org/doc/draft-bormann-jsonpath-iregexp/
Html:     https://www.ietf.org/archive/id/draft-bormann-jsonpath-iregexp-01.html
Diff:     https://www.ietf.org/rfcdiff?url2=draft-bormann-jsonpath-iregexp-01