[Jsonpath] Comments on I-regexp

Michael Kay <mike@saxonica.com> Sat, 14 October 2023 01:04 UTC

Excellent work.

I find it disappointing that the normative reference is to XSD 1.0 rather than 1.1, since the 1.0 spec contains some quite serious bugs fixed in 1.1. The fact that 1.0 is more widely implemented seems irrelevant; indeed it seems harmful, since I-RegExp implementors may turn to XSD 1.0 implementations for guidance, and it is known that XSD 1.0 implementors have found different ways of fixing the bugs in the spec.

I note that there is no mechanism for identifying Unicode characters by their numeric codepoint. This is not needed in XSD, because the XML escape convention (e.g. `&#x10000;`) is available. I can imagine that in other contexts this could make it quite difficult to write readable regexps, unless some host-language escape mechanism is available. Perhaps the RFC should suggest a convention for denoting "visually indistinctive" characters such as NBSP when I-Regexps are used in IETF specifications?
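As a rough illustration of the host-language escape point (not something taken from the draft), the sketch below writes a pattern containing NBSP using Python's and JSON's own `\u` escapes. Python's `re` module is used only as a stand-in matcher and is not an I-Regexp implementation; the helper name `matches` and the sample pattern are assumptions made up for this example.

```python
import json
import re

# Host-language escape: Python resolves \u00a0 to a literal NBSP
# before the regex engine ever sees the pattern.
pattern = "10\u00a0kg"

# The same pattern carried as a JSON string, using JSON's own \u escape.
pattern_from_json = json.loads('"10\\u00a0kg"')


def matches(regexp: str, text: str) -> bool:
    """Whole-string match via Python's re (a stand-in, not an I-Regexp engine)."""
    return re.fullmatch(regexp, text) is not None


print(pattern == pattern_from_json)    # both escapes yield the same pattern text
print(matches(pattern, "10\u00a0kg"))  # subject contains NBSP
print(matches(pattern, "10 kg"))       # subject contains an ordinary space
```

In both cases the regexp itself contains only the raw character; the readability comes entirely from the host notation, which is why a context with no such escape would be left with an invisible distinction in the pattern.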

It is stated that the only functionality supported is string matching; it may be worth mentioning, for those unfamiliar with XSD, that this means anchored string matching: the regexp must match the entire string, not merely a substring of it.
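For readers more used to search-style regex APIs, the difference can be sketched with Python's `re` module, assuming `re.fullmatch` as an approximation of anchored matching and `re.search` as the unanchored behaviour. The function names are hypothetical, and Python's regex dialect is not I-Regexp; the pattern `[0-9]+` was chosen because it reads the same in both.

```python
import re


def anchored_match(regexp: str, text: str) -> bool:
    """True only if the regexp matches the whole of text (XSD-style matching)."""
    return re.fullmatch(regexp, text) is not None


def unanchored_search(regexp: str, text: str) -> bool:
    """True if the regexp matches any substring of text (search-style matching)."""
    return re.search(regexp, text) is not None


print(anchored_match("[0-9]+", "123"))           # whole string is digits
print(anchored_match("[0-9]+", "abc123def"))     # digits are only a substring
print(unanchored_search("[0-9]+", "abc123def"))  # a substring is enough here
```

An implementation built on a search-style engine would need to add the anchoring itself, which is the sort of detail a reader unfamiliar with XSD could easily miss.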

My reading of the ABNF is that standalone hyphens have been restricted to appear at the start or end of a character group. This is a good solution to a problem that has been very troublesome in XSD; it is worth highlighting this as one of the differences from XSD.
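The two roles of the hyphen can be illustrated as follows. This is only a sketch using Python's `re` module, which is more permissive than the RFC grammar and is not an I-Regexp implementation; the helper name `matches` is an assumption for the example. The character groups shown place the hyphen only at the start, at the end, or between two range endpoints.

```python
import re


def matches(regexp: str, text: str) -> bool:
    """Whole-string match via Python's re (a stand-in, not an I-Regexp engine)."""
    return re.fullmatch(regexp, text) is not None


# Hyphen at the start or end of the group: a literal hyphen.
print(matches("[-az]", "-"))  # literal hyphen is a member
print(matches("[az-]", "-"))  # same, with the hyphen last
print(matches("[-az]", "m"))  # 'm' is not a member: no range is formed

# Hyphen between two characters: a range operator.
print(matches("[a-z]", "m"))  # 'm' lies within the range a..z
print(matches("[a-z]", "-"))  # the hyphen itself is not a member
```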

Specifying the syntax directly in the RFC, and the semantics by reference to a different specification that uses a different grammar, creates something of a disconnect. Essentially I think you're expected first to check that the I-RegExp parses according to the RFC grammar, then to reparse it according to the XSD grammar, which yields constructs that are referenced in the XSD semantics. The XSD semantics uses terms like CharGroupPart that appear in the XSD grammar but not in the RFC grammar; it also uses terms like quantifier that appear in both grammars but with different definitions. I don't think there are any insuperable problems here but it does make the detail quite hard to follow. Would it be better to lift the semantics out of XSD and into the RFC?

Michael Kay
Saxonica