Re: [openpgp] Possible ambiguity in description of regular expressions: [^][]

Andrew Gallagher <andrewg@andrewg.com> Tue, 05 January 2021 17:11 UTC

To: openpgp@ietf.org
References: <87r1nguquq.wl-neal@walfield.org> <87tusbuwzp.fsf@fifthhorseman.net> <87mtxzv7mr.wl-neal@walfield.org> <877dor8kl1.fsf@fifthhorseman.net>
From: Andrew Gallagher <andrewg@andrewg.com>
Message-ID: <87456fad-06cd-6605-b5d1-ea5ac49c9ee4@andrewg.com>
Date: Tue, 05 Jan 2021 17:11:36 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.1
MIME-Version: 1.0
In-Reply-To: <877dor8kl1.fsf@fifthhorseman.net>
Content-Type: multipart/signed; micalg="pgp-sha256"; protocol="application/pgp-signature"; boundary="AW7ErSGFj1DX559FD7YDOHp6YNp6YRRN1"
Archived-At: <https://mailarchive.ietf.org/arch/msg/openpgp/MiCSgvOoxSCVf8UiynDCvOpY2HI>
Subject: Re: [openpgp] Possible ambiguity in description of regular expressions: [^][]

On 05/01/2021 15:33, Daniel Kahn Gillmor wrote:
> But beyond the wordsmithing, if anyone thinks that Neal's interpretation
> (or my proposed clarification) is actually wrong or problematic, please
> speak up!

The original proposed change, "A range is a non-empty sequence of
characters enclosed in []", is clear and (mostly) effective, so IMO it
should be adopted even though it is insufficient on its own.

Now, consider the remaining corner cases:

[]]	: matches a closing square bracket
[^]]	: matches anything other than a closing square bracket
[[]	: matches an opening square bracket
[^[]	: matches anything other than an opening square bracket
[][]	: matches either square bracket
[^][]	: matches anything other than either square bracket
[^]	: is invalid (the ']' is read as a literal, so the sequence is never closed)

To tackle the insufficiency, I propose an additional change:

-If the sequence begins with '^', it matches any single character not 
from the rest of the sequence.
+If the sequence begins with '^', it matches any single character not 
from the rest of the sequence, which must then contain at least one 
further character following the '^'.

When read alongside "To include a literal ']' in the sequence, make it 
the first character (following a possible '^')", this should be 
sufficient to cover all corner cases.
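
To check that the two rules together really do cover the cases listed
above, here is a minimal sketch of a parser that applies them. The
function name parse_bracket is hypothetical and not taken from any
implementation; ranges such as a-z are deliberately left out to keep
the sketch short.

```python
def parse_bracket(expr):
    """Parse one bracket expression, e.g. '[^][]', into (negated, members).

    Hypothetical helper illustrating the rules discussed in this thread.
    Ranges ('a-z') are not handled here.
    """
    if not expr.startswith("["):
        raise ValueError("expected '[' at start: " + expr)
    i = 1
    negated = False
    # A leading '^' negates the sequence.
    if i < len(expr) and expr[i] == "^":
        negated = True
        i += 1
    members = set()
    # A ']' in first position (after a possible '^') is a literal.
    if i < len(expr) and expr[i] == "]":
        members.add("]")
        i += 1
    # Everything up to the next ']' is a literal member.
    while i < len(expr) and expr[i] != "]":
        members.add(expr[i])
        i += 1
    if i >= len(expr):
        raise ValueError("unterminated sequence: " + expr)
    if i != len(expr) - 1:
        raise ValueError("trailing characters after ']': " + expr)
    return negated, members


if __name__ == "__main__":
    for case in ["[]]", "[^]]", "[[]", "[^[]", "[][]", "[^][]", "[^]", "[]"]:
        try:
            print(case, parse_bracket(case))
        except ValueError as err:
            print(case, "rejected:", err)
```

In this sketch both [] and [^] are rejected for the same reason: the
only ']' present is consumed as a literal, so nothing closes the
sequence.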

(I considered "... the rest of the sequence, which must be non-empty",
but there it is unclear whether "which" refers to "the sequence" or to
"the rest of the sequence".)

However...

We should probably also explicitly note how to negate the special 
meaning of '^':

+To include a literal '^', place it anywhere other than the first
position in the sequence.

Now, this doesn't cover the (contrived) case where we may want to use a 
literal '^' as the beginning of an ASCII range:

[^-~]

But if absolutely necessary, one could refactor:

[_-~^]
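
The refactoring works because '^' (0x5E) and '_' (0x5F) are adjacent in
ASCII. A quick sketch to confirm that the two forms describe the same
set, assuming inclusive ranges; the helper name expand is hypothetical:

```python
def expand(lo, hi):
    """Return the set of ASCII characters from lo to hi, inclusive."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


# The set intended by a literal '^' starting a range: '^' through '~'.
intended = expand("^", "~")

# The refactored form: '_' through '~', plus a trailing literal '^'.
refactored = expand("_", "~") | {"^"}

assert intended == refactored
```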

While we're at it, we should also clarify range inclusivity:

-this is shorthand for the full list of ASCII characters between them
+this is shorthand for them and the full list of ASCII characters 
between them
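
To illustrate why the wording matters, here is a small sketch
contrasting the two readings of "between them". Both function names are
hypothetical and exist only for this comparison:

```python
def strictly_between(lo, hi):
    """Literal reading of the old text: endpoints excluded."""
    return {chr(c) for c in range(ord(lo) + 1, ord(hi))}


def inclusive_range(lo, hi):
    """Reading under the proposed text: endpoints included."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


# Under the literal reading, [a-c] would match only 'b'.
print(sorted(strictly_between("a", "c")))

# Under the proposed wording, [a-c] matches 'a', 'b' and 'c'.
print(sorted(inclusive_range("a", "c")))
```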

Also, many regex engines support backslash-escaping within a character
class. Does RFC 4880 support this or not? My reading is that it doesn't,
but it may be worth stating this explicitly as well (even though
backslash-escaping would be a more elegant solution to [^-~]).
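
For comparison, this is how one engine that does support escaping
treats the two forms. The sketch uses Python's re module purely as an
example; it says nothing about what RFC 4880 permits.

```python
import re

# With a backslash escape, '^' is a literal and starts the range '^' to '~'.
escaped = re.compile(r"[\^-~]")

# Without the escape, Python reads this as "anything except '-' or '~'".
unescaped = re.compile(r"[^-~]")

for ch in ["^", "a", "~", "-", "]"]:
    print(
        repr(ch),
        "escaped:", escaped.fullmatch(ch) is not None,
        "unescaped:", unescaped.fullmatch(ch) is not None,
    )
```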

Is there anything to be said for referring out to an external regex 
definition instead of reinventing the wheel? :-)

-- 
Andrew Gallagher