Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

Sean Leonard <dev+ietf@seantek.com> Fri, 08 July 2016 19:22 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: Sean Leonard <dev+ietf@seantek.com>
In-Reply-To: <92276351-a21c-302c-f0c8-7b4843c9b5f7@alum.mit.edu>
Date: Fri, 08 Jul 2016 12:22:13 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <0781109A-9AFB-42AA-8828-DA5CDF38C377@seantek.com>
References: <20160708121202.A3E5D12D19F@ietfa.amsl.com> <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu> <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com> <92276351-a21c-302c-f0c8-7b4843c9b5f7@alum.mit.edu>
To: Paul Kyzivat <pkyzivat@alum.mit.edu>
X-Mailer: Apple Mail (2.3124)
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/B9cDX1p_VSUO8GBpdBWLUOQoc5k>
Cc: Jonathan Hansford <jonathan@hansfords.net>, "draft-seantek-constrained-abnf@ietf.org" <draft-seantek-constrained-abnf@ietf.org>, "abnf-discuss@ietf.org" <abnf-discuss@ietf.org>
Subject: Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

> On Jul 8, 2016, at 9:49 AM, Paul Kyzivat <pkyzivat@alum.mit.edu> wrote:
> 
> On 7/8/16 12:23 PM, Sean Leonard wrote:
> 
>>> I have difficulty imagining how you would fit such constraints into
>>> the syntax. The syntax of ABNF is pretty nice for what it does, but it
>>> doesn't leave much room for extension because we are running out of
>>> special characters. (Perhaps the possibilities could be expanded by
>>> using unicode characters. Imagine what could be done with emojis as
>>> operator characters.)
>> 
>> Length constraints are expressible with current ABNF. Generically,
>> suppose you have an identifier comprised of ALPHAs and DIGITs:
>> 
>> identifier = 1*(ALPHA / DIGIT)
>> 
>> Well, if you want to limit identifier to 20 chars, then you can say max
>> twenty:
>> 
>> identifier = 1*20(ALPHA / DIGIT)
>> 
>> As Paul says, this is a repetition constraint, and is easy to analyze
>> when the repeated symbols are fixed-width. But what about variable width?
>> 
>> Well, suppose you have Domain from RFC 5321:
>> 
>>   Domain         = sub-domain *("." sub-domain)
>> 
>>   sub-domain     = Let-dig [Ldh-str]
>>   Let-dig        = ALPHA / DIGIT
>>   Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig
>> 
>> 
>> 
>> Now suppose you want to constrain Domain to be 255 chars. Such a
>> constrained domain might be called DNSDomain. Then you can say:
>> 
>>   DNSDomain ^ Domain = 1*255ASCII
> 
> Ah. Interesting!
> 
> 
>> Regarding Paul's Unicode point: it would be nice for ABNF to have an
>> extension that defines the domain of the input symbols, with specific
>> options for: ASCII-only, octets, fixed-bit-width units, UTF-8,
>> UTF-16LE/BE, and Unicode scalar values (0-0xD7FF, 0xE000-0x10FFFF), and
>> Unicode characters (non-characters such as 0xFFFF are excluded). The
>> most important of these are ASCII-only, octets, and Unicode scalar values.
> 
> Note that if you are using quoted strings in ABNF then you are restricted to their mapping to ASCII, which then extends obviously to Unicode.
> 
> I am not well informed about Unicode, but ISTM that if you are using ABNF with it then it makes sense to only do it over Unicode scalar values, while the conversion of those to/from a particular encoding like UTF-8 can be left to a pre/post-processor. You need to be a glutton for punishment to define your Unicode-based grammar over octets.

Perhaps. But, that is exactly what RFC 451x (LDAP series) and RFC 3629 (UTF-8 specification) do. The problem is that when you refer to RFC 3629’s ABNF literally, you are operating over octets.

But it is useful to define ABNF over octets when you are dealing with octets, e.g., when you’re defining Unicode encodings themselves!
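
To make "operating over octets" concrete, here is a rough Python sketch that transcribes the UTF8-octets rule from RFC 3629, Section 4 into a byte-level regular expression. The names UTF8_TAIL, UTF8_CHAR, UTF8_OCTETS and matches_utf8_octets are mine, not from any RFC or library, and this is an illustration rather than a validated implementation:

```python
import re

# UTF8-tail = %x80-BF
UTF8_TAIL = rb"[\x80-\xBF]"

# One alternative per branch of UTF8-1 .. UTF8-4 in RFC 3629, Section 4.
UTF8_CHAR = rb"|".join([
    rb"[\x00-\x7F]",                             # UTF8-1
    rb"[\xC2-\xDF]" + UTF8_TAIL,                 # UTF8-2
    rb"\xE0[\xA0-\xBF]" + UTF8_TAIL,             # UTF8-3 (no overlong forms)
    rb"[\xE1-\xEC]" + UTF8_TAIL + rb"{2}",
    rb"\xED[\x80-\x9F]" + UTF8_TAIL,             # UTF8-3 (skips surrogates)
    rb"[\xEE-\xEF]" + UTF8_TAIL + rb"{2}",
    rb"\xF0[\x90-\xBF]" + UTF8_TAIL + rb"{2}",   # UTF8-4 (no overlong forms)
    rb"[\xF1-\xF3]" + UTF8_TAIL + rb"{3}",
    rb"\xF4[\x80-\x8F]" + UTF8_TAIL + rb"{2}",   # UTF8-4 (stops at 0x10FFFF)
])

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + UTF8_CHAR + rb")*")


def matches_utf8_octets(data: bytes) -> bool:
    """Return True if the whole octet string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None
```

Every terminal in that grammar is an octet range, so the overlong and surrogate exclusions have to be spelled out branch by branch. Over scalar values the same set would be two ranges.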

> And is there any operational difference between defining ABNF over unsigned integers and defining it over Unicode scalar values?

Best to look up the definitions:
http://unicode.org/glossary/

“Technically” if you are in ISO 10646-land, the original code space was 31 bits wide, so the applicable integers were 0 - (2^31-1). At some point it got limited to 0x10FFFF for various practical reasons, resulting in 1114112 possible code points. As the line often attributed to Bill Gates goes, “640K ought to be enough for anybody.”

A Unicode code point is any integer from 0 to 0x10FFFF. But because of the Unicode surrogate technique in the 0xD800-0xDFFF range for UTF-16, the code points 0xD800-0xDFFF can never be assigned to so-called “actual characters”. The “Unicode scalar value” definition carves out a donut hole for these values.
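
As a sanity check on the arithmetic, a small Python sketch of the code point vs. scalar value distinction. The names MAX_CODE_POINT, SURROGATES and is_scalar_value are made up for this example:

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # 0xD800-0xDFFF inclusive


def is_scalar_value(n: int) -> bool:
    """True for 0-0xD7FF and 0xE000-0x10FFFF, i.e. code points minus surrogates."""
    return 0 <= n <= MAX_CODE_POINT and n not in SURROGATES


code_points = MAX_CODE_POINT + 1                # 17 planes * 65536
scalar_values = code_points - len(SURROGATES)   # minus the 2048-wide donut hole
```

Here code_points comes out to 1114112 and scalar_values to 1112064.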

Unfortunately, Unicode scalar value != Unicode characters. That is because some code points are specifically called “noncharacters”, which is to say, they have semantics in the Unicode character set to mean they are not in the character set. There are 66 of them and they are interspersed (although I suppose they can be screened out with bit patterns rather than requiring a big table).
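
The bit-pattern screening I have in mind would look roughly like this in Python. The noncharacters are the contiguous block U+FDD0-U+FDEF plus the last two code points of each of the 17 planes; is_noncharacter and NONCHARACTER_COUNT are names I made up for the sketch:

```python
def is_noncharacter(cp: int) -> bool:
    """Screen for Unicode noncharacters without a lookup table."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # not a code point at all
    # 32 noncharacters in the contiguous block U+FDD0-U+FDEF ...
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # ... plus U+xxFFFE and U+xxFFFF at the end of each of the 17 planes.
    return (cp & 0xFFFE) == 0xFFFE


# Count them by brute force over the whole code space.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(0x110000))
```

The brute-force count agrees with the figure above: 32 + 17 * 2 = 66.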

Hence, making ABNF for all of this is nontrivial. Depending on the use case, you might want to include the noncharacters, or you might not. You could, for example, use them to delimit “strings of Unicode characters” safely, assuming that you really are talking about strings of Unicode characters and not strings of arbitrary integer values that may or may not conform to some version of Unicode.

In my mind, it kind of comes down to, “do you really want to modify a pretty stable standard, ABNF, to incorporate an ever-shifting standard that keeps on giving you more and more 😫?” ABNF is based on ASCII, which really hasn’t changed since ’68. ASCII is not without its drawbacks (BEL, anyone?), but it has given us half a century of peace. I’m not against incorporating Unicode, but those are the issues as I see them.

Sean