Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

Paul Kyzivat <pkyzivat@alum.mit.edu> Fri, 08 July 2016 22:39 UTC

To: Sean Leonard <dev+ietf@seantek.com>
References: <20160708121202.A3E5D12D19F@ietfa.amsl.com> <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu> <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com> <92276351-a21c-302c-f0c8-7b4843c9b5f7@alum.mit.edu> <0781109A-9AFB-42AA-8828-DA5CDF38C377@seantek.com>
From: Paul Kyzivat <pkyzivat@alum.mit.edu>
Message-ID: <3ce7dbb5-24d5-909e-27a2-f8447f9bf3f5@alum.mit.edu>
Date: Fri, 08 Jul 2016 18:39:40 -0400
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:45.0) Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <0781109A-9AFB-42AA-8828-DA5CDF38C377@seantek.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/o9UbTxnE47xVmJzuV1_27kPQoxI>
Cc: Jonathan Hansford <jonathan@hansfords.net>, "draft-seantek-constrained-abnf@ietf.org" <draft-seantek-constrained-abnf@ietf.org>, "abnf-discuss@ietf.org" <abnf-discuss@ietf.org>
Subject: Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf
Precedence: list

On 7/8/16 3:22 PM, Sean Leonard wrote:

>> I am not well informed about Unicode, but ISTM that if you are using ABNF with it then it makes sense to only do it over Unicode scalar values, while leaving the conversion of those to/from a particular encoding like UTF-8 can be via a pre/post-processor. You need to be a glutton for punishment to define your Unicode-based grammar over octets.
>
> Perhaps. But, that is exactly what RFC 451x (LDAP series) and RFC 3629 (UTF-8 specification) do. The problem is that when you refer to RFC 3629’s ABNF literally, you are operating over octets.
>
> But it is useful to define ABNF over octets when you are dealing with octets, e.g., when you’re defining Unicode encodings themselves!
>
>> And is there any operational difference between defining ABNF over unsigned integers and defining it over Unicode scalar values?
>
> Best to look up the definitions:
> http://unicode.org/glossary/
>
> “Technically” if you are in ISO 10646-land, the applicable integers are 0 - (2^32-1). At some point it got limited to 0x10FFFF for various practical reasons, resulting in 1114112 possible code points. Like Bill Gates said, “640K ought to be enough for anybody.”
>
> A Unicode code point is any integer from 0 to 0x10FFFF. But because of the Unicode surrogate technique in the 0xD800-0xDFFF range for UTF-16, the code points 0xD800-0xDFFF can never be assigned to so-called “actual characters”. The “Unicode scalar value” definition carves out a donut hole for these values.

Interesting. As I said, I don't know all that much about Unicode.

> Unfortunately, Unicode scalar value != Unicode characters. That is because some code points are specifically called “noncharacters”, which is to say, they have semantics in the Unicode character set to mean they are not in the character set. There are 66 of them and they are interspersed (although I suppose they can be screened out with bit patterns rather than requiring a big table).

IIUC, all Unicode characters have a unique Unicode scalar value, but not 
all Unicode scalar values correspond to Unicode characters.
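
To make the distinctions above concrete, here is a minimal sketch in Python. The function names are hypothetical, chosen only for illustration. The ranges are the ones quoted above (code points 0 to 0x10FFFF, surrogates 0xD800-0xDFFF) plus the 66 noncharacters, which are U+FDD0..U+FDEF and the last two code points of each of the 17 planes. It does not attempt to say which scalar values are currently assigned to characters, since that depends on the Unicode version.

```python
# Hypothetical helper names; only the numeric ranges come from Unicode.
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n: int) -> bool:
    """Any integer from 0 to 0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """Code points reserved for UTF-16 surrogate pairs."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """A code point outside the surrogate hole."""
    return is_code_point(n) and not is_surrogate(n)


def is_noncharacter(n: int) -> bool:
    """U+FDD0..U+FDEF, or low 16 bits equal to FFFE or FFFF in any plane."""
    if not is_code_point(n):
        return False
    return 0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE
```

The second test in is_noncharacter is the "bit pattern" screening mentioned above: no table is needed for the per-plane noncharacters, only for the contiguous block at U+FDD0.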

> Hence, making ABNF for all of this, is nontrivial. Depending on the use cases, you might want to include them, or you might not want to include them. You could, for example, use them to delimit “strings of Unicode characters” safely, assuming that you really are talking about strings of Unicode characters and not strings of arbitrary integer values that may or may not conform to some version of Unicode.
>
> In my mind, it kind of comes down to, “do you really want to modify a pretty stable standard, ABNF, to incorporate an ever-shifting standard that keeps on giving you more and more 😫?” ABNF is based on ASCII which really hasn’t changed since ’68. ASCII is not without its drawbacks (BEL, anyone?), but it has given us half a century of peace. I’m not against incorporating Unicode, but those are all of the issues.

ISTM that you are making this harder than it needs to be.

It would be difficult (and IMO undesirable) to define ABNF over Unicode 
characters or over ASCII characters. You would have to restrict the 
numeric values that could be used, and so some ranges might also need 
to be declared invalid.

"Defined over X" could be a predicate that a verifier could decide for a 
given ABNF grammar. For instance you could verify that RFC 3261 is 
defined over octets. (It never references values greater than 255.) That 
means it is also defined over ranges that are supersets of octets, like 
16- and 32-bit ints.

OTOH, if you analyze an ABNF grammar and find it uses character values 
greater than 255 then you know you must not use it with an input/output 
stream that works in octets.

"Defined over ASCII" and "Defined over Unicode" could also be determined 
by analysis of the grammar.
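
As a rough illustration of such a verifier, here is a minimal sketch in Python. Everything named here (terminal_ranges, defined_over, the DOMAINS table) is hypothetical. It only scans the grammar text for RFC 5234 numeric terminals (%x, %d, %b) and quoted strings; it does not parse rules or expand referenced rule names such as ALPHA, so it assumes the grammar text is complete. "unicode" here means Unicode scalar values, so a range that spans the surrogates is rejected.

```python
import re

# One pass over the text: quoted strings, comments and prose-vals are
# consumed first so that "%" or ";" inside them is not misread.
_TOKEN = re.compile(
    r'"(?P<text>[^"]*)"'                 # char-val (quoted string)
    r'|;[^\n]*'                          # comment, skipped
    r'|<[^>]*>'                          # prose-val, skipped
    r'|%(?P<base>[bdxBDX])(?P<first>[0-9A-Fa-f]+)'
    r'(?:-(?P<last>[0-9A-Fa-f]+)|(?P<rest>(?:\.[0-9A-Fa-f]+)+))?'
)
_BASE = {"b": 2, "d": 10, "x": 16}

# Allowed value intervals per domain (inclusive).
DOMAINS = {
    "ascii": [(0x00, 0x7F)],
    "octets": [(0x00, 0xFF)],
    "unicode": [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)],
}


def terminal_ranges(grammar: str) -> list[tuple[int, int]]:
    """Collect every terminal value or range used in the grammar text."""
    ranges = []
    for m in _TOKEN.finditer(grammar):
        if m.group("text") is not None:
            ranges += [(ord(c), ord(c)) for c in m.group("text")]
        elif m.group("base"):
            base = _BASE[m.group("base").lower()]
            first = int(m.group("first"), base)
            if m.group("last"):
                ranges.append((first, int(m.group("last"), base)))
            else:
                ranges.append((first, first))
                # Concatenated form such as %d13.10
                for part in (m.group("rest") or "").split(".")[1:]:
                    ranges.append((int(part, base), int(part, base)))
    return ranges


def defined_over(grammar: str, domain: str) -> bool:
    """True if every terminal range fits inside one interval of the domain."""
    allowed = DOMAINS[domain]
    return all(
        any(a <= lo and hi <= b for a, b in allowed)
        for lo, hi in terminal_ranges(grammar)
    )


UTF8_EXCERPT = """
UTF8-2    = %xC2-DF UTF8-tail   ; two rules from RFC 3629
UTF8-tail = %x80-BF
"""
WIDE = "name = 1*( ALPHA / %x4E00-9FFF )  ; hypothetical rule"

print(defined_over(UTF8_EXCERPT, "octets"))  # True
print(defined_over(UTF8_EXCERPT, "ascii"))   # False
print(defined_over(WIDE, "octets"))          # False
print(defined_over(WIDE, "unicode"))         # True
```

Note that the RFC 3629 excerpt also passes the "unicode" check numerically, since every octet value is a scalar value. The predicate only says which input alphabets the grammar could be run over, not which one the author intended.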

If the grammar is "defined over Unicode" then you can use I/O that 
converts to UTF-8, UTF-16, etc.

But if it is not defined that way, then you can't safely use that 
kind of I/O. For instance SIP is formally defined over UTF-8. It is 
"mostly" ASCII, but in some places allows UTF-8 by defining the 
allowable sequences of bytes. (But the syntax is *old*. I have no idea 
whether changes in Unicode have invalidated it.)

	Thanks,
	Paul
