Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf

Sean Leonard <dev+ietf@seantek.com> Sat, 09 July 2016 07:22 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: Sean Leonard <dev+ietf@seantek.com>
In-Reply-To: <3ce7dbb5-24d5-909e-27a2-f8447f9bf3f5@alum.mit.edu>
Date: Sat, 09 Jul 2016 00:22:53 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <20989513-35C9-4AAC-8414-135392DEF6AC@seantek.com>
References: <20160708121202.A3E5D12D19F@ietfa.amsl.com> <43d3ffea-57de-f7ef-e740-e448564008ed@alum.mit.edu> <bf12cdc4-8d17-d6aa-cc01-5afad19127ac@seantek.com> <92276351-a21c-302c-f0c8-7b4843c9b5f7@alum.mit.edu> <0781109A-9AFB-42AA-8828-DA5CDF38C377@seantek.com> <3ce7dbb5-24d5-909e-27a2-f8447f9bf3f5@alum.mit.edu>
To: Paul Kyzivat <pkyzivat@alum.mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/abnf-discuss/5FDsEyFJpmD6lL2hyrTlGQvUAVk>
Cc: Jonathan Hansford <jonathan@hansfords.net>, "draft-seantek-constrained-abnf@ietf.org" <draft-seantek-constrained-abnf@ietf.org>, "abnf-discuss@ietf.org" <abnf-discuss@ietf.org>
Subject: Re: [abnf-discuss] Mail regarding draft-seantek-constrained-abnf
Precedence: list

> On Jul 8, 2016, at 3:39 PM, Paul Kyzivat <pkyzivat@alum.mit.edu> wrote:
> 
> On 7/8/16 3:22 PM, Sean Leonard wrote:
> 
>>> I am not well informed about Unicode, but ISTM that if you are using ABNF with it then it makes sense to only do it over Unicode scalar values, while leaving the conversion of those to/from a particular encoding like UTF-8 can be via a pre/post-processor. You need to be a glutton for punishment to define your Unicode-based grammar over octets.
>> 
>> Perhaps. But, that is exactly what RFC 451x (LDAP series) and RFC 3629 (UTF-8 specification) do. The problem is that when you refer to RFC 3629’s ABNF literally, you are operating over octets.
>> 
>> But it is useful to define ABNF over octets when you are dealing with octets, e.g., when you’re defining Unicode encodings themselves!
>> 
>>> And is there any operational difference between defining ABNF over unsigned integers and defining it over Unicode scalar values?
>> 
>> Best to look up the definitions:
>> http://unicode.org/glossary/
>> 
>> “Technically” if you are in ISO 10646-land, the applicable integers are 0 - (2^32-1). At some point it got limited to 0x10FFFF for various practical reasons, resulting in 1114112 possible code points. Like Bill Gates said, “640K ought to be enough for anybody.”
>> 
>> A Unicode code point is any integer from 0 to 0x10FFFF. But because of the Unicode surrogate technique in the 0xD800-0xDFFF range for UTF-16, the code points 0xD800-0xDFFF can never be assigned to so-called “actual characters”. The “Unicode scalar value” definition carves out a donut hole for these values.
> 
> Interesting. As I said, I don't know all that much about Unicode.
> 
>> Unfortunately, Unicode scalar value != Unicode characters. That is because some code points are specifically called “noncharacters”, which is to say, they have semantics in the Unicode character set to mean they are not in the character set. There are 66 of them and they are interspersed (although I suppose they can be screened out with bit patterns rather than requiring a big table).
> 
> IIUC, all Unicode characters have a unique Unicode scalar value, but not all Unicode scalar values correspond to Unicode code points.

Not quite; I think you may have it slightly out of order. Each set below is a superset of the one beneath it:

Unicode code points (0 - 10FFFF) >
 Unicode scalar values (0-D7FF, E000-10FFFF) >
  “Unicode characters” (excluding the noncharacters/noncharacter code points) (0-D7FF, E000-FDCF, FDF0-FFFD, 10000-1FFFD, 20000-2FFFD, ..., 100000-10FFFD) >
   assigned characters (expands as the Standard advances)
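
As a rough illustration of the first three layers, here is a minimal Python sketch. The function names are made up for this example and do not come from the draft or from any library. The noncharacter test uses a bit pattern rather than a table: the 66 noncharacters are FDD0-FDEF plus the last two code points of each of the 17 planes.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def is_code_point(n: int) -> bool:
    """Any integer from 0 to 0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """Code points reserved for UTF-16 surrogates."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """Code points minus the surrogate range."""
    return is_code_point(n) and not is_surrogate(n)


def is_noncharacter(n: int) -> bool:
    """FDD0-FDEF, plus xFFFE and xFFFF at the end of every plane."""
    if not is_code_point(n):
        return False
    return 0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE


def is_character(n: int) -> bool:
    """Scalar values minus the noncharacters (assigned or not)."""
    return is_scalar_value(n) and not is_noncharacter(n)


if __name__ == "__main__":
    every = range(MAX_CODE_POINT + 1)
    print("code points:  ", len(every))
    print("scalar values:", sum(is_scalar_value(n) for n in every))
    print("noncharacters:", sum(is_noncharacter(n) for n in every))
    print("characters:   ", sum(is_character(n) for n in every))
```

The fourth layer (assigned characters) is left out because it depends on the version of the Standard and cannot be derived from ranges alone.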

In Unicode, the terms “code point” and “character” are sometimes used interchangeably, and sometimes not. Code points are definitely 0 - 10FFFF. A code point is not a “character” in certain circumstances: for example, if the code point is a “noncharacter” (literally, a code point that is permanently designated as NOT a character), or if the code point is an “unassigned/reserved code point” (literally, a code point that is NOT yet assigned but is reserved for future assignment; once assigned, it presumably fits the definition of a character).
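
One way to see these distinctions in practice is to ask a Unicode database for a code point's General_Category. The sketch below uses Python's standard `unicodedata` module; `code_point_status` is a hypothetical helper name, and the "assigned" versus "unassigned" answer depends on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`). It treats private-use code points as "assigned", which is a simplification made for this example.

```python
import unicodedata


def code_point_status(cp: int) -> str:
    """Classify a code point as noncharacter, surrogate, unassigned, or assigned."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Noncharacters share category Cn with unassigned code points,
    # so they are separated out by range first.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        return "unassigned"  # reserved for future assignment
    return "assigned"        # includes private use (Co) in this sketch


if __name__ == "__main__":
    for cp in (0x41, 0xD800, 0xFDD0, 0x10FFFF):
        print(f"U+{cp:04X}: {code_point_status(cp)}")
```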

-Sean
