Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP

"Asmus Freytag (c)" <asmusf@ix.netcom.com> Thu, 06 February 2020 22:18 UTC

To: John C Klensin <john-ietf@jck.com>, i18ndir@ietf.org
References: <fd66eb72-2777-3f34-026b-00f4084b88ea@ix.netcom.com> <a7652163-6815-457b-b6b4-96affe237a32@ix.netcom.com> <A942D88A37437ED525455FD6@PSB> <caa945d8-e8e6-b206-710d-732b0e944c02@ix.netcom.com> <3382B597A3BB48B854C8D3BD@PSB>
From: "Asmus Freytag (c)" <asmusf@ix.netcom.com>
Message-ID: <29e332d1-2f50-e982-dd7b-ee3773a03bab@ix.netcom.com>
Date: Thu, 06 Feb 2020 14:18:53 -0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.4.2
MIME-Version: 1.0
In-Reply-To: <3382B597A3BB48B854C8D3BD@PSB>
Content-Type: multipart/alternative; boundary="------------922F16F59BB3E37BA71FF5EB"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/bmZpMnLuAr1xrP_fD7VOdzqOPnU>
Subject: Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP
Precedence: list

A couple of thoughts:

There are protocol parameters (like programming language keywords) and 
then there are object identifiers. For the former, issues like O vs. 0 
are annoyances - many of which can be avoided by proper choice of 
keywords. Object identifiers are chosen ad-hoc; only the repertoire and 
comparison rules are under the control of the protocol designer. And 
they may be directly supplied by/exposed to users.

Much of what you write below seems to me to primarily apply to such 
object identifiers and it's good to realize that even simple schemes 
(uppercase ASCII) have many edge conditions that need to be nailed down 
precisely. Even more true for more complex (internationalized) versions.

On compatibility: for a "bis" version of a protocol, there's always a 
question of whether it represents a break or a continuation from the 
original. Perhaps there are only some aspects that simply can't be kept 
compatible. However, anytime there is such a break, the question of 
limiting character set support to the universal character set ought to 
be raised.

On specifying utf-8 (not 9 :) ) vs. "Unicode". The high-level bit of 
this is the need to limit the Unicode encoding forms in the 
specification to a single one. Nobody benefits from having to support 
all of them.

UTF-8 has rather obvious advantages in interchange: it sidesteps the 
byte order problem (and therefore doesn't require any additional 
specification on how to uniquely serialize it to a byte stream). That 
should make it a good choice for any network protocol that isn't limited 
to strings of 7-bit items. (For those, some other serialization is 
needed, whether punycode or whatever).
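
To make the byte order point concrete, here is a minimal Python sketch. 
The function names and the sample strings are only illustrations, not 
anything taken from the draft under discussion. It serializes one string 
in UTF-8 and in both byte orders of UTF-16, and checks whether a string 
would fit a channel limited to 7-bit items.

```python
def serializations(text):
    """Return the octets produced by three Unicode serializations."""
    return {
        "utf-8": text.encode("utf-8"),          # one possible octet sequence
        "utf-16-be": text.encode("utf-16-be"),  # big-endian code units
        "utf-16-le": text.encode("utf-16-le"),  # little-endian code units
    }


def fits_in_seven_bits(text):
    """True if the UTF-8 form uses only octets below 0x80 (ASCII repertoire)."""
    return all(octet < 0x80 for octet in text.encode("utf-8"))


forms = serializations("Dürst")
for name, octets in forms.items():
    print(name, octets.hex(" "))

# The two UTF-16 serializations differ, so a specification using UTF-16
# would also have to pin down the byte order (or require a BOM).
print(forms["utf-16-be"] != forms["utf-16-le"])

# Strings outside the ASCII repertoire need an 8-bit-clean channel in UTF-8.
print(fits_in_seven_bits("Accept"), fits_in_seven_bits("Dürst"))
```

For a string restricted to the ASCII repertoire, the UTF-8 octets are the 
same as the ASCII octets, which is why the sketch treats "fits in seven 
bits" and "ASCII repertoire" as the same test.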

This does not identify how the Unicode string is to be processed. You 
mention that IDNs work on the code point level. That's fine, and 
independent of how the data is transmitted. It is part of the 
specification that needs to be supplied anyway: how these protocol 
values are to be compared, sorted, etc. (The rules may well differ 
between keywords and object identifiers).
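
As a rough illustration of why the comparison rules have to be spelled 
out separately from the wire encoding, the following Python sketch 
compares strings at the code point level. The two rules shown (ASCII 
case-insensitive matching for keywords, NFC normalization for object 
identifiers) are assumptions picked for the example; they are not the 
rules of IDNA or of any particular protocol.

```python
import unicodedata


def code_points(text):
    """The string as a list of code points, independent of any encoding form."""
    return [ord(ch) for ch in text]


def keyword_equal(a, b):
    """Hypothetical keyword rule: ASCII only, compared case-insensitively."""
    if not (a.isascii() and b.isascii()):
        raise ValueError("keywords are restricted to ASCII in this sketch")
    return a.lower() == b.lower()


def identifier_equal(a, b):
    """Hypothetical identifier rule: normalize to NFC, then compare code points."""
    return code_points(unicodedata.normalize("NFC", a)) == code_points(
        unicodedata.normalize("NFC", b)
    )


composed = "\u00e9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# Raw code point comparison treats the two spellings as different ...
print(code_points(composed) == code_points(decomposed))
# ... while the normalizing rule treats them as the same identifier.
print(identifier_equal(composed, decomposed))
print(keyword_equal("UTF-8", "utf-8"))
```

Neither function looks at octets: the same results hold whether the 
strings arrived as UTF-8, UTF-16 or anything else that decodes to the 
same code points.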

Finally, for containers, there's the question of what meta-data they 
should carry for their content. That's another issue altogether. If the 
content item is a document, defining the encoding it is in (or how to 
discover it) may be more universally applicable than defining a 
comparison method. The latter is more applicable to object identifiers.

Someone mentioned that this discussion should lead to a general 
guideline for protocol writers. I'm surprised there isn't one already, 
but if this is something to work out, count me in.

A./


On 2/6/2020 9:54 AM, John C Klensin wrote:
>
> --On Wednesday, February 5, 2020 20:18 -0800 "Asmus Freytag (c)"
> <asmusf@ix.netcom.com> wrote:
>
>> On 2/5/2020 5:23 PM, John C Klensin wrote:
>>> Asmus,
>>>
>>> In at least my case and I assume in Patrik's and John
>>> Levine's, when I said "ASCII", I might better have said "the
>>> ASCII graphic subset of Unicode", or "the Basic Latin
>>> repertoire of Unicode", or "Unicode code points in the range
>>> U+0030 (or perhaps U+0020) through U+007E (or maybe U+007A)",
>>> i.e., a repertoire, not a specific coding standard.
>> Right; so far I'm with you.
>>
>> I do agree that as far as repertoires go, it has some
>> simplicity to it, although even with "ASCII" you run into
>> questions of how you want to handle case.
> Not just case.  In different situations, one runs into issues
> with so-called "white space" characters and sequences of them,
> interpretations and matching of punctuation characters,
> line-ending conventions, and so on.  If one wanted simple and
> uncomplicated, one needs to go back to fairly early versions of
> BCD and even that, at least in principle, could force one to
> address string sorting and collation issues associated with
> consecutive spaces and the implications of punctuation.
>
> In addition, if one tries to map between less-extensive coding
> systems and more extensive ones (or even equally-extensive
> systems that have made different choices), one can easily get
> into questions for which the answers that seem obvious on first
> glance really are not.  If one goes from BCD (which was
> upper-case-only) to ASCII, it seems obvious that the letters of
> the former should be mapped to upper-case letters of the latter,
> but that may be exactly wrong for some applications and
> operating systems.  Should a sequence of space characters be
> mapped to a horizontal tab or left alone?  And so on. The
> correct answer may be "it depends".  It may depend on the
> application, the circumstances, or fairly subtle issues with the
> language and/or locale involved.  A recent conversation with
> people not on this list has reminded me that some of these
> issues are reflections of ones that have been around for
> millennia and for which we simply are not going to find "one
> size fits all" solutions, much less be able to specify them in a
> simple and clear way.
>
> I'm confident that you know all of the above but it
> reinforces several other aspects of the recent conversation
> about draft-ietf-httpbis-header-structure-15, both in terms of
> what we should suggest be done with that document and some
> generalizations we might make:
>
> (1) Unless it is extremely simple, any choice of character
> repertoire is going to require additional conventions for some
> uses and applications.  "Basic Latin upper case letters only"
> may be simple enough as long as the strings are short, but, as
> soon as digits are allowed, the "O" and "0" confusion/ambiguity
> issue arises (and a few here might be old enough to remember the
> then-nearly-universal convention of writing digit-zero with a
> slash, usually a reverse-slash, on coding sheets preparatory to
> punching cards or paper tape in the 1940s and 1950s [1]).  In
> any event, a corollary to the principle that making things more
> complex makes them harder and more error-prone is "don't do it
> unless you have to" and that is consistent with recommendations
> to stick to a restricted ASCII repertoire unless there is a real
> reason for allowing a significantly broader repertoire.
>
> (2) There may well be exceptions but much of the time arguments
> that there are strings -- especially in field values but often
> in running text -- that we don't need to be concerned about
> beyond simple character coding are likely to turn out to be
> incorrect.  For field values, the example most often cited
> historically is the email subject line, but mail is sorted and
> searched for in archives using that field all the time, and
> procedures for duplicate elimination in mail archives typically
> use that field as well as date and address ones as first-level
> keys.
>
> (3) At least for i18n purposes, for a document that specifies
> conventions for fields (or name-value pairs) that have been or
> will be specified elsewhere (including container documents like
> this one about structured headers) it is probably better to
> specify what field-specific documents must specify or explain
> than to write requirements, or even relatively specific
> guidance, for all such fields.
>
>>>     Given that all of the documents to
>>> this particular discussion are pushing UTF-8 and that a
>>> Unicode string from that repertoire in UTF-8 is
>>> indistinguishable at the octet level from an ASCII string
>>> encoded by right-justifying each seven-bit ASCII code point
>>> in an eight bit byte (octet) with a leading zero bit, the
>>> shortcut should be obvious.  But you are correct in that we
>>> were talking about repertoire, not encoding or choice of
>>> character set.
>> I also agree that the ASCII repertoire has the nice property
>> that it maps to the same octets whether in UTF-8, Latin-1 or many
>> other character sets.
>>
>> Whether a protocol needs to support a CCS declaration is a
>> separate question. Beyond explicitly identifying data as
>> utf-8, that is (rather than implicitly via choice of protocol).
> I agree that it is a separate question, but see above.  However,
> when we don't know how a string will be used in the future --and
> we rarely do-- supplying too much information is rarely harmful
> while supplying too little and leaving things to heuristics
> often turns out to be.
>
>> Allowing the protocol to be expressed in any other CCS than
>> utf-8 is something that in today's environment should have to
>> clear a substantial hurdle (by promising a really clear win
>> in compatibility with key installed technology, one that is
>> not outweighed by the costs it imposes downstream).
> I agree and would agree even more strongly had you said
> "Unicode" with UTF-9 only as a strong preference.  The reason
> for that distinction is at least one of the reasons why assorted
> operating systems use UTF-16 or UTF-32 internally: the
> variable-length relationship between a code point and the length
> of the string (in bits or octets) needed to represent it in
> UTF-8 can be a considerable nuisance in programming (as an
> example from earlier in this thread, normalization is about code
> points, not UTF-8 strings), even more of a nuisance than the
> need to use surrogates for non-BMP code points in UTF-16.  I'm
> not making that up: the "Comparison of the Advantages..."
> discussion in Section 2.5 of TUS 12.0 is, IMO, quite good and
> completely consistent with the above.  Now, if you had said
> something like "on the wire" rather than "allowing the protocol
> to be expressed", we would have been closer to agreement, but
> I'd still caution that one of the things we have learned very
> generally in the last 40 or 50 years is that strings that are
> expected to be used only internally in operating systems tend to
> leak into protocol fields.
>
> In this context, it is perhaps relevant that IDNA talks about,
> and is expressed in, code points, not a particular encoding
> form.
>
>>> That said, it may be worth remembering that, independent of
>>> what operating systems may or may not do, early Web
>>> specifications were written assuming ISO 8859-1 and that
>>> there are almost certainly some applications out there that
>>> assume that CCS when they see octets with the leading bit on.
>>> The ASCII repertoire as described above is still a proper
>>> subset, but, because 8859-1 (and 8859-x more generally) are
>>> not UTF-8-compatible, the ability to define the CCS and
>>> encoding in use may still be necessary even if the necessity
>>> is waning.
>> Precisely the reason to make sure new protocols don't add to
>> this problem other than with a gun to their heads, so to
>> speak. Definitely not on the "there might be" level of
>> reasoning, I'd think.
> I agree, but note that the document under discussion is about
> Structured Headers for HTTP and it would be very hard to argue
> that HTTP is a new protocol.
>
> best,
>     john
>
>
> [1] (I don't think I thought about it at the time, but I assume
> that would have been required to be a very clear reverse-slash
> in Norway, Denmark, etc. where U+00D8 is common.  And that may be
> a reasonable example of just how rapidly complexity rises.)
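
Two points from the quoted message can be checked in a few lines of 
Python: the variable-length relationship between a code point and its 
UTF-8 or UTF-16 representation, and the incompatibility between 
ISO 8859-1 and UTF-8 for octets with the leading bit on. This is only a 
sketch; the function names and the sample characters are chosen for 
illustration.

```python
def encoded_lengths(ch):
    """Sizes of one string in code points, UTF-8 octets and UTF-16 code units."""
    return {
        "code_points": len(ch),
        "utf8_octets": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }


def is_valid_utf8(octets):
    """True if the octets decode as UTF-8 without error."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One code point each, but a different number of octets or code units.
for ch in ["A", "\u00d8", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# U+00D8 is the single octet 0xD8 in ISO 8859-1; that octet alone is not
# well-formed UTF-8, so a UTF-8 receiver cannot read it.
latin1_octets = "\u00d8".encode("latin-1")
print(latin1_octets, is_valid_utf8(latin1_octets))

# The reverse mistake does not fail: UTF-8 octets read as ISO 8859-1
# silently turn one character into two.
misread = "\u00d8".encode("utf-8").decode("latin-1")
print(len(misread))
```

The non-BMP sample is the case that needs a surrogate pair in UTF-16, 
which is why it reports two code units for a single code point.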
