Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP

John C Klensin <john-ietf@jck.com> Thu, 06 February 2020 17:54 UTC

Date: Thu, 06 Feb 2020 12:54:26 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Asmus Freytag (c)" <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <3382B597A3BB48B854C8D3BD@PSB>
In-Reply-To: <caa945d8-e8e6-b206-710d-732b0e944c02@ix.netcom.com>
References: <fd66eb72-2777-3f34-026b-00f4084b88ea@ix.netcom.com> <a7652163-6815-457b-b6b4-96affe237a32@ix.netcom.com> <A942D88A37437ED525455FD6@PSB> <caa945d8-e8e6-b206-710d-732b0e944c02@ix.netcom.com>
X-Mailer: Mulberry/4.0.8 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/ZUmCNfjVqVVnbtCK1a1QbUy4ctg>
Subject: Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP


--On Wednesday, February 5, 2020 20:18 -0800 "Asmus Freytag (c)"
<asmusf@ix.netcom.com> wrote:

> On 2/5/2020 5:23 PM, John C Klensin wrote:
>> Asmus,
>> 
>> In at least my case and I assume in Patrik's and John
>> Levine's, when I said "ASCII", I might better have said "the
>> ASCII graphic subset of Unicode", or "the Basic Latin
>> repertoire of Unicode", or "Unicode code points in the range
>> U+0030 (or perhaps U+0020) through U+007E (or maybe U+007A)",
>> i.e., a repertoire, not a specific coding standard.
> 
> Right; so far I'm with you.
> 
> I do agree that as far as repertoires go, it has some
> simplicity to it, although even with "ASCII" you run into
> questions of how you want to handle case.

Not just case.  In different situations, one runs into issues
with so-called "white space" characters and sequences of them,
interpretations and matching of punctuation characters,
line-ending conventions, and so on.  If one wanted simple and
uncomplicated, one would need to go back to fairly early versions
of BCD and even that, at least in principle, could force one to
address string sorting and collation issues associated with
consecutive spaces and the implications of punctuation.

In addition, if one tries to map between less-extensive coding
systems and more extensive ones (or even equally-extensive
systems that have made different choices), one can easily get
into questions for which the answers that seem obvious on first
glance really are not.  If one goes from BCD (which was
upper-case-only) to ASCII, it seems obvious that the letters of
the former should be mapped to upper-case letters of the latter,
but that may be exactly wrong for some applications and
operating systems.  Should a sequence of space characters be
mapped to a horizontal tab or left alone?  And so on.  The
correct answer may be "it depends".  It may depend on the
application, the circumstances, or fairly subtle issues with the
language and/or locale involved.  A recent conversation with
people not on this list has reminded me that some of these
issues are reflections of ones that have been around for
millennia and for which we simply are not going to find "one
size fits all" solutions, much less be able to specify them in a
simple and clear way.  
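
A minimal Python sketch of the "it depends" point above.  The
folding rules in folded_key are a hypothetical convention chosen
purely for illustration, not one proposed in this thread or in
the draft; another application could reasonably choose others.

```python
import re

def naive_key(s: str) -> str:
    """No conventions at all: strings match only if identical."""
    return s

def folded_key(s: str) -> str:
    """One hypothetical matching convention (an assumption):
    unify line endings, collapse runs of spaces and tabs,
    trim the ends, and fold case."""
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    s = re.sub(r"[ \t]+", " ", s)
    # str.lower() also folds non-ASCII letters; the inputs used
    # here are restricted to the ASCII repertoire.
    return s.strip().lower()

a = "Content-Type:  text/plain\r\n"
b = "content-type: text/plain\n"

print(naive_key(a) == naive_key(b))    # False
print(folded_key(a) == folded_key(b))  # True
```

Even within the ASCII repertoire, whether the two strings are
"the same" depends entirely on which convention is applied.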

I'm confident that you know all of the above but it
reinforces several other aspects of the recent conversation
about draft-ietf-httpbis-header-structure-15, both in terms of
what we should suggest be done with that document and some
generalizations we might make:

(1) Unless it is extremely simple, any choice of character
repertoire is going to require additional conventions for some
uses and applications.  "Basic Latin upper case letters only"
may be simple enough as long as the strings are short, but, as
soon as digits are allowed, the "O" and "0" confusion/ambiguity
issue arises (and a few here might be old enough to remember the
then-nearly-universal convention of writing digit-zero with a
slash, usually a reverse-slash, on coding sheets preparatory to
punching cards or paper tape in the 1940s and 1950s [1]).  In
any event, a corollary to the principle that making things more
complex makes them harder and more error-prone is "don't do it
unless you have to" and that is consistent with recommendations
to stick to a restricted ASCII repertoire unless there is a real
reason for allowing a significantly broader repertoire.

(2) There may well be exceptions but much of the time arguments
that there are strings -- especially in field values but often
in running text -- that we don't need to be concerned about
beyond simple character coding are likely to turn out to be
incorrect.  For field values, the example most often cited
historically is email subject lines, but mail is sorted and
searched for in archives using that field all the time, and
procedures for duplicate elimination in mail archives typically
use that field, as well as date and address ones, as first-level
keys.

(3) At least for i18n purposes, for a document that specifies
conventions for fields (or name-value pairs) that have been or
will be specified elsewhere (including container documents like
this one about structured headers) it is probably better to
specify what field-specific documents must specify or explain
than to write requirements, or even relatively specific
guidance, for all such fields.

>>    Given that all of the documents to
>> this particular discussion are pushing UTF-8 and that a
>> Unicode string from that repertoire in UTF-8 is
>> indistinguishable at the octet level from an ASCII string
>> encoded by right-justifying each seven-bit ASCII code point
>> in an eight bit byte (octet) with a leading zero bit, the
>> shortcut should be obvious.  But you are correct in that we
>> were talking about repertoire, not encoding or choice of
>> character set.
> 
> I also agree that the ASCII repertoire has the nice property
> that it maps to the same octets whether in UTF-8, Latin-1 or
> many other character sets.
> 
> Whether a protocol needs to support a CCS declaration is a
> separate question. Beyond explicitly identifying data as
> utf-8, that is (rather than implicitly via choice of protocol).

I agree that it is a separate question, but see above.  However,
when we don't know how a string will be used in the future --and
we rarely do-- supplying too much information is rarely harmful
while supplying too little and leaving things to heuristics
often turns out to be.
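
A small Python sketch of why an explicit label matters once
octets leave the ASCII range.  The sample strings and the helper
name decodes_as_utf8 are assumptions made for illustration only.

```python
# Within the ASCII repertoire the octets are identical in all
# three encodings, so no label is needed to interpret them.
text = "Hello"
same_octets = (
    text.encode("ascii") == text.encode("utf-8") == text.encode("latin-1")
)

# Outside that range, the same octets are different text
# depending on which coded character set the receiver assumes.
octets = "\u00e9".encode("utf-8")      # U+00E9 as UTF-8: C3 A9
as_utf8 = octets.decode("utf-8")       # one character, U+00E9
as_latin1 = octets.decode("latin-1")   # two characters, U+00C3 U+00A9

def decodes_as_utf8(data: bytes) -> bool:
    """Heuristic check: is this octet string well-formed UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+00E9 encoded as ISO 8859-1 is the single octet E9, which is
# not well-formed UTF-8 on its own.
latin1_octets = "\u00e9".encode("latin-1")

print(same_octets)                     # True
print(len(as_utf8), len(as_latin1))    # 1 2
print(decodes_as_utf8(latin1_octets))  # False
```

A well-formedness check like this can reject some wrong guesses,
but the C3 A9 case decodes without error under both assumptions,
which is where heuristics alone fall short.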

> Allowing the protocol to be expressed in any other CCS than
> utf-8 is something that in today's environment should require
> clearing a substantial hurdle (by promising a really clear win
> in compatibility with key installed technology and which is
> not outweighed by imposing costs downstream).

I agree and would agree even more strongly had you said
"Unicode" with UTF-8 only as a strong preference.  The reason
for that distinction is at least one of the reasons why assorted
operating systems use UTF-16 or UTF-32 internally: the
variable-length relationship between a code point and the length
of the string (in bits or octets) needed to represent it in
UTF-8 can be a considerable nuisance in programming (as an
example from earlier in this thread, normalization is about code
points, not UTF-8 strings), even more of a nuisance than the
need to use surrogates for non-BMP code points in UTF-16.  I'm
not making that up; the "Comparison of the Advantages..."
discussion in Section 2.5 of TUS 12.0 is, IMO, quite good and
completely consistent with the above.  Now, if you had said
something like "on the wire" rather than "allowing the protocol
to be expressed", we would have been closer to agreement, but
I'd still caution that one of the things we have learned very
generally in the last 40 or 50 years is that strings that are
expected to be used only internally in operating systems tend to
leak into protocol fields.
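
The variable-length relationship described above can be made
concrete with a short Python sketch.  The sample characters and
the helper name unit_counts are illustrative assumptions; the
counts themselves follow from the definitions of the UTF-8 and
UTF-16 encoding forms.

```python
import unicodedata

def unit_counts(s: str) -> tuple:
    """Return (code points, UTF-8 octets, UTF-16 code units)."""
    return (
        len(s),                           # Python str is code points
        len(s.encode("utf-8")),
        len(s.encode("utf-16-be")) // 2,  # "-be" avoids adding a BOM
    )

print(unit_counts("A"))           # (1, 1, 1)
print(unit_counts("\u00e9"))      # (1, 2, 1)
print(unit_counts("\u20ac"))      # (1, 3, 1)
print(unit_counts("\U0001F600"))  # (1, 4, 2)  non-BMP: surrogate pair

# Normalization is defined on code points: "e" followed by
# U+0301 COMBINING ACUTE ACCENT composes to U+00E9 under NFC,
# which changes the UTF-8 length as a side effect.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(unit_counts(decomposed))    # (2, 3, 2)
print(unit_counts(composed))      # (1, 2, 1)
```

Code that normalizes or compares strings therefore has to decode
to code points first rather than operate on the UTF-8 octets
directly.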

In this context, it is perhaps relevant that IDNA talks about,
and is expressed in, code points, not any particular encoding
form.

>> That said, it may be worth remembering that, independent of
>> what operating systems may or may not do, early Web
>> specifications were written assuming ISO 8859-1 and that
>> there are almost certainly some applications out there that
>> assume that CCS when they see octets with the leading bit on.
>> The ASCII repertoire as described above is still a proper
>> subset, but, because 8859-1 (and 8859-x more generally) are
>> not UTF-8-compatible, the ability to define the CCS and
>> encoding in use may still be necessary even if the necessity
>> is waning.
> 
> Precisely the reason to make sure new protocols don't add to
> this problem other than with a gun to their heads, so to
> speak. Definitely not on the "there might be" level of
> reasoning, I'd think.

I agree, but note that the document under discussion is about
Structured Headers for HTTP and it would be very hard to argue
that HTTP is a new protocol.

best,
   john


[1] (I don't think I thought about it at the time, but I assume
that would have been required to be a very clear reverse-slash
in Norway, Denmark, etc. where U+00D8 is common.  And that may be
a reasonable example of just how rapidly complexity rises.)