Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP

"Asmus Freytag (c)" <asmusf@ix.netcom.com> Thu, 06 February 2020 04:18 UTC

To: John C Klensin <john-ietf@jck.com>, i18ndir@ietf.org
References: <fd66eb72-2777-3f34-026b-00f4084b88ea@ix.netcom.com> <a7652163-6815-457b-b6b4-96affe237a32@ix.netcom.com> <A942D88A37437ED525455FD6@PSB>
From: "Asmus Freytag (c)" <asmusf@ix.netcom.com>
Message-ID: <caa945d8-e8e6-b206-710d-732b0e944c02@ix.netcom.com>
Date: Wed, 05 Feb 2020 20:18:06 -0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.4.2
MIME-Version: 1.0
In-Reply-To: <A942D88A37437ED525455FD6@PSB>
Content-Type: multipart/alternative; boundary="------------48AFC1E9097A3BD2C315C0FA"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Ma7GrB41dL0-9YL7B445JjxKRYs>
Subject: Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP

On 2/5/2020 5:23 PM, John C Klensin wrote:
> Asmus,
>
> In at least my case and I assume in Patrik's and John Levine's,
> when I said "ASCII", I might better have said "the ASCII graphic
> subset of Unicode", or "the Basic Latin repertoire of Unicode",
> or "Unicode code points in the range U+0030 (or perhaps U+0020)
> through U+007E (or maybe U+007A)", i.e., a repertoire, not a
> specific coding standard.

Right; so far I'm with you.

I do agree that as far as repertoires go, it has some simplicity to it, 
although even with "ASCII" you run into questions of how you want to 
handle case.

>    Given that all of the documents to
> this particular discussion are pushing UTF-8 and that a Unicode
> string from that repertoire in UTF-8 is indistinguishable at the
> octet level from an ASCII string encoded by right-justifying each
> seven-bit ASCII code point in an eight-bit byte (octet) with a
> leading zero bit, the shortcut should be obvious.  But you are
> correct in that we were talking about repertoire, not encoding
> or choice of character set.

I also agree that the ASCII repertoire has the nice property that it
maps to the same octets whether in UTF-8, Latin-1, or many other
character sets.
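
To make that property concrete, here is a minimal sketch in Python. The
list of encodings is only an illustrative sample chosen for this example,
not an exhaustive claim about every character set, and the variable names
are made up for the sketch.

```python
# Printable ASCII repertoire: U+0020 through U+007E.
ascii_repertoire = "".join(chr(cp) for cp in range(0x20, 0x7F))

# Illustrative sample of ASCII-compatible encodings (an assumption of this
# sketch; encodings such as UTF-16 or EBCDIC would not behave this way).
encodings = ["ascii", "utf-8", "latin-1", "cp1252", "iso8859-15"]

# Encode the same repertoire in each encoding and compare the octets.
encoded = {name: ascii_repertoire.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1

print(all_identical)
```

The comparison only holds because the repertoire is restricted; adding a
single code point above U+007F would make the octet sequences diverge.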

Whether a protocol needs to support a CCS declaration, beyond explicitly
identifying data as UTF-8 (rather than doing so implicitly via the choice
of protocol), is a separate question.

Allowing the protocol to be expressed in any CCS other than UTF-8 is
something that, in today's environment, should have to clear a
substantial hurdle: it should promise a really clear win in compatibility
with key installed technology, one that is not outweighed by the costs it
imposes downstream.

>
> That said, it may be worth remembering that, independent of what
> operating systems may or may not do, early Web specifications
> were written assuming ISO 8859-1 and that there are almost
> certainly some applications out there that assume that CCS when
> they see octets with the leading bit on.  The ASCII repertoire
> as described above is still a proper subset, but, because 8859-1
> (and 8859-x more generally) are not UTF-8-compatible, the
> ability to define the CCS and encoding in use may still be
> necessary even if the necessity is waning.

Precisely the reason to make sure new protocols don't add to this 
problem other than with a gun to their heads, so to speak. Definitely 
not on the "there might be" level of reasoning, I'd think.
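
A small Python sketch of the incompatibility being discussed, for
illustration only. The sample string and the helper name
`decode_utf8_or_none` are assumptions of this example; it simply shows
what happens when octets with the leading bit on are read under the
wrong CCS.

```python
# "cafe" with U+00E9 (e with acute) as the last character.
text = "caf\u00e9"

latin1_octets = text.encode("latin-1")  # one octet for U+00E9
utf8_octets = text.encode("utf-8")      # two octets for U+00E9


def decode_utf8_or_none(octets: bytes):
    """Hypothetical helper: return the decoded text, or None if invalid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


# A receiver assuming ISO 8859-1 silently misreads UTF-8 data (mojibake).
misread = utf8_octets.decode("latin-1")

# A receiver assuming UTF-8 can at least detect that Latin-1 data is wrong.
rejected = decode_utf8_or_none(latin1_octets)

print(latin1_octets, utf8_octets, repr(misread), rejected)
```

Note the asymmetry: every octet sequence is "valid" ISO 8859-1, so the
wrong assumption in that direction goes undetected.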

A./

>
> best,
>    john
>
>
>
> --On Wednesday, February 5, 2020 14:50 -0800 Asmus Freytag
> <asmusf@ix.netcom.com> wrote:
>
>> (didn't go out to the list when I first sent it)
>>
>> When I read the word "exception" I always think of this:
>>
>> When we built the first Unicode-enabled OS (Windows NT), we
>> had a long discussion of which "strings" in the OS needed to
>> be Unicode.
>>
>> Some thought that there was a clear dividing line between data
>> and what would be called "protocol values" in another context.
>>
>> Some of the latter did look like they were easily limited to
>> ASCII; but everywhere we found "exceptions". There might be a
>> set of enumerable tokens, but it allowed extended values that
>> were network or file identifiers.
>>
>> After exhaustively researching everything, the conclusion was
>> that every single string in the OS had to be Unicode (and
>> making any exceptions was either not possible, or not worth
>> the effort).
>>
>> However, while all strings were encoded in Unicode, not all
>> string values were allowed. While file names could be
>> localized (within the limits of file system syntax), some of
>> the enumerated strings were left limited to the ASCII set in
>> repertoire (even if encoded in Unicode).
>>
>> Reading this discussion (and I'm sorry I don't have the time
>> right now to properly delve into the details) it seems that a
>> natural recommendation would be to require Unicode for any
>> native representation and, if necessary (or possible), limit
>> the repertoire.
>>
>> This also requires a definition of the matching protocol for
>> all strings that are to be matched as part of the protocol (or
>> should be searchable). For any format, that would cover issues
>> of casing, white space handling etc., but for Unicode, by
>> necessity, that also requires defining the normalization form
>> to be used.
>>
>> A./
>>
>> PS: given how few systems these days natively operate in any
>> character set other than Unicode, I am always astonished at
>> the lengths to which people go to justify not making something
>> native Unicode. They just pick up conversion issues when they
>> use platform libraries to do any work or display.
>>
>
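
The quoted message above mentions two ideas that can be sketched briefly
in Python: limiting the repertoire of a string that is nevertheless
encoded in Unicode, and defining a matching rule that covers casing and
normalization. This is a minimal illustration under stated assumptions,
not a proposal for any particular protocol. The function names
`is_ascii_repertoire` and `match_key` are hypothetical, and the choice of
NFC plus case folding is just one possible matching definition.

```python
import unicodedata


def is_ascii_repertoire(s: str) -> bool:
    """True if every code point is in the ASCII graphic range U+0020..U+007E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in s)


def match_key(s: str) -> str:
    """Hypothetical matching key: NFC-normalize, case-fold, then NFC again.

    The second normalization is there because case folding can, in
    general, produce text that is no longer normalized.
    """
    folded = unicodedata.normalize("NFC", s).casefold()
    return unicodedata.normalize("NFC", folded)


# Precomposed vs. decomposed spelling of the same word, differing in case.
precomposed = "Caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)                        # raw comparison
print(match_key(precomposed) == match_key(decomposed))  # comparison via key
print(is_ascii_repertoire("gzip"), is_ascii_repertoire(precomposed))
```

Without an agreed normalization form, the two spellings compare as
different strings even though they are canonically equivalent.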