Re: [I18ndir] Fwd: Re: Working Group Last Call: Structured Headers for HTTP

John C Klensin <john-ietf@jck.com> Thu, 06 February 2020 01:23 UTC

Date: Wed, 05 Feb 2020 20:23:02 -0500
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <A942D88A37437ED525455FD6@PSB>
In-Reply-To: <a7652163-6815-457b-b6b4-96affe237a32@ix.netcom.com>
References: <fd66eb72-2777-3f34-026b-00f4084b88ea@ix.netcom.com> <a7652163-6815-457b-b6b4-96affe237a32@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Dt46CZCUO88C1YTvtFhxQGDtl3I>

Asmus,

In at least my case, and I assume in Patrik's and John Levine's,
when I said "ASCII", I might better have said "the ASCII graphic
subset of Unicode", or "the Basic Latin repertoire of Unicode",
or "Unicode code points in the range U+0030 (or perhaps U+0020)
through U+007E (or maybe U+007A)", i.e., a repertoire, not a
specific coding standard.  Given that all of the documents in
this particular discussion are pushing UTF-8 and that a Unicode
string from that repertoire in UTF-8 is indistinguishable at the
octet level from an ASCII string encoded by right-justifying each
seven-bit ASCII code point in an eight-bit byte (octet) with a
leading zero bit, the shortcut should be obvious.  But you are
correct in that we were talking about repertoire, not encoding
or choice of character set.
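
The octet-level equivalence described above can be checked with a
short Python sketch.  It is illustrative only: the bounds U+0020
through U+007E are one of the candidate ranges mentioned above, not
a settled choice, and the function name is hypothetical.

```python
def in_ascii_graphic_repertoire(s: str) -> bool:
    """Return True if every code point is in U+0020..U+007E.

    The bounds are an assumption for illustration; the message
    leaves the exact range open.
    """
    return all(0x20 <= ord(ch) <= 0x7E for ch in s)


sample = "Structured-Headers: ok"

# Encode the same repertoire-limited string both ways.
utf8_octets = sample.encode("utf-8")
ascii_octets = sample.encode("ascii")

# For this repertoire the octet sequences are identical, and every
# octet has its leading (high) bit clear.
same_octets = utf8_octets == ascii_octets
high_bit_clear = all(octet < 0x80 for octet in utf8_octets)
```

A check of this kind restricts the repertoire while leaving the
encoding (UTF-8) unchanged, which is the distinction being drawn.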

That said, it may be worth remembering that, independent of what
operating systems may or may not do, early Web specifications
were written assuming ISO 8859-1 and that there are almost
certainly some applications out there that assume that coded
character set (CCS) when they see octets with the leading bit on.
The ASCII repertoire as described above is still a proper subset,
but, because 8859-1 (and 8859-x more generally) are not
UTF-8-compatible, the ability to define the CCS and encoding in
use may still be necessary even if the necessity is waning.
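
A minimal Python sketch of that incompatibility follows.  The sample
string is an arbitrary choice for illustration, and the variable
names are hypothetical.

```python
# "cafe" with U+00E9 (e with acute) as the last character.
text = "caf\u00e9"

latin1_octets = text.encode("iso-8859-1")  # one octet, 0xE9, for U+00E9
utf8_octets = text.encode("utf-8")         # two octets, 0xC3 0xA9

# Octets produced under ISO 8859-1 are not valid UTF-8 once the
# leading bit is on.
try:
    latin1_octets.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# An application that assumes ISO 8859-1 decodes UTF-8 octets without
# any error, but gets two wrong characters in place of one.
misread = utf8_octets.decode("iso-8859-1")

# Within the ASCII repertoire the two encodings agree octet for octet.
ascii_part_agrees = "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")
```

Because the misreading raises no error, a receiver cannot reliably
detect it from the octets alone, which is why a way to declare the
CCS and encoding may still be needed.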

best,
  john



--On Wednesday, February 5, 2020 14:50 -0800 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

> (didn't go out to the list when I first sent it)
> 
> When I read the word "exception" I always think of this:
> 
> When we built the first Unicode-enabled OS (Windows NT), we
> had a long discussion of which "strings" in the OS needed to
> be Unicode.
> 
> Some thought that there was a clear dividing line between data
> and what would be called "protocol values" in another context.
> 
> Some of the latter did look like they were easily limited to
> ASCII; but everywhere we found "exceptions". There might be a
> set of enumerable tokens, but it allowed extended values that
> were network or file identifiers.
> 
> After exhaustively researching everything, the conclusion was
> that every single string in the OS had to be Unicode (and
> making any exceptions was either not possible, or not worth
> the effort).
> 
> However, while all strings were encoded in Unicode, not all
> string values were allowed. While file names could be
> localized (within the limits of file system syntax), some of
> the enumerated strings were left limited to the ASCII set in
> repertoire (even if encoded in Unicode).
> 
> Reading this discussion (and I'm sorry I don't have the time
> right now to properly delve into the details) it seems that a
> natural recommendation would be to require Unicode for any
> native representation and, if necessary (or possible), limit
> the repertoire.
> 
> This also requires a definition of the matching protocol for
> all strings that are to be matched as part of the protocol (or
> should be searchable). For any format, that would cover issues
> of casing, white space handling etc., but for Unicode, by
> necessity, that also requires defining the normalization form
> to be used.
> 
> A./
> 
> PS: given how few systems these days natively operate in any
> character set other than Unicode, I am always astonished at
> the lengths to which people go to justify not making something
> native Unicode. They just pick up conversion issues when they
> use platform libraries to do any work or display.
>
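
The quoted point that matching Unicode strings requires rules for
casing, white space, and a normalization form can be sketched in
Python as follows.  The specific choices here (NFC, full case
folding, collapsing runs of white space) are assumptions made for
illustration, not anything the thread specifies, and `match_key` is
a hypothetical helper name.

```python
import unicodedata


def match_key(s: str) -> str:
    """Build a comparison key for protocol-level string matching.

    Assumed rules, for illustration only: trim and collapse white
    space, normalize to NFC, apply full case folding, then normalize
    again because case folding can produce unnormalized output.
    """
    collapsed = " ".join(s.split())
    normalized = unicodedata.normalize("NFC", collapsed)
    return unicodedata.normalize("NFC", normalized.casefold())


composed = "Caf\u00e9 "      # precomposed U+00E9, trailing space
decomposed = "cafe\u0301"    # "e" followed by combining acute U+0301

# The raw strings differ, but their match keys are equal.
raw_equal = composed == decomposed
keys_equal = match_key(composed) == match_key(decomposed)
```

Without an agreed normalization form, the two spellings above would
compare as different strings even though they render identically.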