Re: [art] [I18ndir] Modern Network Unicode

John C Klensin <john-ietf@jck.com> Thu, 11 July 2019 17:25 UTC

Date: Thu, 11 Jul 2019 13:25:09 -0400
From: John C Klensin <john-ietf@jck.com>
To: Carsten Bormann <cabo@tzi.org>
cc: "Asmus Freytag (c)" <asmusf@ix.netcom.com>, i18ndir@ietf.org, art@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/Cbvz_nTqO22Zj8PgKjeiIuV4JNY>


--On Thursday, July 11, 2019 18:44 +0200 Carsten Bormann
<cabo@tzi.org> wrote:

> On Jul 11, 2019, at 18:31, John C Klensin <john-ietf@jck.com>
> wrote:
>> 
>> (top post and adding the ART list back in)
>> 
>> Carsten, 
>> I think Asmus and I are in complete agreement, but you might
>> find an attempt to summarize the state of work for relevant
>> characteristics and behaviors on the web, some of which
>> interact with this discussion, illuminating.  It is a chart
>> with explanation, not a long document.  See
>> https://w3c.github.io/typography/gap-analysis/language-matrix.html
> 
> Thank you for this interesting link.
> This is very condensed information; what column should I look
> for specifically on the issue of normalization and
> applicability of NFC? E.g., in the linked page
> https://w3c.github.io/iip/gap-analysis/beng-gap I find
> something about grapheme clusters but nothing about
> normalization — normalization of course is not an immediate
> need for Web interoperability.

I'll try to hunt for it when I have time unless someone else
gets to it first, but there are W3C recommendations that
specifically advise against inserting a normalization step
before transmission or immediately before storage, and that
instead treat normalization as a comparison-time operation.
That is consistent with what Asmus and I have been trying to
tell you.
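
As a rough illustration of "comparison-time" normalization, here
is a minimal Python sketch.  The helper name same_text and the
sample strings are my own assumptions, not taken from any W3C
document; the only library call used is the standard
unicodedata.normalize.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings under NFC without altering either one.

    Hypothetical helper: normalization happens only here, at
    comparison time, never before storage or transmission.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

stored = decomposed         # kept exactly as received

assert stored != composed           # the code point sequences differ
assert same_text(stored, composed)  # but they match when compared
assert stored == "cafe\u0301"       # and the stored form is untouched
```

The stored string keeps whatever form its creator chose; only the
comparison sees normalized copies.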

Let me try to summarize those comments in a different way:

(1) For most language-script combinations, the form in which a
network application gets a string is usually going to be the
right form for that string.  If there is a strong preference
that the string be normalized, it will probably already be
normalized when it reaches the network application (and probably
when it leaves something rather close to the keyboard).  If
there is a strong preference that it not be normalized, it will
probably come to you in _that_ form.  If there is no preference
at all, normalizing the string (as long as you stay away from
NFKC/NFKD) will probably not cause harm, but it won't buy much
either.

(2) Checking whether a string is normalized before sending it is
unlikely to produce any safely actionable information unless you
know a lot about the language, script, and circumstances.

(3) If a string is unnormalized for what whoever created it
thinks is a good reason, then normalizing on general principles
is going to lose whatever information that decision was intended
to convey.
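
To make points (1) through (3) a little more concrete, here is a
small Python sketch.  The helper is_nfc and the choice of sample
characters are mine and purely illustrative; the code relies only
on the standard unicodedata.normalize call.  It shows that an NFC
check can tell you a string differs from its normalized form, but
not why, and that normalizing cannot be undone afterwards.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Hypothetical helper: True if NFC would leave s unchanged."""
    return unicodedata.normalize("NFC", s) == s


# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5
# (LATIN CAPITAL LETTER A WITH RING ABOVE).
angstrom = "\u212b"
assert not is_nfc(angstrom)

# NFC replaces it, and no later normalization step brings it back,
# so whatever the sender meant by choosing U+212B is lost.
nfc = unicodedata.normalize("NFC", angstrom)
assert nfc == "\u00c5"
assert unicodedata.normalize("NFD", nfc) == "A\u030a"
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"

# The compatibility forms go further: NFC leaves the "fi" ligature
# (U+FB01) alone, while NFKC flattens it to two plain letters.
ligature = "\ufb01"
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
```

Nothing in the check itself says whether the sender used the
unnormalized character deliberately, which is why acting on it
requires knowledge of the language, script, and circumstances.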

The above may be different for strings that are specifically
intended as identifiers, but I don't think that is what you are
talking about.  I used to believe that "normalize everywhere"
and "store strings only in normalized form" were good ideas, but
I've been gradually (and sometimes quite unpleasantly) educated
by a long series of cases and counterexamples.  If I were
updating RFC 5198 today, that recommendation is the thing I
would be most likely to either change or at least moderate and
qualify considerably.

    john