Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Shawn Steele <Shawn.Steele@microsoft.com> Thu, 19 March 2015 10:21 UTC

From: Shawn Steele <Shawn.Steele@microsoft.com>
To: John C Klensin <john-ietf@jck.com>
Thread-Topic: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
Thread-Index: AQHQYeWvrIxB8XNhl06inScNBJS0IJ0jC27ggAAKHACAAB/PsIAAPkeAgAAdDOA=
Date: Thu, 19 Mar 2015 10:20:44 +0000
Message-ID: <BLUPR03MB137839811770E882A0EDA60C82010@BLUPR03MB1378.namprd03.prod.outlook.com>
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com> <55008F97.8040701@ix.netcom.com> <CA+9kkMAcgSA1Ch0B9W1Np0LMn2udegZ=AzU1b26dAi+SDcbGgg@mail.gmail.com> <CY1PR0301MB07310C68F6CFDD46AE22086F82190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150311200941.GV15037@mx1.yitter.info> <CY1PR0301MB0731F4EBE5EB5C3340F7059282190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150319014018.GI5743@mx1.yitter.info> <BLUPR03MB1378184CE32E928A3086665582010@BLUPR03MB1378.namprd03.prod.outlook.com> <20150319023029.GA6046@mx1.yitter.info> <BLUPR03MB137886903F15000BB01E3F5882010@BLUPR03MB1378.namprd03.prod.outlook.com> <A62526FD387D08270363E96E@JcK-HP8200.jck.com>
In-Reply-To: <A62526FD387D08270363E96E@JcK-HP8200.jck.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/Tb_iuNQFEmfjciHZomqSTBXPdqc>
Cc: "lucid@ietf.org" <lucid@ietf.org>, Andrew Sullivan <ajs@anvilwalrusden.com>
Subject: Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

> First, saying "NFC/NFKC rules" tends to obscure the issue here.

> An important property of NFC/NFD is reversibility, i.e., a dual relationship in the mathematical sense....

I was trying to make an example.

Presuming I created a system where I mapped A-C to "1" and D-G to "2" and F-P to "3" and S-Z to "4", then I could still have a system that provided machine-readable unique identifiers.  There'd be more than one way to write them, but the mechanism is completely unambiguous.  Later if I decided to map !@#$%^ to "5" and &*()_+ to "6", I'd have more identifiers, but they'd still be unique to the machine.
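
To make that concrete, here is a rough sketch of such a mapping in Python.  The names (build_table, to_identifier, V1, V2) are made up for illustration.  One assumption: the third bucket is taken as H-P rather than F-P so that it doesn't overlap D-G, and Q and R are simply treated as not permitted.

```python
def build_table(buckets):
    """Build a char -> digit table, refusing to map any character twice."""
    table = {}
    for chars, digit in buckets:
        for ch in chars:
            if ch in table:
                raise ValueError(f"{ch!r} is already mapped")
            table[ch] = digit
    return table


def to_identifier(text, table):
    """Map every character to its bucket digit; KeyError if not permitted."""
    return "".join(table[ch] for ch in text)


# First version of the system (H-P is an assumption, see above).
V1 = [("ABC", "1"), ("DEFG", "2"), ("HIJKLMNOP", "3"), ("STUVWXYZ", "4")]
# Later version: more inputs are permitted, nothing old is remapped.
V2 = V1 + [("!@#$%^", "5"), ("&*()_+", "6")]

t1 = build_table(V1)
t2 = build_table(V2)

# More than one way to write the same identifier ...
assert to_identifier("CAT", t1) == to_identifier("BAT", t1)

# ... and every input valid under V1 still maps to the same value under V2.
for word in ("CAT", "BAT", "DOG", "ZEST"):
    assert to_identifier(word, t1) == to_identifier(word, t2)

# New inputs only add identifiers; they do not disturb existing ones.
print(to_identifier("A!", t2))
```

The only point of the sketch is that the function is many-to-one yet deterministic, and that extending the permitted set leaves existing results alone.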

We have a function where f(input) = X.  Every input will repeatedly and perfectly result in the same X.  The set of values that X may contain may change over time as more inputs are permitted, however every input will continue to be transformed by this function into the same X.  (Except for the change from IDNA 2003 to IDNA 2008, which broke this stability, but I'm hoping we don't do that again).

So we already have mathematically unambiguous identifiers - within the system that the machine uses.

> Sorry.  I think you can argue that it doesn't add enough value to be worth the trouble.  I disagree with that given an appropriate definition of the problem and believe we don't agree
> on that definition.   But "doesn't add value"... well, we disagree.

Disagreeing is OK :)

Mathematically, we have unique identifiers.  The problem is created when we introduce the human.

> However, many of those decisions -- about coding, not about characters -- do have alternatives.

Yup, many of those were attempted in code pages.  Often with side effects that were not very nice in hindsight.

>  "Unicode could have been designed differently" (paraphrasing).

I strongly suggest that if you are passionate about how characters are encoded, and feel that you could help encode new characters better, that you participate in Unicode.  I don't pretend to know everything about Unicode, but I fear you're seriously oversimplifying the complexity of the problem.  There are reasons, some good, some bad, for why stuff is the way it is in Unicode.  I don't think that the IETF should be in the business of fixing or second-guessing Unicode.  If people want a better Unicode, then participate.

> Again, I don't think any of those decisions are "wrong".  But they are all problematic for the IETF's language-insensitive, fairly context-free, identifier comparison purposes.  And they are, at least IMO, worth some effort because (again, independent of discussions about "confusion"), at least,

It's only problematic once you ask a human to interpret/transcribe/read the identifier.  There's no ambiguity at the machine level.

> (ii) When different input methods, using data entry devices that are indistinguishable to the user (e.g., the alphabetic key layouts on a German keyboard for Windows, Linux, and the Mac are the same) and will produce different output (stored) strings for the same input, we are dealing with coding artifacts, not "visual confusion".  Whether the difference in internal coding is the decision of one system to prefer NFC and that of another to prefer NFD or the result of one typist using an "ö" key and another deciding to type an "o" and a "dead key" umlaut, we have (and IMO, should have) comparison systems that eliminate those coding differences.  This is, from that point of view, just a new set of coding decision differences that neither Unicode nor we arranged to compensate for earlier.

That's handled in IDNA by using NFKC.  I haven't heard that there is any problem with using IDNA for identifiers due to using ö or o + umlaut to represent German.

AFAIK if the human enters the correct characters using their keyboard in a normal fashion, using the correct script for their language, and barring any transcription errors, typos or homograph confusion, IDNA will consistently point to the same record regardless of the input method.  I don't recall any examples where use of an appropriate input method/keyboard would lead to ambiguous identifiers.  
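
As a quick check of the ö case, here is a small Python sketch.  It assumes the two input methods end up storing U+00F6 in one case and "o" followed by U+0308 (combining diaeresis) in the other; the label "köln" is just an example.  Note that Python's built-in "idna" codec implements IDNA 2003 (RFC 3490, Nameprep with NFKC), so this only illustrates that flavor.

```python
import unicodedata

precomposed = "k\u00f6ln"   # "ö" stored as the single code point U+00F6
decomposed = "ko\u0308ln"   # "o" followed by combining diaeresis U+0308

# The stored strings differ at the code point level ...
assert precomposed != decomposed

# ... but normalization removes that coding difference.
for form in ("NFC", "NFKC"):
    assert (unicodedata.normalize(form, precomposed)
            == unicodedata.normalize(form, decomposed))


def to_ascii_label(label):
    """Convert one label with Python's IDNA 2003 codec (Nameprep + Punycode)."""
    return label.encode("idna")


# Both spellings resolve to the same ASCII label.
same_label = to_ascii_label(precomposed) == to_ascii_label(decomposed)
print(same_label)
```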

-Shawn