Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Shawn Steele <Shawn.Steele@microsoft.com> Thu, 19 March 2015 02:12 UTC

From: Shawn Steele <Shawn.Steele@microsoft.com>
To: Andrew Sullivan <ajs@anvilwalrusden.com>, "lucid@ietf.org" <lucid@ietf.org>
Date: Thu, 19 Mar 2015 02:11:56 +0000
Message-ID: <BLUPR03MB1378184CE32E928A3086665582010@BLUPR03MB1378.namprd03.prod.outlook.com>
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com> <55008F97.8040701@ix.netcom.com> <CA+9kkMAcgSA1Ch0B9W1Np0LMn2udegZ=AzU1b26dAi+SDcbGgg@mail.gmail.com> <CY1PR0301MB07310C68F6CFDD46AE22086F82190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150311200941.GV15037@mx1.yitter.info> <CY1PR0301MB0731F4EBE5EB5C3340F7059282190@CY1PR0301MB0731.namprd03.prod.outlook.com> <20150319014018.GI5743@mx1.yitter.info>
In-Reply-To: <20150319014018.GI5743@mx1.yitter.info>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/q1d7ldeDzb98rsOGtXuB2o9aAkw>

> > > IDNA is supposed to be providing unique identifiers.

> > At what level?

> At every level.  DNS names are exact-match.  Each identifier is unique.

For the machine, sure.  But once you throw in font weirdness and the like, they become non-unique (to humans, not to machines) even in ASCII.  To make them more unique one has to impose more rules, like "use a decent font" or "lowercase everything".

> > At the human level, it most certainly does not.

> That's sort of the point, here, though.  But I think your description of how this would work (elided) presents a false dichotomy: either everyone can get this with 100% accuracy or it's not a unique identifier.  By similar reasoning, if you cannot eliminate all crime then you should have no laws.

No.  Even plain NFC or NFKC would be 100% unique to the machine, and IDN is stricter than that.  To humans it's impossible to be 100% unique (unless we went to only ASCII A-Z0-9 and threw out a few of those).  What I'm questioning is: how unique is good enough?  By your analogy, adding a bunch of laws isn't necessarily going to stop crime.  There's a lot of debate about that, and we can see that different locales have different crime rates: some with low crime and few laws, some with high crime and lots of laws, and everything in between.  The challenge is to balance the effect on crime against the overhead of the laws.
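
To make the machine/human split concrete, here is a minimal Python sketch using only the standard `unicodedata` module.  The sample strings and the `nfkc` helper name are illustrative assumptions, not anything taken from IDNA itself: normalization folds canonically equivalent sequences together, but it leaves cross-script lookalikes as distinct strings.

```python
import unicodedata

def nfkc(s: str) -> str:
    """Illustrative helper: apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", s)

# Two encodings of the same letter: precomposed U+00E9 versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# All-Latin label versus one where both "a" are U+0430 CYRILLIC SMALL LETTER A.
latin = "paypal"
mixed = "p\u0430yp\u0430l"

# Normalization makes the first pair identical to the machine...
same_after_nfkc = nfkc(composed) == nfkc(decomposed)

# ...but the second pair stays distinct, even though most fonts
# render the two labels almost identically.
lookalikes_distinct = nfkc(latin) != nfkc(mixed)

print(same_after_nfkc, lookalikes_distinct)
```

So the machine-level uniqueness is real in both cases; it is only the human-level uniqueness that normalization does nothing for.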

> > B)     We want a human safe unique identifier.

> It seems to me that, more mildly, we could just say that we want one as safe as possible, or whose "safety" we understand.  

I agree, with the caveat that we also balance the cost against the benefits/risk.  "As safe as possible" would argue against allowing both 0 and O, or both l and 1, to exist.  Almost nobody's argued that, so I don't think we want "as safe as possible".


> > c.      IMO such an identifier could not also be perfectly linguistic.

> Existing identifiers of all sorts are already not linguistic.  I observe that "anvilwalrusden" is not, to my knowledge, a word in any language.  

We do seem to have some desire to be linguistic.  Otherwise Sharp-S and Greek wouldn't have needed to be touched.  I think the opinions of the computer scientists and of users in marketing departments may differ greatly here.  Many users treat domain names more as a natural linguistic thing than as a mathematical identifier.  Enough to sue over names.

> >     ii.     Eg: map Latin, Greek & Cyrillic as appropriate.

> This might actually not be a bad idea.

If we need more structure, that would be my preference (kinda).  We can't be perfect, but if we mapped most of the more obvious confusables together, then we'd avoid some of this problem.
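
A rough Python sketch of what "mapping the more obvious confusables together" could look like.  The table below is a hypothetical, deliberately tiny sample that I made up for illustration (a real one would be far larger; Unicode UTS #39 publishes confusables data), and the `skeleton` and `collides` names are likewise assumptions.

```python
import unicodedata

# Hypothetical sample table: fold a few Cyrillic and Greek lookalikes
# onto the Latin letter they resemble.  Not a complete or official list.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

def skeleton(label: str) -> str:
    """Normalize, lowercase, then fold known lookalikes together."""
    normalized = unicodedata.normalize("NFKC", label).lower()
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in normalized)

def collides(a: str, b: str) -> bool:
    """True when two different labels share the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)

# Both "a" in the second label are Cyrillic U+0430.
print(collides("paypal", "p\u0430yp\u0430l"))
```

A registry could refuse (or bundle) a new label whose skeleton matches an existing one.  It still isn't perfect: anything missing from the table, such as "rn" versus "m", slips through.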

> > d.      This isn’t IDNA

> This BoF is not only about IDNA.

Which may be part of the problem.  IDNA is "permissive" and caters to linguistic representations of things, which IMO is hard to reconcile with a strict identifier.

-Shawn