[Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Shawn Steele <Shawn.Steele@microsoft.com> Wed, 11 March 2015 21:58 UTC

From: Shawn Steele <Shawn.Steele@microsoft.com>
To: "lucid@ietf.org" <lucid@ietf.org>
Date: Wed, 11 Mar 2015 21:58:26 +0000
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/PX6Bgwy2dSbxohBnqx173S-P2yk>
List-Id: "Locale-free UniCode Identifiers \(LUCID\)" <lucid.ietf.org>

Oops, I seem to have dropped the list.

From: Shawn Steele
Sent: Wednesday, March 11, 2015 14:23
To: 'Andrew Sullivan'
Subject: RE: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]


>   Any character that can be confused for
>   another one can be called confusable, and confusability can be
>   thought of as a spectrum with "visually similar" at one end, and
>   "homoglyphs" at the other.  (We use the term "homoglyph" strictly:
>   code points that normally use the same glyph when rendered.)



It says "there's a spectrum", but then "We use the term "homoglyph" strictly:..."  That reads like an attempt to draw a hard line, though it does say "normally".



> ʻokina



Hmm, I thought I'd seen it rendered differently, but maybe not; sorry if that was a bad example.  Actually, Wikipedia has a couple of examples: http://en.wikipedia.org/wiki/%CA%BBOkina  That’s part of my concern with saying “normally rendered the same”.  They don’t have to be; that’s the art of creating a font.



> I think we're all aware that you can't tell developers what font to use,
> but there is clearly an in-principle difference between "could be
> mitigated with font" and "cannot possibly be mitigated with font".



I suppose some characters are less amenable to mitigation by a font; however, I'm not sure I could easily find a font that mitigates the problem for a large set of characters.  I could perhaps suggest that font designers consider these things when adding characters or making a new font, but (as my okina choice demonstrates) I'm not sure which fonts have which behavior.



Although I think it's fair for an RFC to indicate that some fonts can exacerbate the problem, I'm not sure that it's fair to state that parts of the problem could be "solved" by a font.  For example, sometimes the font choice may be under the control of the attacker.



> > Additionally it continues to treat these newly noticed characters as a
> > special case without considering the many existing problems.



> Where, please?



??  Section 2.2.2 explicitly emphasizes the Arabic Hamza Above case, though it does go on to mention that there are other characters.  However, there are many confusable characters in IDNA.  Perhaps worse than confusable glyphs are confusable concepts, as the discussion of "pronunciation" alludes to.  ʻOkina isn't confusable if you pronounce it ʻokina and pronounce the quote as a left quote.  Yet it is visually confusable.



For example, this focuses on the examples illuminated by a very esoteric Arabic code point.  I’m digressing, but I never see any discussion of 拔 vs 拨 or 暖 vs 暧 (depending on font, YMMV).  However, those can be very confusable to Chinese readers, particularly if the context leads one to expect one character or the other.
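To make the point concrete, here is a minimal Python sketch (an illustration only, not taken from the draft under discussion). It prints the code points behind the two CJK pairs mentioned above and checks whether Unicode NFKC normalization would merge them. The Latin/Cyrillic "a" pair and the helper names `describe` and `same_after_nfkc` are additions made up for this example.

```python
import unicodedata

# The two CJK pairs from the text, plus a Latin/Cyrillic "a" pair
# that is added here purely for illustration.
PAIRS = [("拔", "拨"), ("暖", "暧"), ("a", "\u0430")]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def same_after_nfkc(a: str, b: str) -> bool:
    """True if NFKC normalization maps both strings to the same sequence."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


for left, right in PAIRS:
    # Each pair can look alike on screen, yet the characters are distinct
    # code points; the last column shows whether normalization merges them.
    print(describe(left), "|", describe(right),
          "| merged by NFKC:", same_after_nfkc(left, right))
```

None of these pairs is merged by normalization, which is why similarity of this kind has to be handled (if at all) by some mechanism other than normalization.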



> IDNA is supposed to be providing unique identifiers.  It in fact does, in the
> sense that when a series of U-labels or A-labels are put together they (respectively)
> produce exactly one FQDN that can be looked up.  That's why it's part of the problem.



At what level?  Sure, they’re unique as a binary number sequence.  But clearly they aren’t visually (or even audibly) unique; UTS#46 was created for that.  Nor are they linguistically unique, as any of the numerous …nyms could cause confusion.  Nor are they a unique resource identifier, as multiple names could actually point to the same place.



Can they provide a unique FQDN that maps to a single server?  Sure, but they could do that if we’d just left it at “all of Unicode”.



At the machine level, yes, IDNA provides unique identifiers.
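A minimal sketch of that machine-level uniqueness, assuming Python's built-in "idna" codec is close enough for illustration (it implements the older IDNA 2003 rules, not IDNA 2008). The two labels are made up for the example: one is all Latin, the other starts with Cyrillic U+0430. The helper name `to_a_label` is hypothetical.

```python
def to_a_label(label: str) -> bytes:
    """Convert a single label to its ASCII-compatible (A-label) form."""
    # Built-in codec; IDNA 2003 (RFC 3490), used here only for illustration.
    return label.encode("idna")


latin = "apple"        # all Latin letters
mixed = "\u0430pple"   # starts with U+0430 CYRILLIC SMALL LETTER A

a_latin = to_a_label(latin)
a_mixed = to_a_label(mixed)

# The two labels may look alike on screen, but the encoded forms differ,
# so software comparing or resolving them never confuses the two.
print(a_latin, a_mixed, a_latin != a_mixed)

# The mapping is reversible: the A-label decodes back to the same code points.
print(a_mixed.decode("idna") == mixed)
```

The ambiguity is entirely on the human side of the screen; the encoded labels are simply different byte strings.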



At the human level, it most certainly does not.  For it to be unique at a human level would mean that I’d have to be able to hand a written note to another human and they’d be able to key that in with 100% accuracy.  This document notes the problems with l and 1, and obviously 0 and O, let alone any number of other problems.



I created http://L3-G0.com.  Despite that, my own mnemonic has led me to type l3-go on more than one occasion.  It’s also led me to give the name to others who, I’m fairly sure, then miskeyed it.



Despite all that, I believe that IDN suffices for the purposes of allowing users to find web servers & machines that they are interested in.



However, IDN does not suffice to provide an unambiguous (to humans) identifier.  Addressing the class of characters being discussed here provides no statistically meaningful reduction of the potential ambiguity of IDN, particularly against a malicious attacker.  It also provides no statistically interesting improvement in the ability of well-intentioned users to find the resource they’re looking for, as presumably the users interested in the variations discussed here will be fully capable of using the correct variation.



So I don’t know what our goals are here.  I can envision several possibilities:



A)     A machine identifier suitable for uniquely identifying resources.

a.      IMO, this is already solved.  No machine is going to confuse any of these, they’re just numbers.

b.      Alternatively machines could use actual numbers (or GUIDs, or a specific bit of ASCII, or any other of a number of ways of generating true unique identifiers).

B)     We want a human safe unique identifier.

a.      The problem described here has no meaningful impact on the scope of the problem.

b.      I’m not sure this is achievable.  It’ll be hard for sure.

c.      IMO such an identifier could not also be perfectly linguistic.

        i.     We’d have to map things that could be confused together in a way that would ensure there was no ambiguity in the resulting identifier.

       ii.     Eg: map Latin, Greek & Cyrillic as appropriate.

      iii.     Clearly such an ID couldn’t round trip to a “pretty” human form, though we could ensure a pretty human form always mapped to the same ID.

d.      This isn’t IDNA.

C)     We want to provide “safe” browsing using IDNA.

a.      Again, the issues being discussed here have no meaningful impact on the scope of this problem.

b.      That would need to consider some of the far grayer areas than are being discussed here.

c.      I think other technologies are appropriate here, eg: spam filters, blacklists, safety plugins, certificates, etc.
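The mapping idea in option B) c. can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `FOLD` table is hypothetical and deliberately tiny (a real table would be far larger and much harder to agree on), and the function name `skeleton` is borrowed loosely from the similar concept in UTS #39, not from IDNA.

```python
# Hypothetical, deliberately tiny confusable table: each key is folded
# into the character it is most easily mistaken for.
FOLD = {
    "1": "l",        # digit one        -> Latin small l
    "0": "o",        # digit zero       -> Latin small o
    "\u0430": "a",   # Cyrillic small a -> Latin small a
    "\u043e": "o",   # Cyrillic small o -> Latin small o
    "\u03bf": "o",   # Greek omicron    -> Latin small o
}


def skeleton(label: str) -> str:
    """Map a 'pretty' label to a canonical ID by folding confusables."""
    lowered = label.lower()
    return "".join(FOLD.get(ch, ch) for ch in lowered)


# Every pretty form maps to exactly one ID ...
print(skeleton("L3-G0"), skeleton("l3-go"))

# ... but the ID cannot round trip: it no longer records whether the
# user originally wrote a zero or an "o", Latin or Cyrillic.
print(skeleton("\u0430pple") == skeleton("apple"))
```

With this table, `L3-G0` and `l3-go` collapse to the same ID, which is exactly the property item iii. describes: many pretty forms, one identifier, and no way back.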



Clearly some people think that the class of problem discussed here is real, and presumably that it is worth the effort to try to solve it; in other words, that solving it would provide some statistical, measurable improvement in the security of IDNA.  However, I’m not able to follow the logic.



Perhaps it would help if the document could quantify the expected reduction in confusable characters.  Eg:  “Addressing this issue will eliminate 90% of the confusion among IDN labels”.



-Shawn