Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Andrew Sullivan <ajs@anvilwalrusden.com> Thu, 19 March 2015 01:40 UTC

Date: Wed, 18 Mar 2015 21:40:19 -0400
From: Andrew Sullivan <ajs@anvilwalrusden.com>
To: lucid@ietf.org

On Wed, Mar 11, 2015 at 09:58:26PM +0000, Shawn Steele wrote:
> 
> 
> It says "there's a spectrum", but then "We use the term "homoglyph" strictly:..."  That seems like an attempt at a hard line, though it does say "normally"
> 

As I said in an exchange off-list, the point here is to situate
"homoglyph" in a spectrum of different cases, all of which are more or
less confusable.  rn and m are plainly not homoglyphs, because if you
set them in even moderately large type with serifs you can immediately
see the difference.  Latin A and Cyrillic A are just homoglyphs: no
reasonable font even tries to differentiate them, partly because as a
matter of history they evolved from the same form and therefore it's
not surprising that they look the same now.  They're different
abstract characters because they are in different writing systems, but
they look identical.
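
To make the distinction concrete, here is a minimal sketch using only
Python's standard unicodedata module.  The helper names are invented
for illustration, and the code points shown are the capital letters
(U+0041 and U+0410), which is an assumption about which "A" is meant.

```python
import unicodedata

latin_a = "\u0041"     # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

def same_after_nfkc(a, b):
    """True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(describe(latin_a))
print(describe(cyrillic_a))

# Normalization does not merge the homoglyph pair: the two letters stay
# distinct code points even though they are drawn identically.
print(same_after_nfkc(latin_a, cyrillic_a))

# "rn" and "m" are distinct as well, but that pair is merely confusable
# in some fonts, not homoglyphic.
print(same_after_nfkc("rn", "m"))
```

Nothing in the code can tell the two pairs apart: whether a pair is a
homoglyph or only confusable is a fact about rendering, not about the
code points.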

> > ʻokina
> 
> 
> 
> Hmm, I'd thought I'd seen it different, but maybe not, sorry if that was a bad example.

It wasn't a bad example.  It was a good example, but of something
different than what you intended. :) The point is precisely that these
fine details are going to matter in this discussion, so we need to
attend to them.
 
> 
> Although I think it's fair for an RFC to indicate that some fonts can exacerbate the problem, I'm not sure that it's fair to state that parts of the problem could be "solved" by a font.  For example, sometimes the font choice may be under the control of the attacker.
> 

I think the draft says "mitigate", not "solve".
 
> ?? 2.2.2 explicitly emphasizes the Arabic Hamza Above case, though it does go on to mention that there are other characters.

There's a _whole appendix_ that attempts to illustrate the range of
the issue, and the text goes out of its way to point the reader at
that and to avoid talking about specific characters except to explain
the history, because we're trying to attend to the general problem.
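
For the historical case only, a rough sketch of why it is called
"non-normalizable".  It assumes the characters in question are U+08A1
(BEH WITH HAMZA ABOVE) and the sequence U+0628 U+0654, and it uses
U+0623 (ALEF WITH HAMZA ABOVE) as a contrasting character that does
compose.  It relies only on Python's standard unicodedata module; the
constant and function names are illustrative.

```python
import unicodedata

ALEF = "\u0627"             # ARABIC LETTER ALEF
BEH = "\u0628"              # ARABIC LETTER BEH
HAMZA_ABOVE = "\u0654"      # ARABIC HAMZA ABOVE (combining mark)
ALEF_WITH_HAMZA = "\u0623"  # ARABIC LETTER ALEF WITH HAMZA ABOVE
BEH_WITH_HAMZA = "\u08A1"   # ARABIC LETTER BEH WITH HAMZA ABOVE

def nfc(s):
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# ALEF + HAMZA ABOVE composes to the single precomposed code point, so
# normalization makes the two spellings identical.
alef_composes = nfc(ALEF + HAMZA_ABOVE) == ALEF_WITH_HAMZA

# BEH + HAMZA ABOVE does not compose to U+08A1: the two spellings stay
# different strings after normalization.
beh_composes = nfc(BEH + HAMZA_ABOVE) == BEH_WITH_HAMZA

print(alef_composes, beh_composes)
```

A rule that relies on normalization alone therefore treats the first
pair as one identifier and the second pair as two.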

> For example, this focuses on the examples illuminated by a very esoteric Arabic code point.  I’m digressing, but I never see any discussion of 拔 vs 拨 or  暖 vs 暧 (depending on font, YMMV)

In fact there is a (different) Han character in the appendix.

> > IDNA is supposed to be providing unique identifiers.

> At what level?

At every level.  DNS names are exact-match.  Each identifier is
unique.

In the DNS, color.tld and colour.tld are not the same identifier.
Conceptually they are confusing for Canadians (and spelled wrong for
USians and English respectively), but they're different.  We want that
to be as reliable as possible, and we seem to have stumbled on a class
of cases for which we cannot make derivative-property rules.  I think
that might be a problem, which is why I wanted to get a discussion
going in order to understand.
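
As a sketch of what "exact-match" means here, assuming names that are
already in their ASCII form: the DNS compares label by label and
ignores nothing except ASCII case.  The function name is made up for
illustration, and this is not how any particular resolver is written.

```python
def dns_names_equal(a, b):
    """Compare two ASCII domain names label by label, ignoring ASCII case."""
    def fold(name):
        # encode() raises UnicodeEncodeError for non-ASCII input, which
        # keeps this sketch limited to the ASCII form of a name.
        raw = name.encode("ascii")
        # bytes.lower() changes only the ASCII letters A-Z.
        return raw.rstrip(b".").lower().split(b".")
    return fold(a) == fold(b)

print(dns_names_equal("Color.TLD", "color.tld"))
print(dns_names_equal("color.tld", "colour.tld"))
```

Spelling variants, however closely related for a human reader, are
simply different byte strings to this comparison.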

> Can they provide a unique FQDN that maps to a single server?  Sure, but they could do that if we’d just left it at “all of Unicode”.
> 

No, they could not.  Are you quite sure you understand how the
matching rules work?  It was _never_ "all of Unicode", because that
doesn't even catch the normalization cases.
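
A small illustration of a normalization case, using Python's standard
unicodedata module (the example word is arbitrary): the same visible
string can be spelled with two different code point sequences, and a
raw comparison treats them as different.

```python
import unicodedata

precomposed = "caf\u00e9"  # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # ends in "e" + U+0301 COMBINING ACUTE ACCENT

# Compared code point by code point, the two spellings differ.
raw_equal = precomposed == decomposed

# After NFC normalization they are the same string.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfc_equal)
```

Allowing arbitrary code point sequences would hand out both spellings
as separate identifiers.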

> At the human level, it most certainly does not.

That's sort of the point, here, though.  But I think your description
of how this would work (elided) presents a false dichotomy: either
everyone can get this with 100% accuracy or it's not a unique
identifier.  By similar reasoning, if you cannot eliminate all crime
then you should have no laws.

> B)     We want a human safe unique identifier.

It seems to me that, more mildly, we could just say that we want one
as safe as possible, or whose "safety" we understand.  Remember, this
is a BoF and the point of the I-D was to get people to understand and
articulate problems.

> b.      I’m not sure this is achievable.  It’ll be hard for sure.

Yes.

> c.      IMO such an identifier could not also be perfectly linguistic.

Existing identifiers of all sorts are already not linguistic.  I
observe that "anvilwalrusden" is not, to my knowledge, a word in any
language.  Certainly "dyn" (my employer) is not -- the name is a
constant source of fun around the office, since virtually everybody
pronounces it "wrong" (i.e. correctly according to the
locally-prevailing norms of pronunciation).

>     ii.     Eg: map Latin, Greek & Cyrillic as appropriate.
> 

This might actually not be a bad idea.

> d.      This isn’t IDNA

This BoF is not only about IDNA.

> 
> Clearly some people think that the class of the problem here is a real problem, and presumably that it is worth the effort to attempt to solve the perceived problem.

I would be satisfied if, in Dallas, we came away with a clear
delineation of the problem, or even a clear idea of how we might
delineate the problem.

Best regards,

A

-- 
Andrew Sullivan
ajs@anvilwalrusden.com