Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Andrew Sullivan <ajs@anvilwalrusden.com> Wed, 11 March 2015 20:09 UTC

On Wed, Mar 11, 2015 at 07:43:12PM +0000, Shawn Steele wrote:
> This document makes a hard line between “homoglyphs” and “visually similar”.

No, it does not.  It says right there in section 2.2.1:

       Any character that can be confused for
   another one can be called confusable, and confusability can be
   thought of as a spectrum with "visually similar" at one end, and
   "homoglyphs" at the other.  (We use the term "homoglyph" strictly:
   code points that normally use the same glyph when rendered.)

> The ʻokina is a decent case where it looks a lot like another character, and often fonts may even use the same glyph, however sometimes font designers choose to make a distinction.  It’s nearly impossible to tell a developer to “use the right font”.
> 

It's funny that you should pick ʻokina, because you're sort of making
the point the draft is after.  In any properly-designed font, U+02BB
will be rendered as though it is a single opening curly-quote, in the
way English (not American) quotation marks were historically typeset.
But U+02BC (ʼ) is a letter, and its name is actually "MODIFIER LETTER
APOSTROPHE"; in any decent font it will look like the single closing
curly-quote, which is what was historically used in English for the
apostrophe as well.  This is not U+0027 ('), of course.  So all three can be
distinguished in a proper font.  There's moreover an argument to be
made that U+0027 is closer to U+02BC than it is to U+02BB, though of
course the context might make a difference.  (Also, of course, U+0027
has other different properties.)
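
For anyone who wants to check the three code points themselves, here is
a minimal sketch using Python's standard unicodedata module.  The helper
name "describe" and the variable names are only for illustration; the
names and categories come from whatever Unicode database ships with the
interpreter.

```python
import unicodedata

# The three code points discussed above: the okina, the modifier letter
# apostrophe, and the plain ASCII apostrophe.
candidates = ["\u02bb", "\u02bc", "\u0027"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


rows = [describe(ch) for ch in candidates]
for row in rows:
    print(*row)
```

The first two come back with general category Lm (modifier letter) and
U+0027 with Po (other punctuation), which is one of the "other different
properties" in question.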

None of this is like the case of, say, the precomposed e-with-acute
and the combining sequence, which should never even in principle show
a difference.  That particular case is solved by NFC, but it is
clearly different from the cases where a font should or could, in
principle, make them distinguishable.
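
To make the contrast concrete, here is a small sketch, again assuming
Python's standard unicodedata module (the variable names are mine).  It
shows NFC collapsing the two spellings of e-with-acute while leaving the
okina and the two apostrophes as distinct code points.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Before normalization the two strings compare unequal.
same_before = precomposed == combining

# After NFC both spellings become the same single code point.
same_after = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combining)
)

# NFC does nothing for the look-alike cases: each code point is left
# unchanged, and they remain distinct from one another.
okina_unchanged = unicodedata.normalize("NFC", "\u02bb") == "\u02bb"
apostrophes_distinct = (
    unicodedata.normalize("NFC", "\u02bc")
    != unicodedata.normalize("NFC", "\u0027")
)

print(same_before, same_after, okina_unchanged, apostrophes_distinct)
```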

I think we're all aware that you can't tell developers what font to
use, but there is clearly an in-principle difference between "could be
mitigated with font" and "cannot possibly be mitigated with font".
And as the draft notes, this is all a spectrum with many small
gradations.  The purpose of the discussion is to make distinctions
apparent where we can, so that we can talk sensibly about them.  So
trying to say they're all the same doesn't help make those
distinctions clear.

> Additionally it continues to treat these newly noticed characters as a special case without considering the many existing problems.
>

Where, please?

> I’m also confused by the document’s attention to the need for unique identifiers at the beginning, but then looks at the existing IDNA problem.  I don’t consider IDNA able to provide “secure” (meaning unconfusable) identifiers.
> 

IDNA is supposed to be providing unique identifiers.  It in fact does,
in the sense that when a series of U-labels or A-labels are put
together they (respectively) produce exactly one FQDN that can be
looked up.  That's why it's part of the problem.
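
A rough illustration of that, with a loud caveat: Python's built-in
"idna" codec implements IDNA2003 (RFC 3490), not IDNA2008, so this is
only a sketch of the U-label/A-label correspondence and not a statement
about what any registry would permit.  The domain names are hypothetical.

```python
# Caveat: the stdlib "idna" codec follows IDNA2003, not IDNA2008.
u_name = "b\u00fccher.example"       # U-label form (hypothetical name)
a_name = u_name.encode("idna")       # A-label form, as ASCII bytes
round_trip = a_name.decode("idna")   # back to the U-label form

# Two labels that differ only in okina versus modifier letter apostrophe
# are still two different names as far as lookup is concerned.
okina_name = "hawai\u02bbi.example".encode("idna")
apostrophe_name = "hawai\u02bci.example".encode("idna")
names_differ = okina_name != apostrophe_name

print(a_name, round_trip, names_differ)
```

Each name is unique for lookup purposes even though a person looking at
the two "hawai" labels may not be able to tell them apart.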

> 
> IMO the reason to solve the problem with this character

Which character, exactly?

Best regards,

A
-- 
Andrew Sullivan
ajs@anvilwalrusden.com