[Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Andrew Sullivan <ajs@anvilwalrusden.com> Wed, 11 March 2015 01:33 UTC

Date: Tue, 10 Mar 2015 21:33:47 -0400
From: Andrew Sullivan <ajs@anvilwalrusden.com>
To: lucid@ietf.org
Message-ID: <20150311013300.GC12479@dyn.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.5.23 (2014-03-12)
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/XyU6AvtoALoz33jLLCZCBqkEgCc>
Subject: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
X-BeenThere: lucid@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Locale-free UniCode Identifiers \(LUCID\)" <lucid.ietf.org>

Dear colleagues,

I hereby forward a message sent to a few people (including Asmus and
me) responding to the draft Asmus and I submitted yesterday.

I think this is a valuable response, though I am not sure I agree with
the approach it's taking.  Nevertheless, if we are to have any hope of
useful discussion in Dallas, it'd be good to have a look at the issues
in advance.

Best regards,

A

----- Forwarded message from Mark Davis <mark@macchiato.com> -----

Date: Tue, 10 Mar 2015 16:53:33 +0100
From: Mark Davis <mark@macchiato.com>
To: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
Cc: Roozbeh Pournader <roozbeh@unicode.org>, Ken Whistler <kenwhistler@att.net>, Lisa
	Moore <lisam@us.ibm.com>, Michel Suignard <michel@suignard.com>, Markus
	Scherer <markus.icu@gmail.com>, Peter Constable <petercon@microsoft.com>,
	asullivan@dyn.com
Subject: Re: Non-normalizable diacritics - new property

+ Andrew

> In the meantime, the following has been released in preparation to the BOF
> at the IETF meeting in Dallas.
>
> https://datatracker.ietf.org/doc/draft-sullivan-lucid-prob-stmt/


A well-written report, and I think it will help a great deal in resolving
the issues.

However, I'd strongly suggest a couple of changes to be more precise, and
thus avoid confusion (no pun intended). Although these may seem overly
formal, it is vital that the text be unambiguous, because the subject is so
very tricky, and it is thus so easy for people to be arguing based on
different interpretations of terms.

I will try to suggest some minimal changes, although it might be even
better with some more extensive rewording.

> I3 depends on the assumption that strings that will be used in
>   identifiers will not have any ambiguous matching to other strings.

When discussing strings, the term "ambiguous matching" is itself quite
ambiguous. I can have ambiguous matching with SJIS, for example, because
0x61 can be an 'a' or be part of another character; similarly there are
sequences of bytes ..XY.. where XY is a character or X is the end of one
character, and Y is the start of another. There are other forms of
ambiguity as well.
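
The byte-level ambiguity described above can be reproduced with a small
sketch. It assumes Python's built-in "shift_jis" codec stands in for SJIS;
the helper names are hypothetical and only for illustration. It searches for
double-byte characters whose second byte is 0x61, the same value that encodes
'a' on its own, and then shows that a naive byte search reports an 'a' that
is not there as a character.

```python
# Sketch only: uses Python's "shift_jis" codec as a stand-in for SJIS.
# Helper names are hypothetical, chosen for this illustration.

def chars_with_trail_byte(trail=0x61, encoding="shift_jis"):
    """Return characters encoded as exactly two bytes ending in `trail`."""
    found = []
    for lead in range(0x81, 0x100):
        raw = bytes([lead, trail])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not a valid sequence in this codec
        # Keep only cases where both bytes form one character that
        # encodes back to the same two bytes.
        if len(text) == 1 and text.encode(encoding) == raw:
            found.append(text)
    return found


def naive_byte_search_hits(text, needle=b"a", encoding="shift_jis"):
    """True if a byte-level search finds `needle` in the encoded text."""
    return needle in text.encode(encoding)


candidates = chars_with_trail_byte()
sample = candidates[0]
print(len(candidates), "characters have 0x61 as their second byte")
print("'a' present as a character:", "a" in sample)
print("b'a' present as a byte:    ", naive_byte_search_hits(sample))
```

This is a different kind of "ambiguous matching" from the visual kind the
draft is concerned with, which is why the qualifier matters.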

To eliminate this ambiguity, change each of the 8 other cases of
"ambiguous/ity" to be prefixed by "visually".

=>   identifiers will not be visually ambiguous with other strings used as
identifiers.
...

Look for other cases that could profit from that, such as
"indistinguishable" and "matches".

See the later:

> Worse, identifiers, by their very nature, are things that must
>   provide reliable exact matches.  The whole point of an identifier is
>   that it provides a reliable way of uniquely naming the thing to be
>   identified.

In this section, to be precise, one would need to use "visual matches".
There are so, so many other ways strings can "match", like "have the same
code points".

However, the second sentence is still problematic. The characters in two
different strings that have similar appearances *are* different characters,
so the identifiers are "unique" in that sense. I guess what you meant to
say is something like:

=> it provides a reliable way of naming the thing to be identified, so as
to have a unique visual appearance.

Now, I think adding "visual" makes it clear that this paragraph needs some
work. As it stands, it would be too strong: it would call for eliminating
all homographs from all identifiers (e.g., disallowing "top" written in
Cyrillic characters).
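
A minimal sketch of the distinction between "matching" by code points and
matching by appearance, assuming Python's standard unicodedata module. The
function names are hypothetical, and upper case is used here only because the
Latin and Cyrillic letters often look closest in that case; how alike they
actually look depends on the font.

```python
import unicodedata

LATIN = "TOP"
# CYRILLIC CAPITAL LETTER TE, O, ER: often rendered much like Latin T, O, P.
CYRILLIC = "\u0422\u041e\u0420"


def same_code_points(a, b):
    """Match in the 'have the same code points' sense."""
    return a == b


def same_after_nfc(a, b):
    """Match in the canonical-equivalence sense (compare after NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def scripts_by_name(s):
    """Rough script guess: first word of each character's Unicode name."""
    return [unicodedata.name(ch).split()[0] for ch in s]


print(same_code_points(LATIN, CYRILLIC))   # code point comparison
print(same_after_nfc(LATIN, CYRILLIC))     # normalization does not unify them
print(scripts_by_name(LATIN), scripts_by_name(CYRILLIC))
```

Neither comparison says anything about appearance, so the two strings are
distinct identifiers in every sense except the visual one.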

Later, there is a shift to confusability, but the text never relates this
paragraph directly to "confusability", which is what is elaborated upon.

(In general, many different phrases are (apparently) used to mean the same
thing: matching, indistinguishable, etc. While use of different terms makes
the text flow better, it has the big disadvantage that the reader never
knows whether phrase1 is meant to have a broader or narrower scope than
phrase2, or is meant to be identical. This would be a more extensive
change, however, to use more consistent defined language. Alternatively, at
the top you could say that when you use the terms "matches",
"indistinguishable", "ambiguous", etc. (you'll have to go through and find
them all) that what is meant is in terms of visual appearance.)

> (We use the term "homoglyph" strictly: code points that normally use the
> same glyph when rendered.)

This needs a bit of work. First step would be:

Strings S1 and S2 are strict homoglyphs when they are different, yet
normally use the same glyph sequence when rendered.

Second would be to clarify "normally use". Does that mean "in the fonts
that most people use on most platforms"? There are often more significant
differences in serif fonts than non-serif, for example. Does that require
that the same font is used? There is huge variation among glyphs in fonts: ţ
and ț may look identical in font1, and may be discernibly different
when both are in font2, but ţ in font1 may also look identical to ț in
font2. This also gets tricky because in many systems there is font
fallback. If a character is not in font1, then it might be displayed in a
fallback font2.

Third would be to clarify "same glyph sequence". Does this mean pixel by
pixel? At all resolutions, or just some? Pixel by pixel is extremely strict.

One could try to bring in "intent", but that is very dicey. The intent
on the Unicode encoding side may not be followed precisely by all or even
most font vendors. So it is of only theoretical value to point to intent.

The exact degree of strictness matters a great deal, because under a
sufficiently strict interpretation <ARABIC LETTER BEH (U+0628) + ARABIC
HAMZA ABOVE (U+0654)> is *not* a strict homoglyph for U+08A1, ARABIC LETTER
BEH WITH HAMZA ABOVE. That is, most current fonts do not display them as
pixel-for-pixel identical.
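
For reference, a short sketch of why normalization does not settle this
case, assuming Python's standard unicodedata module (with Unicode data recent
enough to include U+08A1). The constant and function names are hypothetical.
U+08A1 has no canonical decomposition, so it never normalizes to or from the
BEH + HAMZA ABOVE sequence; ALEF WITH HAMZA ABOVE (U+0623), which does have
one, is shown for contrast.

```python
import unicodedata

BEH_HAMZA_SEQUENCE = "\u0628\u0654"     # ARABIC LETTER BEH + HAMZA ABOVE
BEH_HAMZA_PRECOMPOSED = "\u08a1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
ALEF_HAMZA_SEQUENCE = "\u0627\u0654"    # ARABIC LETTER ALEF + HAMZA ABOVE
ALEF_HAMZA_PRECOMPOSED = "\u0623"       # ARABIC LETTER ALEF WITH HAMZA ABOVE


def canonically_equivalent(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The BEH pair stays distinct under normalization.
print(canonically_equivalent(BEH_HAMZA_SEQUENCE, BEH_HAMZA_PRECOMPOSED))
# The ALEF pair is unified by normalization.
print(canonically_equivalent(ALEF_HAMZA_SEQUENCE, ALEF_HAMZA_PRECOMPOSED))
```

Whether the BEH pair counts as "the same glyph sequence" therefore has to be
decided by the definition in the text, since the code points give no answer.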

I know well that this is a continuum, but since you are focusing on strict
homoglyphs, you have to be clearer about what you mean by that term and
where you draw the line. That also carries over into related terms elsewhere
in the document like "the same glyph"; does that mean in all fonts (or
some?/most?/common?), pixel for pixel or not, at all sizes or just
body-text sizes, etc.?

You might think about providing a definition for "the same glyph sequence",
then using that in the definition for homoglyph, and elsewhere in the text.

> Mitigation may be as simple as using a font
>   designed to distinguish among different characters.

The text should mention that in practice this is extremely difficult to
enforce; it only works in closed systems, those that have complete control
over the fonts used to display the kinds of identifiers in question.

----- End forwarded message -----

-- 
Andrew Sullivan
ajs@anvilwalrusden.com