[Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
Andrew Sullivan <ajs@anvilwalrusden.com> Wed, 11 March 2015 01:33 UTC
Date: Tue, 10 Mar 2015 21:33:47 -0400
From: Andrew Sullivan <ajs@anvilwalrusden.com>
To: lucid@ietf.org
Message-ID: <20150311013300.GC12479@dyn.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/XyU6AvtoALoz33jLLCZCBqkEgCc>
Subject: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
Dear colleagues,

I hereby forward a message sent to a few people (including Asmus and me) responding to the draft Asmus and I submitted yesterday. I think this is a valuable response, though I am not sure I agree with the approach it's taking. Nevertheless, if we are to have any hope of useful discussion in Dallas, it'd be good to have a look at the issues in advance.

Best regards,

A

----- Forwarded message from Mark Davis <mark@macchiato.com> -----

Date: Tue, 10 Mar 2015 16:53:33 +0100
From: Mark Davis <mark@macchiato.com>
To: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
Cc: Roozbeh Pournader <roozbeh@unicode.org>, Ken Whistler <kenwhistler@att.net>, Lisa Moore <lisam@us.ibm.com>, Michel Suignard <michel@suignard.com>, Markus Scherer <markus.icu@gmail.com>, Peter Constable <petercon@microsoft.com>, asullivan@dyn.com
Subject: Re: Non-normalizable diacritics - new property

+ Andrew

> In the meantime, the following has been released in preparation to the BOF at the IETF meeting in Dallas.
> https://datatracker.ietf.org/doc/draft-sullivan-lucid-prob-stmt/

A well-written report, and I think it will help a great deal in resolving the issues. However, I'd strongly suggest a couple of changes to be more precise, and thus avoid confusion (no pun intended). Although these may seem overly formal, it is vital that the text be unambiguous, because the subject is so very tricky, and it is thus so easy for people to be arguing based on different interpretations of terms. I will try to suggest some minimal changes, although it might be even better with some more extensive rewording.

> I3 depends on the assumption that strings that will be used in
> identifiers will not have any ambiguous matching to other strings.

When discussing strings, the term "ambiguous matching" is itself quite ambiguous.
I can have ambiguous matching with SJIS, for example, because 0x61 can be an 'a' or be part of another character; similarly, there are sequences of bytes ..XY.. where XY is a character, or X is the end of one character and Y is the start of another. There are other forms of ambiguity as well.

To eliminate this ambiguity, change each of the 8 other cases of "ambiguous/ity" to be prefixed by "visually".

=> identifiers will not be visually ambiguous with other strings used as identifiers.
...

Look for other cases that could profit by that, such as "indistinguishable" and "matches". See the later:

> Worse, identifiers, by their very nature, are things that must
> provide reliable exact matches. The whole point of an identifier is
> that it provides a reliable way of uniquely naming the thing to be
> identified.

In this section, to be precise, one would need to use "visual matches". There are so, so many other ways strings can "match", like "have the same code points".

However, the second sentence is still problematic. The characters in two different strings that have similar appearances *are* different characters, so the identifiers are "unique" in that sense. I guess what you meant to say is something like:

=> it provides a reliable way of naming the thing to be identified, so as to have a unique visual appearance.

Now, I think adding "visual" makes it clear that this paragraph needs some work. As it stands, it would be too strong: it would call for eliminating all homographs from all identifiers (e.g., disallowing "top" written in Cyrillic characters). Later, there is a shift to confusability, but the text never relates that directly to "confusability", which is what is elaborated upon.

(In general, many different phrases are (apparently) used to mean the same thing: matching, indistinguishable, etc.
While use of different terms makes the text flow better, it has the big disadvantage that the reader never knows whether phrase1 is meant to have a broader or narrower scope than phrase2, or is meant to be identical. Using more consistently defined language would be a more extensive change, however. Alternatively, at the top you could say that when you use the terms "matches", "indistinguishable", "ambiguous", etc. (you'll have to go through and find them all), what is meant is in terms of visual appearance.)

> (We use the term "homoglyph" strictly: code points that normally use the same glyph when rendered.)

This needs a bit of work.

First step would be: Strings S1 and S2 are strict homoglyphs when they are different, yet normally use the same glyph sequence when rendered.

Second would be to clarify "normally use". Does that mean "in the fonts that most people use on most platforms"? There are often more significant differences in serif fonts than non-serif, for example. Does that require that the same font is used? There is huge variation among glyphs in fonts: ţ and ț may look identical in font1, and may be discernibly different when both are in font2, but ţ in font1 may also look identical to ț in font2. This also gets tricky because in many systems there is font fallback. If a character is not in font1, then it might be displayed in a fallback font2.

Third would be to clarify "same glyph sequence". Does this mean pixel by pixel? At all resolutions, or just some? Pixel by pixel is extremely strict. One could try to bring in "intent", but that is very dicey. The intent from the Unicode encoding side may not be followed precisely by all or even most font vendors. So it is of only theoretical value to point to intent.
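To illustrate the gap between a "have the same code points" match and a visual match, here is a minimal Python sketch. It assumes only the standard unicodedata module; the helper name same_code_points and the sample pairs are hypothetical choices for illustration. Whether a given pair actually looks alike depends on the font, as noted above.

```python
import unicodedata


def same_code_points(a: str, b: str, form: str = "NFC") -> bool:
    """Return True when a and b are identical after Unicode normalization.

    This is a code-point match only; it says nothing about appearance.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Hypothetical sample pairs: alike in many fonts, yet encoded differently.
PAIRS = [
    ("top", "\u0442\u043e\u0440"),  # Latin "top" vs. Cyrillic te, o, er
    ("\u0163", "\u021b"),           # t with cedilla vs. t with comma below
]

for left, right in PAIRS:
    left_cps = " ".join(f"U+{ord(c):04X}" for c in left)
    right_cps = " ".join(f"U+{ord(c):04X}" for c in right)
    for form in ("NFC", "NFKC"):
        # Normalization does not unify either pair, so the comparison
        # reports a mismatch even where a reader might see no difference.
        print(form, left_cps, "vs", right_cps, same_code_points(left, right, form))
```

Normalization only unifies canonically (or compatibly) equivalent sequences, so any test for visual matching would need a separate, explicitly defined notion of "same glyph sequence".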
The exact degree of strictness matters a great deal, because under a sufficiently strict interpretation <ARABIC LETTER BEH (U+0628) + ARABIC HAMZA ABOVE (U+0654)> is *not* a strict homograph for U+08A1, ARABIC LETTER BEH WITH HAMZA ABOVE. That is, most current fonts do not display them as pixel-for-pixel identical.

I know well that this is a continuum, but since you are focusing on strict homoglyphs, you have to be clearer about what you mean by that term and where you draw the line.

That also carries over into related terms elsewhere in the document like "the same glyph": does that mean in all fonts (or some? most? common?), pixel for pixel or not, at all sizes or just body-text sizes, etc.? You might think about providing a definition for "the same glyph sequence", then using that in the definition for homograph, and elsewhere in the text.

> Mitigation may be as simple as using a font designed to distinguish among different characters.

Should mention that in practice this is extremely difficult to enforce; it only works in closed systems, those that have complete control over the fonts used for display of the kinds of identifiers in question.

----- End forwarded message -----

-- 
Andrew Sullivan
ajs@anvilwalrusden.com