Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Ted Hardie <ted.ietf@gmail.com> Wed, 11 March 2015 17:58 UTC

In-Reply-To: <20150311013300.GC12479@dyn.com>
References: <20150311013300.GC12479@dyn.com>
Date: Wed, 11 Mar 2015 10:58:15 -0700
Message-ID: <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com>
From: Ted Hardie <ted.ietf@gmail.com>
To: Andrew Sullivan <ajs@anvilwalrusden.com>
Content-Type: multipart/alternative; boundary="089e0103de4eeecef6051107024d"
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/PObGbYv_2MBxqZq61SEd7HI1pWM>
Cc: lucid@ietf.org
Subject: Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Hi Andrew,

So, I'm far from an expert on this, but I have some concerns about the
proposal to use "visually" here in the way Mark proposes.  If I understand
things correctly, this problem will also occur in any system that reads a
sequence of characters aloud.  That is, an audio system has no way to
distinguish between U+08A1 and the combination of U+0628 and U+0654.  The
current draft talks about rendering the glyph, and I think that is closer
to correct: the problem occurs whether the rendering is audio or graphical.
In the audio forms, the contrast between running text, which provides
context, and independent identifiers, which lack it, also seems similar,
at least to me.
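
As a rough illustration of why rendering is the only place these two forms
meet, here is a minimal Python sketch (my own, not taken from the draft)
that uses the standard unicodedata module.  The helper name
distinct_under_all_forms is made up for this example.  It checks that
U+08A1 and the sequence U+0628 U+0654 stay different under all four
normalization forms, and contrasts that with ALEF plus HAMZA ABOVE, which
does compose under NFC.

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE (a single code point)
PRECOMPOSED = "\u08A1"
# U+0628 ARABIC LETTER BEH followed by U+0654 ARABIC HAMZA ABOVE
SEQUENCE = "\u0628\u0654"


def distinct_under_all_forms(a: str, b: str) -> bool:
    """Return True if a and b remain different under NFC, NFD, NFKC and NFKD.

    Hypothetical helper for illustration only.
    """
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(
        unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
        for form in forms
    )


if __name__ == "__main__":
    # No normalization form maps one of these onto the other.
    print(distinct_under_all_forms(PRECOMPOSED, SEQUENCE))
    # Contrast: ALEF (U+0627) + HAMZA ABOVE (U+0654) composes to U+0623.
    print(unicodedata.normalize("NFC", "\u0627\u0654") == "\u0623")
```

A comparison on code points, normalized or not, therefore treats the two
as different identifiers; any sameness shows up only once they are
rendered.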

I also don't believe that Mark's point about parsing of byte sequences as
XY when X and Y are meant is quite correct; in string contexts that would be
a parsing error.  It might not be easily distinguished on the wire, but in
protocol slots this seems to be an unrelated issue at best.  I may not be
familiar, though, with contexts in which this may occur.
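
To make the byte-level case concrete, here is a small Python sketch under
my own assumptions (the helper names are invented for this example, and it
relies on Python's built-in shift_jis codec).  It looks for a katakana
letter whose Shift_JIS encoding ends in 0x61, then decodes the same bytes
from the start of the string and from a misaligned offset.

```python
def find_katakana_with_trail_byte(trail: int) -> str:
    """Return a katakana letter whose Shift_JIS encoding ends in `trail`.

    Hypothetical helper: scans U+30A1..U+30F6 and returns the first match.
    """
    for code_point in range(0x30A1, 0x30F7):
        ch = chr(code_point)
        if ch.encode("shift_jis")[-1] == trail:
            return ch
    raise ValueError("no katakana letter with that trail byte")


def demo():
    """Decode the same bytes aligned and misaligned; return both results."""
    ch = find_katakana_with_trail_byte(0x61)
    encoded = ch.encode("shift_jis")           # two bytes, the second is 0x61
    aligned = encoded.decode("shift_jis")      # parsed from the string start
    misaligned = encoded[1:].decode("shift_jis")  # parsed from mid-character
    return ch, encoded, aligned, misaligned


if __name__ == "__main__":
    ch, encoded, aligned, misaligned = demo()
    print(len(encoded), aligned == ch, misaligned)
    try:
        # A lead byte on its own is not a valid string.
        encoded[:1].decode("shift_jis")
    except UnicodeDecodeError as err:
        print("parsing error:", err.reason)
```

Read from the start of the string, the bytes have one interpretation; the
'a' appears only when a reader starts in the middle of a character, and a
truncated character is rejected outright.  That is a framing problem, not
a question of two valid identifiers looking alike.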

regards,

Ted

On Tue, Mar 10, 2015 at 6:33 PM, Andrew Sullivan <ajs@anvilwalrusden.com>
wrote:

> Dear colleagues,
>
> I hereby forward a message sent to a few people (including Asmus and
> me) responding to the draft Asmus and I submitted yesterday.
>
> I think this is a valuable response, though I am not sure I agree with
> the approach it's taking.  Nevertheless, if we are to have any hope of
> useful discussion in Dallas, it'd be good to have a look at the issues
> in advance.
>
> Best regards,
>
> A
>
> ----- Forwarded message from Mark Davis <mark@macchiato.com> -----
>
> Date: Tue, 10 Mar 2015 16:53:33 +0100
> From: Mark Davis <mark@macchiato.com>
> To: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
> Cc: Roozbeh Pournader <roozbeh@unicode.org>,
>         Ken Whistler <kenwhistler@att.net>,
>         Lisa Moore <lisam@us.ibm.com>,
>         Michel Suignard <michel@suignard.com>,
>         Markus Scherer <markus.icu@gmail.com>,
>         Peter Constable <petercon@microsoft.com>,
>         asullivan@dyn.com
> Subject: Re: Non-normalizable diacritics - new property
>
> + Andrew
>
> > In the meantime, the following has been released in preparation to the BOF
> > at the IETF meeting in Dallas.
> >
> > https://datatracker.ietf.org/doc/draft-sullivan-lucid-prob-stmt/
>
>
> A well-written report, and I think it will help a great deal in resolving
> the issues.
>
> However, I'd strongly suggest a couple of changes to be more precise, and
> thus avoid confusion (no pun intended). Although these may seem overly
> formal, it is vital that the text be unambiguous, because the subject is so
> very tricky, and it is thus so easy for people to be arguing based on
> different interpretations of terms.
>
> I will try to suggest some minimal changes, although it might be even
> better with some more extensive rewording.
>
> > I3 depends on the assumption that strings that will be used in
> >   identifiers will not have any ambiguous matching to other strings.
>
> When discussing strings, the term "ambiguous matching" is itself quite
> ambiguous. I can have ambiguous matching with SJIS, for example, because
> 0x61 can be an 'a' or be part of another character; similarly there are
> sequences of bytes ..XY.. where XY is a character or X is the end of one
> character, and Y is the start of another. There are other forms of
> ambiguity as well.
>
> To eliminate this ambiguity, change all 8 other cases of "ambiguous/ity"
> to be prefixed by "visually".
>
> =>   identifiers will not be visually ambiguous with other strings used as
> identifiers.
> ...
>
> Look for other cases that could profit by that, such as
> "indistinguishable" and "matches".
>
> See the later:
>
> > Worse, identifiers, by their very nature, are things that must
> >   provide reliable exact matches.  The whole point of an identifier is
> >   that it provides a reliable way of uniquely naming the thing to be
> >   identified.
>
> In this section, to be precise, one would need to use "visual matches".
> There are so, so many other ways strings can "match", like "have the same
> code points".
>
> However, the second sentence is still problematic. The characters in two
> different strings that have similar appearances *are* different characters,
> so the identifiers are "unique" in that sense. I guess what you meant to
> say is something like:
>
> => it provides a reliable way of naming the thing to be identified, so as
> to have a unique visual appearance.
>
> Now, I think adding "visual" makes it clear that this paragraph needs some
> work. As it stands, it would be too strong: it would call for eliminating
> all homographs from all identifiers (eg disallowing "top" written in
> Cyrillic characters.)
>
> Later, there is a shift to confusability, but the text never relates that
> directly to "confusability", which is what is elaborated upon.
>
> (In general, many different phrases are (apparently) used to mean the same
> thing: matching, indistinguishable, etc. While use of different terms makes
> the text flow better, it has the big disadvantage that the reader never
> knows whether phrase1 is meant to have a broader or narrower scope than
> phrase2, or is meant to be identical. This would be a more extensive
> change, however, to use more consistent defined language. Alternatively, at
> the top you could say that when you use the terms "matches",
> "indistinguishable", "ambiguous", etc. (you'll have to go through and find
> them all) that what is meant is in terms of visual appearance.)
>
> > (We use the term "homoglyph" strictly: code points that normally use the
> > same glyph when rendered.)
>
> This needs a bit of work. First step would be:
>
> Strings S1 and S2 are strict homoglyphs when they are different, yet
> normally use the same glyph sequence when rendered.
>
> Second would be to clarify "normally use". Does that mean "in the fonts
> that most people use on most platforms"? There are often more significant
> differences in serif fonts than non-serif, for example. Does that require
> that the same font is used? There is huge variation among glyphs in fonts:
> ţ and ț may look identical in font1, and may be discernibly different
> when both are in font2, but ţ in font1 may also look identical to ț in
> font2. This also gets tricky because in many systems there is font
> fallback. If a character is not in font1, then it might be displayed in a
> fallback font2.
>
> Third would be to clarify "same glyph sequence". Does this mean pixel by
> pixel? At all resolutions, or just some? Pixel by pixel is extremely
> strict.
>
> One could try to bring in "intent", but that is very dicey. The intent
> from the Unicode encoding side may not be followed precisely by all or even
> most font vendors. So it is of only theoretical value to point to intent.
>
> The exact degree of strictness matters a great deal, because under a
> sufficiently strict interpretation <ARABIC LETTER BEH (U+0628) + ARABIC
> HAMZA ABOVE (U+0654)> is *not* a strict homoglyph for U+08A1, ARABIC LETTER
> BEH WITH HAMZA ABOVE. That is, most current fonts do not display them as
> pixel-for-pixel identical.
>
> I know well that this is a continuum, but since you are focusing on strict
> homoglyphs, you have to be clearer what you mean by that term; where you
> draw the line. That also carries over into related terms elsewhere in the
> document like "the same glyph"; does that mean in all fonts (or
> some?/most?/common?), pixel for pixel or not, at all sizes or just
> body-text sizes, etc.?
>
> You might think about providing a definition for "the same glyph sequence",
> then using that in the definition for homograph, and elsewhere in the text.
>
> > Mitigation may be as simple as using a font
> >   designed to distinguish among different characters.
>
> Should mention that in practice this is extremely difficult to enforce; it
> only works in closed systems, those that have complete control over the
> fonts used for display of the kinds of identifiers in question.
>
> ----- End forwarded message -----
>
> --
> Andrew Sullivan
> ajs@anvilwalrusden.com