Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Ted Hardie <ted.ietf@gmail.com> Wed, 11 March 2015 19:15 UTC

In-Reply-To: <55008F97.8040701@ix.netcom.com>
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com> <55008F97.8040701@ix.netcom.com>
Date: Wed, 11 Mar 2015 12:15:39 -0700
Message-ID: <CA+9kkMAcgSA1Ch0B9W1Np0LMn2udegZ=AzU1b26dAi+SDcbGgg@mail.gmail.com>
From: Ted Hardie <ted.ietf@gmail.com>
To: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
Content-Type: multipart/alternative; boundary="20cf301d3a28c12e43051108179a"
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/AGO563EHmipmURMZiGw-Drh2ncc>
Cc: lucid@ietf.org, Andrew Sullivan <ajs@anvilwalrusden.com>
Subject: Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

On Wed, Mar 11, 2015 at 11:55 AM, Asmus Freytag (t) <asmus-inc@ix.netcom.com> wrote:

>  On 3/11/2015 10:58 AM, Ted Hardie wrote:
>
>  Hi Andrew,
>
>  So, I'm far from an expert on this, but I have some concerns about the
> proposal to use "visually" here in the way Mark proposes.  If I understand
> things correctly, this problem will also occur in any system that reads a
> sequence of characters aloud.  That is, there is no way to distinguish in
> an audio system between U+08A1 and the combination of U+0628 and U+0654.
>
>
> Where do you get that?
>
> U+0654 is used to represent a glottal stop, but my understanding is that
> U+08A1 was encoded partially because it represents a sound that does not
> have a glottal stop. If my understanding is correct, the distinction should
> be quite audible.
>

I asked a native Arabic speaker about the systems she used and she played
them to check.  Those systems may be non-representative, though, or simply
wrong.  Or, rather than wrong, they may be tuned to a case in which the
result isn't audible.

Are we confident that this difference is general for the other cases?  If
so, that would be a very powerful way to explain the problem--code points
which are visually indistinguishable but would render differently in
audio.  (If it is not general, we may still want to be cautious about focusing
solely on visual renderings).
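
As a rough illustration of why this case is awkward for protocols, here is a minimal Python sketch, assuming only the standard `unicodedata` module. The helper name `normalizes_equal` is made up for this example. It checks whether two strings become identical under any of the four standard normalization forms, and contrasts the BEH case with ALEF WITH HAMZA ABOVE (U+0623), which does have a canonical decomposition.

```python
import unicodedata

# The four standard Unicode normalization forms.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalizes_equal(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are identical under any form."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


BEH_WITH_HAMZA = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# The case under discussion: no normalization form unifies the two.
print(normalizes_equal(BEH_WITH_HAMZA, BEH_PLUS_HAMZA))  # False

# Contrast: U+0623 decomposes canonically to U+0627 U+0654.
print(normalizes_equal("\u0623", "\u0627\u0654"))        # True
```

This says nothing about how either sequence looks or sounds; it only shows that a protocol relying on normalization alone would treat the two BEH forms as different identifiers.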

regards,

Ted



>   The current draft talks about rendering the glyph, and I think that is
> closer to correct--it doesn't matter whether the rendering is audio or
> graphical for the problem to occur.  In the audio forms, the  quality of
> running text providing context vs. independent identifiers without it also
> seems similar, at least to me.
>
>
> I think that the focus on visual as suggested by Mark ignores that there
> are identifiers for which the user community claims that certain visually
> distinct representations are considered "the same" by users, to the point
> that one cannot rely on users to faithfully retain which version of the
> identifier is the desired one.
>
> As the problem statement is concerned with general background, it is
> important to recognize such other levels of ambiguity exist.
>
> A./
>
>
>  I also don't believe that Mark's point about parsing of byte sequences
> as XY when X and Y are meant is quite correct; in string contexts that would
> be a parsing error.  It might not be easily distinguished on the wire, but
> in protocol slots, this seems to be an at best unrelated issue.  I may not
> be familiar, though, with contexts in which this may occur.
>
>  regards,
>
>  Ted
>
> On Tue, Mar 10, 2015 at 6:33 PM, Andrew Sullivan <ajs@anvilwalrusden.com>
> wrote:
>
>> Dear colleagues,
>>
>> I hereby forward a message sent to a few people (including Asmus and
>> me) responding to the draft Asmus and I submitted yesterday.
>>
>> I think this is a valuable response, though I am not sure I agree with
>> the approach it's taking.  Nevertheless, if we are to have any hope of
>> useful discussion in Dallas, it'd be good to have a look at the issues
>> in advance.
>>
>> Best regards,
>>
>> A
>>
>> ----- Forwarded message from Mark Davis <mark@macchiato.com> -----
>>
>> Date: Tue, 10 Mar 2015 16:53:33 +0100
>> From: Mark Davis <mark@macchiato.com>
>> To: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
>> Cc: Roozbeh Pournader <roozbeh@unicode.org>, Ken Whistler <kenwhistler@att.net>,
>>         Lisa Moore <lisam@us.ibm.com>, Michel Suignard <michel@suignard.com>,
>>         Markus Scherer <markus.icu@gmail.com>, Peter Constable <petercon@microsoft.com>,
>>         asullivan@dyn.com
>> Subject: Re: Non-normalizable diacritics - new property
>>
>> + Andrew
>>
>> > In the meantime, the following has been released in preparation for the BOF
>> > at the IETF meeting in Dallas.
>> >
>> > https://datatracker.ietf.org/doc/draft-sullivan-lucid-prob-stmt/
>>
>>
>> A well written report, and I think will help a great deal in resolving the
>> issues.
>>
>> However, I'd strongly suggest a couple of changes to be more precise, and
>> thus avoid confusion (no pun intended). Although these may seem overly
>> formal, it is vital that the text be unambiguous, because the subject is so
>> very tricky, and it is thus so easy for people to be arguing based on
>> different interpretations of terms.
>>
>> I will try to suggest some minimal changes, although it might be even
>> better with some more extensive rewording.
>>
>> > I3 depends on the assumption that strings that will be used in
>> >   identifiers will not have any ambiguous matching to other strings.
>>
>> When discussing strings, the term "ambiguous matching" is itself quite
>> ambiguous. I can have ambiguous matching with SJIS, for example, because
>> 0x61 can be an 'a' or be part of another character; similarly there are
>> sequences of bytes ..XY.. where XY is a character or X is the end of one
>> character, and Y is the start of another. There are other forms of
>> ambiguity as well.
>>
>> To eliminate this ambiguity, change each of the 8 other cases of
>> "ambiguous/ity" so that it is prefixed by "visually".
>>
>> =>   identifiers will not be visually ambiguous with other strings used as
>> identifiers.
>> ...
>>
>> Look for other cases that could profit by that, such as
>> "indistinguishable" and "matches".
>>
>> See the later:
>>
>> > Worse, identifiers, by their very nature, are things that must
>> >   provide reliable exact matches.  The whole point of an identifier is
>> >   that it provides a reliable way of uniquely naming the thing to be
>> >   identified.
>>
>> In this section, to be precise, one would need to use "visual matches".
>> There are so, so many other ways strings can "match", like "have the same
>> code points".
>>
>> However, the second sentence is still problematic. The characters in two
>> different strings that have similar appearances *are* different characters,
>> so the identifiers are "unique" in that sense. I guess what you meant to
>> say is something like:
>>
>> => it provides a reliable way of naming the thing to be identified, so as
>> to have a unique visual appearance.
>>
>> Now, I think adding "visual" makes it clear that this paragraph needs some
>> work. As it stands, it would be too strong: it would call for eliminating
>> all homographs from all identifiers (eg disallowing "top" written in
>> Cyrillic characters.)
>>
>> Later, there is a shift to confusability, but the text never relates that
>> directly to "confusability", which is what is elaborated upon.
>>
>> (In general, many different phrases are (apparently) used to mean the same
>> thing: matching, indistinguishable, etc. While use of different terms makes
>> the text flow better, it has the big disadvantage that the reader never
>> knows whether phrase1 is meant to have a broader or narrower scope than
>> phrase2, or is meant to be identical. This would be a more extensive
>> change, however, to use more consistent defined language. Alternatively, at
>> the top you could say that when you use the terms "matches",
>> "indistinguishable", "ambiguous", etc. (you'll have to go through and find
>> them all) that what is meant is in terms of visual appearance.)
>>
>> > (We use the term "homoglyph" strictly: code points that normally use the
>> >   same glyph when rendered.)
>>
>> This needs a bit of work. First step would be:
>>
>> Strings S1 and S2 are strict homoglyphs when they are different, yet
>> normally use the same glyph sequence when rendered.
>>
>> Second would be to clarify "normally use". Does that mean "in the fonts
>> that most people use on most platforms"? There are often more significant
>> differences in serif fonts than non-serif, for example. Does that require
>> that the same font is used? There is huge variation among glyphs in
>> fonts: ţ and ț may look identical in font1, and may be discernibly different
>> when both are in font2, but ţ in font1 may also look identical to ț in
>> font2. This also gets tricky because in many systems there is font
>> fallback. If a character is not in font1, then it might be displayed in a
>> fallback font2.
>>
>> Third would be to clarify "same glyph sequence". Does this mean pixel by
>> pixel? At all resolutions, or just some? Pixel by pixel is extremely strict.
>>
>> One could try to bring in "intent", but that is very dicey. The intent
>> from the Unicode encoding side may not be followed precisely by all or even
>> most font vendors. So it is of only theoretical value to point to intent.
>>
>> The exact degree of strictness matters a great deal, because under a
>> sufficiently strict interpretation <ARABIC LETTER BEH (U+0628) + ARABIC
>> HAMZA ABOVE (U+0654)> is *not* a strict homograph for U+08A1, ARABIC LETTER
>> BEH WITH HAMZA ABOVE. That is, most current fonts do not display them as
>> pixel-for-pixel identical.
>>
>> I know well that this is a continuum, but since you are focusing on strict
>> homoglyphs, you have to be clearer what you mean by that term; where you
>> draw the line. That also carries over into related terms elsewhere in the
>> document like "the same glyph"; does that mean in all fonts (or
>> some?/most?//common?), pixel for pixel or not, at all sizes or just
>> body-text sizes, etc.?
>>
>> You might think about providing a definition for "the same glyph sequence",
>> then using that in the definition for homograph, and elsewhere in the text.
>>
>> > Mitigation may be as simple as using a font
>> >   designed to distinguish among different characters.
>>
>> Should mention that in practice this is extremely difficult to enforce; it
>> only works in closed systems, those that have complete control over the
>> fonts used for display of the kinds of identifiers in question.
>>
>> ----- End forwarded message -----
>>
>> --
>> Andrew Sullivan
>> ajs@anvilwalrusden.com
>>
>> _______________________________________________
>> Lucid mailing list
>> Lucid@ietf.org
>> https://www.ietf.org/mailman/listinfo/lucid
>>
>