Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

John C Klensin <john-ietf@jck.com> Wed, 11 March 2015 21:15 UTC

Date: Wed, 11 Mar 2015 17:15:27 -0400
From: John C Klensin <john-ietf@jck.com>
To: Ted Hardie <ted.ietf@gmail.com>, Andrew Sullivan <ajs@anvilwalrusden.com>
Message-ID: <4A00B59133258EB59C38CD37@JcK-HP8200.jck.com>
In-Reply-To: <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com>
References: <20150311013300.GC12479@dyn.com> <CA+9kkMDZW9yPtDxtLTfY1=VS6itvHtXHF1qdZKtXdwwORwqnew@mail.gmail.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/oqUzKDYMCTP-pSjEGFV4Iis7eYk>
Cc: lucid@ietf.org
Subject: Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]

Ted,

I won't claim to more than about 10% more expertise, but maybe I
can clarify some of the issues you raise (and with which I
generally agree)...

--On Wednesday, March 11, 2015 10:58 -0700 Ted Hardie
<ted.ietf@gmail.com> wrote:

> Hi Andrew,
> 
> So, I'm far from an expert on this, but I have some concerns
> about the proposal to use "visually" here in the way Mark
> proposes.  If I understand things correctly, this problem will
> also occur in any system that reads a sequence of characters
> aloud.

Or not.  To give a more familiar example than the one that started
this (I'm assuming there are no first-language speakers of Fula
on this list and, if there are, that they probably don't
primarily write it in Arabic characters), text to speech
programs generally need to know the language involved and often
have language-specific dictionaries.  Anyone who speaks
competent French knows that English vowel phonemes and French
ones are different.  When one set is used to render the other
language, what first-language speakers of the second language
hear is "weird accent" at best and speech that is
incomprehensible at worst.  English is particularly bad in this
regard because it has far fewer vowel (and consonant, but the
vowels are more obvious) symbols than it has phonemes, while
many languages that use Latin Script "decorate" their characters
in various ways that provide much more written distinction among
those phonemes.

In written form, type style and rendering distinctions, which
are very much tied up with "visual", involve many of the same
issues because accurate rendering is often tied up with
language, rather than being intrinsic properties of the script.

Interestingly, as soon as a suggestion is made that takes us to
the "language" distinctions that are often key to both "visual"
and "spoken", it also brings us back to where the original
discussion about U+08A1 --the discussion that led to our
understanding of the extent and complexity of these problems--
started: if one knows the language contexts, the distinctions
that Unicode makes by not having some characters decompose
become obvious.  Conversely, if one is talking about the DNS or
other identifiers in which language information is inherently
unavailable or non-applicable, then statements and ideas that
depend on language just don't help.

There is a second set of distinctions that gets involved here
and that do not depend on language -- a collection of combining
characters that (using a summary with which Unicode experts
might not agree) are not defined precisely enough that one can
predict how they will overlay on or combine with the base
characters.  The resulting precomposed characters (almost) never
decompose.  This issue is discussed briefly in the fourth
paragraph of Section 3.3.1 of
draft-klensin-idna-5892upd-unicode70-04 and the piece of the
Unicode Standard referenced there as "Unicod70-Overlay" (sic - I
hate it when a typo is spotted immediately after a document is
posted) so this isn't _just_ a language-distinction (or
phoneme-distinction) problem, but that is part of it.

>  That is, there is no way to distinguish in an audio
> system between U+08A1 and the combination of U+0628 and
> U+0654.

Actually, if we are willing to drag the discussion back into
Hamza issues (one of the reasons why
draft-klensin-idna-5892upd-unicode70-04 is twice the length of
draft-sullivan-lucid-prob-stmt), there is a significant phonetic
difference.  See section 2.2.2 of the latter and a different
(and more comprehensive) version in 3.2.3 of the former and, as
needed, the Unicode explanations it references.  I prefer to
think of these distinctions as language ones rather than
phonetic ones because if they were the latter, no normal human
being would be able to tell which code point to use even though
reading (i.e., text to speech) systems would be happier.  If
both the combining sequence and the precomposed character could
reasonably appear with the same language and be treated as
distinct, there would be a big problem but, if that problem
exists, I haven't seen an example yet.
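
For anyone who wants to see the non-decomposition concretely, here is
a minimal sketch using Python's standard unicodedata module.  It
compares U+08A1 with the sequence U+0628 U+0654 under each of the four
normalization forms and, for contrast, does the same for U+0623 and
U+0627 U+0654, a pair that does have a canonical decomposition.  The
helper names (same_after, compare_all_forms) are made up for this
illustration, and the results reflect whatever Unicode version the
local Python build ships, so treat it as a demonstration rather than
a definitive statement about any given system.

```python
import unicodedata

# The pair under discussion: a precomposed code point and the
# combining sequence that renders identically in most fonts.
PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_after(form, a, b):
    """Return True if a and b are identical after normalizing to form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def compare_all_forms(a, b):
    """Map each normalization form to whether a and b compare equal."""
    return {form: same_after(form, a, b) for form in FORMS}


# U+08A1 has no decomposition mapping, so no form unifies the pair.
beh_result = compare_all_forms(PRECOMPOSED, SEQUENCE)

# Contrast: U+0623 (ALEF WITH HAMZA ABOVE) decomposes canonically to
# U+0627 U+0654, so every form unifies that pair.
alef_result = compare_all_forms("\u0623", "\u0627\u0654")

if __name__ == "__main__":
    print("U+08A1 vs U+0628 U+0654:", beh_result)
    print("U+0623 vs U+0627 U+0654:", alef_result)
```

The point for identifiers is that a comparison based purely on
normalization treats the first pair as two different strings, with no
language information available to say which one was intended.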

The other problem with "visual" is that it pulls us back into
subjective confusability and that is both a nightmare we would
prefer to avoid if possible and something that has been the
source of earlier claims that there is no solvable problem in
this area.

>  The current draft talks about rendering the glyph,
> and I think that is closer to correct--it doesn't matter
> whether the rendering is audio or graphical for the problem to
> occur.  In the audio forms, the  quality of  running text
> providing context vs. independent identifiers without it also
> seems similar, at least to me.
>...

    john