Re: [Lucid] [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
John C Klensin <john-ietf@jck.com> Wed, 11 March 2015 21:15 UTC
Date: Wed, 11 Mar 2015 17:15:27 -0400
From: John C Klensin <john-ietf@jck.com>
To: Ted Hardie <ted.ietf@gmail.com>, Andrew Sullivan <ajs@anvilwalrusden.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/oqUzKDYMCTP-pSjEGFV4Iis7eYk>
Cc: lucid@ietf.org

Ted,

I won't claim more than about 10% more expertise, but maybe I can
clarify some of the issues you raise (and with which I generally
agree)...

--On Wednesday, March 11, 2015 10:58 -0700 Ted Hardie
<ted.ietf@gmail.com> wrote:

> Hi Andrew,
>
> So, I'm far from an expert on this, but I have some concerns
> about the proposal to use "visually" here in the way Mark
> proposes. If I understand things correctly, this problem will
> also occur in any system that reads a sequence of characters
> aloud.

Or not. To give a more familiar example than the one that started
this (I'm assuming there are no first-language speakers of Fula on
this list and, if there are, that they probably don't primarily
write it in Arabic characters), text to speech programs generally
need to know the language involved and often have language-specific
dictionaries. Anyone who speaks competent French knows that English
vowel phonemes and French ones are different. When one set is used
to render the other language, what first-language speakers of the
second language hear is "weird accent" at best and speech that is
incomprehensible at worst. English is particularly bad in this
regard because it has far fewer vowel (and consonant, but the
vowels are more obvious) symbols than it has phonemes, while many
languages that use Latin Script "decorate" their characters in
various ways that provide much more written distinction among those
phonemes. In written form, type style and rendering distinctions,
which are very much tied up with "visual", involve many of the same
issues because accurate rendering is often tied up with language,
rather than being an intrinsic property of the script.

Interestingly, as soon as a suggestion is made that takes us to the
"language" distinctions that are often key to both "visual" and
"spoken", it also brings us back to where the original discussion
about U+08A1 --the discussion that led to our understanding of the
extent and complexity of these problems-- started: if one knows the
language contexts, the distinctions that Unicode makes by not
having some characters decompose become obvious. Conversely, if
one is talking about the DNS or other identifiers in which language
information is inherently unavailable or non-applicable, then
statements and ideas that depend on language just don't help.

There is a second set of distinctions that gets involved here and
that does not depend on language -- a collection of combining
characters that (using a summary with which Unicode experts might
not agree) are not defined precisely enough that one can predict
how they will overlay on or combine with the base characters. The
resulting precomposed characters (almost) never decompose. This
issue is discussed briefly in the fourth paragraph of Section 3.3.1
of draft-klensin-idna-5892upd-unicode70-04 and the piece of the
Unicode Standard referenced there as "Unicod70-Overlay" (sic - I
hate it when a typo is spotted immediately after a document is
posted) so this isn't _just_ a language-distinction (or
phoneme-distinction) problem, but that is part of it.

> That is, there is no way to distinguish in an audio
> system between U+08A1 and the combination of U+0628 and
> U+0654.

Actually, if we are willing to drag the discussion back into Hamza
issues (one of the reasons why
draft-klensin-idna-5892upd-unicode70-04 is twice the length of
draft-sullivan-lucid-prob-stmt), there is a significant phonetic
difference. See section 2.2.2 of the latter and a different (and
more comprehensive) version in 3.2.3 of the former and, as needed,
the Unicode explanations it references.

I prefer to think of these distinctions as language ones rather
than phonetic ones because if they were the latter, no normal human
being would be able to tell which code point to use even though
reading (i.e., text to speech) systems would be happier. If both
the combining sequence and the precomposed character could
reasonably appear with the same language and be treated as
distinct, there would be a big problem but, if that problem exists,
I haven't seen an example yet.

The other problem with "visual" is that it pulls us back into
subjective confusability and that is both a nightmare we would
prefer to avoid if possible and something that has been the source
of earlier claims that there is no solvable problem in this area.

> The current draft talks about rendering the glyph,
> and I think that is closer to correct--it doesn't matter
> whether the rendering is audio or graphical for the problem to
> occur. In the audio forms, the quality of running text
> providing context vs. independent identifiers without it also
> seems similar, at least to me.
>...

    john
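
The non-decomposition of U+08A1 discussed in the message above can be
checked with a minimal sketch. It assumes Python's standard
unicodedata module backed by a Unicode 7.0 or later database; the
helper name nfc_equivalent is hypothetical, and the Alef pair
(U+0623 versus U+0627 U+0654) is included only as a contrast case
that does compose under NFC.

```python
import unicodedata


def nfc_equivalent(a: str, b: str) -> bool:
    """Return True when both strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The pair from the thread: precomposed BEH WITH HAMZA ABOVE versus
# BEH followed by combining HAMZA ABOVE.
BEH_HAMZA_PRECOMPOSED = "\u08A1"
BEH_HAMZA_SEQUENCE = "\u0628\u0654"

# Contrast pair (an assumption added for illustration): ALEF WITH HAMZA
# ABOVE has a canonical decomposition to ALEF + combining HAMZA ABOVE.
ALEF_HAMZA_PRECOMPOSED = "\u0623"
ALEF_HAMZA_SEQUENCE = "\u0627\u0654"

beh_match = nfc_equivalent(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE)
alef_match = nfc_equivalent(ALEF_HAMZA_PRECOMPOSED, ALEF_HAMZA_SEQUENCE)

print("U+08A1 vs U+0628 U+0654 equal under NFC:", beh_match)
print("U+0623 vs U+0627 U+0654 equal under NFC:", alef_match)
```

Normalization alone therefore does not make the two Beh forms compare
equal, which is why an identifier-matching system that has no language
context cannot rely on NFC to resolve this case.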