Re: [precis] One change before Last Call

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Sat, 07 February 2015 12:11 UTC

Message-ID: <54D600A3.1070900@it.aoyama.ac.jp>
Date: Sat, 07 Feb 2015 21:10:11 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: John C Klensin <john-ietf@jck.com>, Pete Resnick <presnick@qti.qualcomm.com>
References: <54D18760.8090100@qti.qualcomm.com> <2423dah4ipo7jhvjr3vq4alidnsk7qdt5v@hive.bjoern.hoehrmann.de> <54D190FB.9060609@qti.qualcomm.com> <f253dat6a3a9t52eh4rlaeeoeuhq02jseo@hive.bjoern.hoehrmann.de> <54D1A2C7.8020301@qti.qualcomm.com> <1CC9F36F2D7AF58BD7353D80@JcK-HP8200.jck.com> <54D27694.7080300@qti.qualcomm.com> <eu16dal8d14kjtrqhn98hrd6c63mbsvvp2@hive.bjoern.hoehrmann.de> <54D3449B.10306@it.aoyama.ac.jp> <916E45ADE98041A7C3A959D7@JcK-HP8200.jck.com>
In-Reply-To: <916E45ADE98041A7C3A959D7@JcK-HP8200.jck.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Archived-At: <http://mailarchive.ietf.org/arch/msg/precis/FZQV4wDGxvve5SXfUfUpYZE82Jk>
Cc: precis@ietf.org
Subject: Re: [precis] One change before Last Call
Precedence: list

On 2015/02/06 00:58, John C Klensin wrote:
> --On Thursday, February 05, 2015 19:23 +0900 "\"Martin J.
> Dürst\"" <duerst@it.aoyama.ac.jp> wrote:

>> Except maybe for two or three people on the Unicode Technical
>> Committee I know, I wouldn't want to claim that anybody knows
>> the implications of even a significant (in terms of size and
>> use) part of the Unicode repertoire. And for the average
>> implementer or system administrator, it's of course much less.
>> But we definitely don't want that to lead to a situation where
>> we go back to (some time) last century and ASCII only.
>
> Martin,
>
> I don't see how you get from there to "ASCII only".  First,
> there are a lot of people in the world who don't "understand the
> implications of" Latin Script, even the basic undecorated Latin
> characters and even though they might use them.  I think that,
> while it may require some effort on their part, it is reasonable
> to expect implementers and system administrators who establish
> rules for identifiers to take responsibility for understanding
> the use and possible risks associated with the characters of
> their own scripts, especially the subset of those characters
> that are relevant to their own languages.

Hello John - You are right that not all people who use a script 
understand it. I could give some specific examples. But they usually 
think they understand it, so that will lead to the same outcome 
regarding what we write. Also, you are correct that not all implementers
and system administrators are from the Latin-writing parts of the world.
But a good majority are, and they have a huge influence on the rest of the
world, and once one is in a defensive mindset, it's easy to come to the 
conclusion "let's use ASCII, that has worked for decades, that can't be 
wrong (and even if it is, nobody can be blamed/fired for choosing it)." 
So "ASCII only" was somewhat of a simplification, but not a big one.

> I recognize that makes it hard to design software systems that
> are somehow internationally script-insensitive where identifiers
> are concerned, but I think we have to live with that as the
> price of the diversity of human languages and writing systems.
> It may also imply a need for software implementations that are
> far more rule-driven, possibly with locally-tailorable rules for
> individual scripts, languages, and context, rather than an
> approach that is construed as "this magic table of characters is
> ok".   Again, that may be the price of the diversity of human
> writing systems and, by looking at tables and global profiles, we
> may just be in denial about that diversity and its implications.

I agree, at least in theory.

> None of the above is made any easier by Unicode decisions,
> however justified by the same diversity issues, pushing us from
> design principles that apply to all of the coding system, to
> design principles that are different on a per-script basis, to
> specific and exception-driven rules such as "normalization does
> all of the right comparison things within the Latin script
> except for the following code points for which decomposition is
> appropriate under some circumstances but not others" or "there
> are case matching rules that are universally applicable except
> for certain specific locales and code points, where special
> treatment is needed".

I spent quite a bit of my time in college learning Kanji. Quite a bit
later, I spent some time analysing Kanji shapes in an attempt to create
some software for font design. I published a few papers, but didn't get
much farther. But one thing I learned was that although Kanji were built 
up highly regularly, once you had to account for them in their full 
numbers, there was always some kind of edge case or exception that broke 
(or confirmed, as the saying goes) the rules.

What you talk about above shows that the same applies to Unicode
overall: even with a very strong attempt to keep everything in line
with simple rules, there will always be some corner case or exception.
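
A minimal sketch of the kind of corner case discussed above, assuming
Python 3 and its standard unicodedata module. The helper name nfc_equal
is hypothetical and chosen only for illustration; the code points are
examples picked for this sketch, not ones named in the thread.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The simple rule: canonically equivalent sequences match after NFC.
# U+00E9 (e with acute) versus "e" followed by U+0301 (combining acute).
print(nfc_equal("\u00e9", "e\u0301"))

# A corner case: U+00F8 (o with stroke) has no canonical decomposition,
# so "o" followed by U+0338 (combining long solidus overlay) stays distinct.
print(nfc_equal("\u00f8", "o\u0338"))

# Case folding is not always one-to-one: sharp s folds to "ss".
print("Stra\u00dfe".casefold())

# Default lowercasing ignores locale: "I" always becomes "i", although
# Turkish orthography would expect U+0131 (dotless i) here.
print("I".lower() == "i")

# U+0130 (capital I with dot above) lowercases to two code points.
print([hex(ord(c)) for c in "\u0130".lower()])
```

The first comparison follows the general rule; each of the others is an
exception of the sort that a single global table cannot express without
extra, script-specific or locale-specific handling.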

> It may be that we have been in denial, that the whole concept of
> identifiers without language context is unworkable for at least
> some protocols, and that we should be thinking of an
> "internationalized identifier" as a tuple with a string and
> language identifier.  Comparisons would then depend, not on
> catenation and bit-by-bit comparison but on
> consideration of the language identifier based on RFC 4647 and
> then interpretation and comparison of the string based on that
> information.

I think we have very good reasons to have rejected this approach 
virtually since the first moment we thought about internationalized 
identifiers. Human eyesight doesn't see invisible BCP 47 'color' painted 
over text, and people don't think that way.
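
To make the trade-off concrete, here is a rough sketch of the
(string, language tag) tuple idea, assuming Python 3. Everything in it is
hypothetical: the TaggedIdentifier class, the fold helper, and the choice
to special-case only Turkish and Azerbaijani are assumptions made for
illustration, not anything specified by PRECIS. The matches_range helper
follows the basic filtering rule of RFC 4647 (a range matches a tag that
is equal to it, or that starts with it followed by "-").

```python
import unicodedata
from dataclasses import dataclass


def matches_range(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: case-insensitive match on subtag boundaries."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def fold(text: str, tag: str) -> str:
    """Hypothetical language-aware folding (only tr/az are special-cased)."""
    text = unicodedata.normalize("NFC", text)
    if matches_range("tr", tag) or matches_range("az", tag):
        # Turkic rule: "I" pairs with dotless i, U+0130 pairs with "i".
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return unicodedata.normalize("NFC", text.casefold())


@dataclass(frozen=True)
class TaggedIdentifier:
    text: str
    language: str

    def same_as(self, other: "TaggedIdentifier") -> bool:
        # Assumption: identifiers with different tags never match.
        if self.language.lower() != other.language.lower():
            return False
        return fold(self.text, self.language) == fold(other.text, other.language)


en_upper = TaggedIdentifier("TITLE", "en")
en_lower = TaggedIdentifier("title", "en")
tr_upper = TaggedIdentifier("TITLE", "tr")
tr_lower = TaggedIdentifier("title", "tr")

print(en_upper.same_as(en_lower))  # True: default case folding
print(tr_upper.same_as(tr_lower))  # False: Turkish "I" folds to dotless i
print(en_lower.same_as(tr_lower))  # False: same visible text, different tag
```

The last comparison is the problem: two identifiers that look identical
on screen fail to match because of a tag the reader never sees.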

> That suggests that we should finish the PRECIS work based on
> current documents rather than looking for a more perfect
> solution (or textual phrasing) now.  However, it does also
> suggest that, for at least some purposes, the PRECIS work may be
> a waypoint rather than a final answer.

As I tried to explain, I just wanted to make sure we don't throw out the 
baby with the bathwater. The new text is fine with me.

Regards,   Martin.
