Re: [apps-discuss] [xmpp] i18n intro, Sunday 14:00-16:00

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 25 July 2011 05:38 UTC

Message-ID: <4E2D0124.9040602@it.aoyama.ac.jp>
Date: Mon, 25 Jul 2011 14:37:40 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: Joe Hildebrand <joe.hildebrand@webex.com>
References: <CA4EF2B9.C0B4%joe.hildebrand@webex.com>
In-Reply-To: <CA4EF2B9.C0B4%joe.hildebrand@webex.com>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 8bit
Cc: apps-discuss@ietf.org, xmpp@ietf.org
Subject: Re: [apps-discuss] [xmpp] i18n intro, Sunday 14:00-16:00

Hello Joe,


On 2011/07/23 0:26, Joe Hildebrand wrote:
> On 7/22/11 1:50 AM, "Martin J. Dürst"<duerst@it.aoyama.ac.jp>  wrote:
>
>>> First some assumptions:
>>> - Stringprep is currently one of the performance hotspots of some XMPP
>>> servers.
>>
>> Is that an assumption backed by facts or a wild guess?
>
> I can only talk definitively about one server, but yes, I've got data to
> back up that assumption for that one server.

Okay, thanks for the confirmation.

>>> - If the spec is written that clients SHOULD perform canonicalization, many
>>> in our community will, particularly if they know that they will get better
>>> performance from the server.
>>
>> That's the same for NFC and NFD, isn't it? The advantage of NFC is that
>> it's designed to more-or-less match what's out there, so you get the
>> advantage that there is more stuff that's already canonicalized even if
>> a client doesn't do anything. That gets reinforced by both the IETF
>> and the W3C telling everybody to use NFC.
>
> Absolutely.  However, my point is that our client community, unlike the MUA
> community (for example) is more likely to adopt change.

Yes, but there's no need to adopt change without a really good reason. 
And there should be no need for XMPP to differ from the rest of the IETF.

BTW, can you tell us what programming language the server is written in? 
Are you using any kinds of libraries for character-related business?


>>> The property of NFK?D that we like is that if you have a string of
>>> codepoints that is already in NFK?D, you can check that the string is in the
>>> correct normalization form without having to allocate memory.  With NFK?C,
>>> you'll have to decompose (allocating memory), recompose (at some finite CPU
>>> cost), then recompose (possibly allocating *again*) just to check if you
>>> have already done the normalization.
>>
>> Nope. That's just how the algorithm was defined, because the
>> decomposition component was already around, and NFD was what defined
>> canonical equivalence.
>>
>> As an example, http://www.w3.org/2003/06/xml1.1test/Overview.html has
>> some code (both UTF-8 and UTF-16) for compact (memory footprint) and
>> reasonably fast NFC check. Please ignore the "XML 1.1" in the title.
>> Also please note that this is proof of concept code, and may need
>> additional testing and of course upgrade to the newest version of
>> Unicode. I don't have any open bug reports, but that may be for other
>> reasons than that there are no bugs.
>
> Am I correct that the key bit of your algorithm is:

In some way, yes.

> """Check for potential combination with starter. The list of recombinations
> is calculated so that any combining character that would lead to a change
> (full combination, recombination so that the combining character gets
> combined in but another combining character is separated, complete
> decomposition,...) is listed. 2176 such pairs have been found."""
>
> Can you talk more about that?

The actual list is in
http://dev.w3.org/cvsweb/~checkout~/charlint/xml1.1test/nf16data.c?content-type=text/plain, 
in the array 'recombiners'.

The first entry is {0x003C, 0x0338}, which is a '<' and a COMBINING LONG 
SOLIDUS OVERLAY. It's here because there is U+226E, NOT LESS-THAN, which 
is canonically equivalent, and therefore the sequence U+003C, U+0338 (or 
any sequence U+003C, [other combining characters], U+0338) isn't in NFC.

Let's look at another entry: {0x00FC, 0x0300}. This is LATIN SMALL 
LETTER U WITH DIAERESIS (ü) followed by COMBINING GRAVE ACCENT. This is 
not in NFC because there is U+01DC, LATIN SMALL LETTER U WITH DIAERESIS 
AND GRAVE.

Another, somewhat more involved example: {0x1EC5, 0x0323}. This is LATIN 
SMALL LETTER E WITH CIRCUMFLEX AND TILDE and COMBINING DOT BELOW. This 
is not in NFC because there is U+1EC7, LATIN SMALL LETTER E WITH 
CIRCUMFLEX AND DOT BELOW, and NFC is U+1EC7 + U+0303 (COMBINING TILDE) 
because combining characters below have a lower combining class and 
therefore are preferred when recombining.
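
The three pairs above can be checked with any up-to-date normalization 
library. The following is only a minimal sketch using Python's standard 
unicodedata module (it is not the code from the xml1.1test page, and the 
helper name nfc_codepoints is made up for this illustration):

```python
import unicodedata

# Pairs taken from the examples above: (starter, combining character).
pairs = [
    ("\u003C", "\u0338"),  # '<' + COMBINING LONG SOLIDUS OVERLAY
    ("\u00FC", "\u0300"),  # u with diaeresis + COMBINING GRAVE ACCENT
    ("\u1EC5", "\u0323"),  # e with circumflex and tilde + COMBINING DOT BELOW
]

def nfc_codepoints(text):
    """Return the NFC form of text as a list of U+XXXX strings."""
    return ["U+%04X" % ord(ch) for ch in unicodedata.normalize("NFC", text)]

for starter, combiner in pairs:
    # Each pair changes under NFC, which is why it appears in 'recombiners'.
    print(nfc_codepoints(starter + combiner))
```

The third pair is the interesting one: the result has two code points, 
U+1EC7 followed by U+0303, so the tilde is split off while the dot below 
gets combined in.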

> Do we expect that future versions of Unicode
> will change that set of pairs?

This is at least in theory possible. The data is from 2003. With input 
from the W3C, the Unicode Consortium created fairly strong stability 
policies regarding normalization (please see 
http://www.unicode.org/policies/stability_policy.html, under 
"Normalization Stability"). As a result, any new precomposed characters 
had to be included in the composition exclusions. That meant that the 
benefit of proposing them as precomposed characters was lost, and there 
were virtually no such characters that made it all the way. It's still 
possible to propose a new script with some precomposed/decomposed 
alternatives, but I don't know of a case where this has been done.

Please note that such additions would also affect decomposition, so a 
software (table) update would be needed independent of whether you go 
for NFC or NFD.

>> Probably just fine for most cases. Potentially a problem for wide/narrow
>> in places like Japan.
>
> Yup.  That's why the wide/narrow stuff is still on the list of questions to
> be addressed with this approach.  I also fully expect that there are N other
> codepoints that are marked as having compatibility decomposition but are
> actually in widespread use, and someone's going to complain about them one
> day.
>
>>> The idea is that clients SHOULD normalize, servers double-check inputs from
>>> non-trusted sources (like clients and other servers), then always store and
>>> forward the normalized version.
>>
>> That's fine. But it doesn't explain the choice of NFD (vs. NFC).
>
> One other point, the re-composition stuff in the C forms *is* objectively
> more difficult to implement,

Yes, but usually, you'd just use a library, and be done with it.

> and will *always* take more CPU.

Not if most of your data is already in NFC, and you can check that quickly.

I do not have actual statistics, but I don't know of any major language 
or script where data would be in NFD from the start. On the other hand, 
there are examples such as Korean where essentially everything is in 
NFC, and NFD expands the number of characters by a factor of between 2 
and 3, affecting every single character. For other big languages such as 
Spanish, Portuguese, French, German, Italian, Polish, Japanese,... data 
is also essentially in NFC, although the characters that would be 
decomposed are less frequent than the 100% for Korean.
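
To give a rough feeling for the numbers, here is a small sketch using 
Python's standard unicodedata module (is_normalized needs Python 3.8 or 
later). The sample strings and the helper name nfd_expansion are 
assumptions chosen for illustration, not measurements of real traffic:

```python
import unicodedata

def nfd_expansion(text):
    """Ratio of code points in NFD to code points in the original text."""
    return len(unicodedata.normalize("NFD", text)) / len(text)

# Three precomposed Hangul syllables (the Korean word for "Korean language").
korean = "\uD55C\uAD6D\uC5B4"
# "Gruesse" as usually written: only U+00FC has a canonical decomposition.
german = "Gr\u00FC\u00DFe"

# Both samples are already in NFC as typed.
print(unicodedata.is_normalized("NFC", korean))
print(unicodedata.is_normalized("NFC", german))

# Every Hangul syllable decomposes into two or three jamo under NFD,
# while the German sample grows by a single code point.
print(len(korean), len(unicodedata.normalize("NFD", korean)))
print(len(german), len(unicodedata.normalize("NFD", german)))
```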

> If it's not required, it should be avoided.

That's similar to the argument about saving bytes on the wire. It's not 
compelling in a world with efficient libraries and faster and faster 
processors.

Also, please note that the code at 
http://www.w3.org/2003/06/xml1.1test/Overview.html isn't really the only 
way to optimize, and it's just a proof of concept, not really optimized 
based on actual benchmarks. My guess is that there are things that are 
"overoptimized" (i.e. too complicated without a corresponding speed 
gain) and some spots that could still be further optimized. In many 
cases, checking for all-ASCII first can shave off quite a bit of time. 
Also, doing a lookup of the string (which I guess has to happen sooner 
or later anyway) before checking for normalization may speed up things.
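
The all-ASCII shortcut mentioned above could look roughly like the 
following. This is a hedged sketch in Python (str.isascii needs 3.7, 
unicodedata.is_normalized needs 3.8), the function name ensure_nfc is 
hypothetical, and it says nothing about how a particular server would 
structure its hot path:

```python
import unicodedata

def ensure_nfc(text):
    """Return text in NFC, trying the cheap checks first."""
    # Pure ASCII is unchanged by NFC, so no table lookups are needed.
    if text.isascii():
        return text
    # Ask the library whether the string is already normalized before
    # paying for a full normalization pass.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Only strings that actually need it get normalized.
    return unicodedata.normalize("NFC", text)
```

For example, ensure_nfc("joe@example.com") returns its argument 
untouched, while ensure_nfc("u\u0308") returns the single code point 
U+00FC.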

> And I haven't seen any compelling arguments
> as to why C should be preferred.  What I've seen so far:
>
> - That's the way we've always done it.  Not compelling to me if we're going
> to switch away from NFKC in any way.

That may not be compelling on the level of XMPP, but on the level of the 
IETF or the Internet, it's a different issue.

Also, under the assumption that actual characters where NFKC makes a 
difference are few and far between, NFKC and NFC are quite close. A 
change from NFKC to NFC may be much smoother than a change from NFKC to NFD.
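
As a small illustration of where NFKC and NFC differ, the sketch below 
(Python's standard unicodedata module; the sample characters are my own 
choice) compares the two forms for a fullwidth letter, a halfwidth 
katakana, and an ordinary precomposed letter:

```python
import unicodedata

samples = {
    "fullwidth A": "\uFF21",
    "halfwidth katakana KA": "\uFF76",
    "u with diaeresis": "\u00FC",
}

def changed_by(form, ch):
    """True if normalizing ch to the given form alters it."""
    return unicodedata.normalize(form, ch) != ch

for name, ch in samples.items():
    # NFC leaves all three alone; NFKC folds the wide/narrow variants.
    print(name, changed_by("NFC", ch), changed_by("NFKC", ch))
```

Only the wide/narrow variants are touched by NFKC here, which is the 
kind of character the Japanese wide/narrow question is about.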

> - Saves a few bytes on the wire.  Not compelling in a world with
> compression.  (see: http://xmpp.org/extensions/xep-0138.html)

Agreed.

> - More likely to render correctly.  Not with today's font renderers.

Depends on the exact characters in question. Can go either way.

> Are there any others I'm missing?

See above (libraries, frequency of data in NFC vs. NFD,...).


Regards,   Martin.