Re: [apps-discuss] i18n intro, Sunday 14:00-16:00

"Martin J. Dürst" <> Fri, 22 July 2011 07:51 UTC

Date: Fri, 22 Jul 2011 16:50:27 +0900
From: "Martin J. Dürst" <>
Organization: Aoyama Gakuin University
To: Joe Hildebrand <>

Hello Joe,

On 2011/07/22 1:28, Joe Hildebrand wrote:
> On 7/21/11 1:03 AM, "Martin J. Dürst"<>  wrote:
>> Slide 123: Good to see that. By the way, I seem to remember both John and me
>> begging you for an explanation of why Jabber wants to use NFD a few months
>> ago, and I'm not sure I have seen an answer. Now might be a good time (if you
>> already sent one, a pointer would be appreciated).
> Let me try.

Glad to start talking.

> First some assumptions:
> - Stringprep is currently one of the performance hotspots of some XMPP
> servers.

Is that an assumption backed by facts or a wild guess?

> - XMPP does not guarantee that the original form of the address that is
> entered by the user or sent on the first hop is transmitted without
> modification to other hops in the system.
> - As such, many XMPP servers optimize by performing canonicalization at the
> edges of their system and even store the canonical version for future
> comparison.

Makes sense.

> - If the spec is written that clients SHOULD perform canonicalization, many
> in our community will, particularly if they know that they will get better
> performance from the server.

That's the same for NFC and NFD, isn't it? The advantage of NFC is that
it's designed to more-or-less match what's out there, so you get the
advantage that there is more stuff that's already canonicalized even if
a client doesn't do anything. That gets reinforced by both the IETF
and the W3C telling everybody to use NFC.

> The property of NFK?D that we like is that if you have a string of
> codepoints that is already in NFK?D, you can check that the string is in the
> correct normalization form without having to allocate memory.  With NFK?C,
> you'll have to decompose (allocating memory), recompose (at some finite CPU
> cost), then recompose (possibly allocating *again*) just to check if you
> have already done the normalization.

Nope. That's just how the algorithm was defined, because the 
decomposition component was already around, and NFD was what defined 
canonical equivalence.

As an example, has 
some code (both UTF-8 and UTF-16) for compact (memory footprint) and 
reasonably fast NFC check. Please ignore the "XML 1.1" in the title. 
Also please note that this is proof of concept code, and may need 
additional testing and of course upgrade to the newest version of 
Unicode. I don't have any open bug reports, but that may be for other 
reasons than that there are no bugs.

Also, there are various ways to trade off speed against memory (see also 
Björn's mail), but the memory here is just the footprint of the sharable 
data, there's no need for lots of memory per conversion.
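
To make the point concrete, here is a rough Python sketch of a single-pass NFC check in the style of the quick-check algorithm in UAX #15. It is not the proof-of-concept code referred to above. The function names and the way the tables are built are assumptions made for illustration: Python's unicodedata module does not expose the NFC_Quick_Check property, so the sketch derives the "no" and "maybe" sets from the decomposition data once at start-up. After that, checking a string only reads the shared tables and keeps one integer of state.

```python
import unicodedata

def build_nfc_tables():
    """Derive (once) the code points that can never appear in NFC ("no")
    and those that may combine with a preceding character ("maybe")."""
    no, maybe = set(), set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no canonical decomposition mapping
        if unicodedata.normalize("NFC", ch) != ch:
            no.add(ch)  # singleton or excluded from composition
        else:
            parts = decomp.split()
            if len(parts) == 2:
                # second half of a primary composite can combine backwards
                maybe.add(chr(int(parts[1], 16)))
    # Hangul vowel and trailing-consonant jamo compose algorithmically.
    maybe.update(chr(cp) for cp in range(0x1161, 0x1176))
    maybe.update(chr(cp) for cp in range(0x11A8, 0x11C3))
    return frozenset(no), frozenset(maybe)

# Shared, read-only data: built once, used for every check.
NFC_NO, NFC_MAYBE = build_nfc_tables()

def nfc_quick_check(s):
    """Return "yes", "no", or "maybe" without building a new string."""
    last_ccc = 0
    result = "yes"
    for ch in s:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and ccc < last_ccc:
            return "no"  # combining marks out of canonical order
        if ch in NFC_NO:
            return "no"
        if ch in NFC_MAYBE:
            result = "maybe"  # only here is a closer look needed
        last_ccc = ccc
    return result
```

A "yes" or "no" answer is final. Only "maybe" requires further work, and even then only around the characters that triggered it. Text that is already in NFC and contains no combining marks takes the "yes" path.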

> For the K portion, I have found John's argument compelling that codepoints
> with compatibility decompositions should just be prohibited in our
> localparts.

Probably just fine for most cases. Potentially a problem for wide/narrow 
in places like Japan.
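
A small Python illustration of the wide/narrow concern, using only the standard unicodedata module. The helper name is made up for this sketch, and "prohibit" is modelled here simply as "reject any code point that has a compatibility decomposition", which is one possible reading of the proposal rather than agreed text.

```python
import unicodedata

def has_compat_decomposition(s):
    """True if any code point in s has a compatibility decomposition
    (unicodedata marks these with a leading <tag> such as <wide>)."""
    return any(unicodedata.decomposition(ch).startswith("<") for ch in s)

fullwidth_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_ka = "\uff76"   # HALFWIDTH KATAKANA LETTER KA

# NFC leaves both characters alone; NFKC folds them to the ordinary forms.
print(unicodedata.normalize("NFC", fullwidth_a) == fullwidth_a)
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")
print(unicodedata.normalize("NFKC", halfwidth_ka) == "\u30ab")

# Under a "prohibit" rule both would be rejected outright.
print(has_compat_decomposition(fullwidth_a), has_compat_decomposition(halfwidth_ka))
```

Under such a rule, a user whose input method produces fullwidth Latin or halfwidth katakana would have the address refused instead of silently mapped, which is where the trouble in Japan could arise.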

> In our resourceparts, I'm of the opinion that we don't need to
> compatibility map -- it's fine for all of those codepoints to stay distinct.

Yes indeed.

> The idea is that clients SHOULD normalize, servers double-check inputs from
> non-trusted sources (like clients and other servers), then always store and
> forward the normalized version.

That's fine. But it doesn't explain the choice of NFD (vs. NFC).
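
The check-then-normalize flow described in the quoted paragraph works the same way for either form, as this minimal Python sketch suggests. The function name is hypothetical, the form is a parameter because the thread has not settled it, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

def canonicalize_at_edge(address, form="NFC"):
    """Verify input from a non-trusted source; normalize only if needed.

    Returns (canonical_address, was_changed). The canonical value is what
    a server would then store and forward.
    """
    if unicodedata.is_normalized(form, address):
        return address, False  # fast path: client already normalized
    return unicodedata.normalize(form, address), True

# The same call works with either form:
print(canonicalize_at_edge("cafe\u0301", "NFC"))
print(canonicalize_at_edge("caf\u00e9", "NFD"))
```

With either form, well-behaved clients hit the fast path, so this flow by itself does not favour NFD.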

Regards,   Martin.