Re: [apps-discuss] [xmpp] i18n intro, Sunday 14:00-16:00

Joe Hildebrand <> Fri, 22 July 2011 15:26 UTC

Date: Fri, 22 Jul 2011 09:26:17 -0600
From: Joe Hildebrand <>
To: "Martin J. Dürst" <>

On 7/22/11 1:50 AM, "Martin J. Dürst" <> wrote:

>> First some assumptions:
>> - Stringprep is currently one of the performance hotspots of some XMPP
>> servers.
> Is that an assumption backed by facts or a wild guess?

I can only talk definitively about one server, but yes, I've got data to
back up that assumption for that one server.

>> - If the spec is written that clients SHOULD perform canonicalization, many
>> in our community will, particularly if they know that they will get better
>> performance from the server.
> That's the same for NFC and NFD, isn't it? The advantage of NFC is that
> it's designed to more-or-less match what's out there, so you get the
> advantage that there is more stuff that's already canonicalized even
> if a client doesn't do anything. That gets reinforced by both the IETF
> and the W3C telling everybody to use NFC.

Absolutely.  However, my point is that our client community, unlike the MUA
community (for example), is more likely to adopt change.

>> The property of NFK?D that we like is that if you have a string of
>> codepoints that is already in NFK?D, you can check that the string is in the
>> correct normalization form without having to allocate memory.  With NFK?C,
>> you'll have to decompose (allocating memory), recompose (at some finite CPU
>> cost), then compare (possibly allocating *again*) just to check if you
>> have already done the normalization.
> Nope. That's just how the algorithm was defined, because the
> decomposition component was already around, and NFD was what defined
> canonical equivalence.
> As an example, has
> some code (both UTF-8 and UTF-16) for compact (memory footprint) and
> reasonably fast NFC check. Please ignore the "XML 1.1" in the title.
> Also please note that this is proof of concept code, and may need
> additional testing and of course upgrade to the newest version of
> Unicode. I don't have any open bug reports, but that may be for other
> reasons than that there are no bugs.

Am I correct that the key bit of your algorithm is:

"""Check for potential combination with starter. The list of recombinations
is calculated so that any combining character that would lead to a change
(full combination, recombination so that the combining character gets
combined in but another combining character is separated, complete
decomposition,...) is listed. 2176 such pairs have been found."""

Can you talk more about that?  Do we expect that future versions of Unicode
will change that set of pairs?
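For comparison, here's the shape of the allocation-free NFD check I was
alluding to above. This is a rough sketch in Python (untested against the
full database, and it leans on Python's `unicodedata` tables rather than the
Unicode quick-check properties): a string is in NFD iff no character carries
a canonical decomposition and combining marks are in canonical order.

```python
import unicodedata

def is_nfd_quick(s: str) -> bool:
    """Check that s is already in NFD without building a normalized copy.

    A string is in NFD iff no character has a canonical decomposition
    and combining classes never decrease within a combining sequence.
    Sketch only; not production code.
    """
    prev_ccc = 0
    for ch in s:
        # Precomposed Hangul syllables decompose algorithmically, so the
        # database lookup below can't be relied on for them.
        if 0xAC00 <= ord(ch) <= 0xD7A3:
            return False
        d = unicodedata.decomposition(ch)
        # A '<tag>' prefix marks a *compatibility* decomposition; only
        # canonical decompositions disqualify a string from being NFD.
        if d and not d.startswith('<'):
            return False
        ccc = unicodedata.combining(ch)
        if ccc != 0 and ccc < prev_ccc:
            return False
        prev_ccc = ccc
    return True
```

Note the scan never allocates a second string, which is the property I care
about for the server-side double-check: e.g. `is_nfd_quick('e\u0301')` passes
while the precomposed `is_nfd_quick('\u00e9')` fails.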

> Probably just fine for most cases. Potentially a problem for wide/narrow
> in places like Japan.

Yup.  That's why the wide/narrow stuff is still on the list of questions to
be addressed with this approach.  I also fully expect that there are N other
codepoints that are marked as having a compatibility decomposition but are
actually in widespread use, and someone's going to complain about them one
day.

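To make the wide/narrow point concrete, here's what the compatibility forms
do to fullwidth/halfwidth characters (Python sketch; codepoints picked purely
for illustration):

```python
import unicodedata

wide_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
half_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA

# The canonical forms (NFD/NFC) leave wide/narrow variants alone...
print(unicodedata.normalize("NFD", wide_a) == wide_a)   # True

# ...but the compatibility forms (NFKD/NFKC) fold them away, losing a
# distinction that is meaningful in everyday Japanese text.
print(unicodedata.normalize("NFKD", wide_a))   # 'A'
print(unicodedata.normalize("NFKD", half_ka))  # KATAKANA LETTER KA (U+30AB)
```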
>> The idea is that clients SHOULD normalize, servers double-check inputs from
>> non-trusted sources (like clients and other servers), then always store and
>> forward the normalized version.
> That's fine. But it doesn't explain the choice of NFD (vs. NFC).

One other point: the re-composition step in the C forms *is* objectively
more difficult to implement, and will *always* take more CPU.  If it's not
required, it should be avoided.  And I haven't seen any compelling arguments
as to why C should be preferred.  What I've seen so far:

- That's the way we've always done it.  Not compelling to me if we're going
to switch away from NFKC in any way.
- Saves a few bytes on the wire.  Not compelling in a world with
compression.  (see:
- More likely to render correctly.  Not with today's font renderers.

Are there any others I'm missing?

Joe Hildebrand