Re: [xmpp] 6122bis: Unicode versions

Peter Saint-Andre <stpeter@stpeter.im> Tue, 19 July 2011 16:22 UTC


On 7/18/11 11:49 PM, Waqas Hussain wrote:
> On Tue, Jul 19, 2011 at 7:52 AM, Peter Saint-Andre <stpeter@stpeter.im> wrote:
>> On 7/18/11 5:12 PM, Waqas Hussain wrote:
>>>
>>> On Tue, Jul 19, 2011 at 3:16 AM, Peter Saint-Andre <stpeter@stpeter.im> wrote:
>>>>
>>>> On 7/16/11 1:54 PM, Waqas Hussain wrote:
>>>>>
>>>>> On Sat, Jul 16, 2011 at 12:44 AM, Peter Saint-Andre <stpeter@stpeter.im> wrote:
>>>>>>
>>>>>> The good thing about the post-stringprep world is that we have agility
>>>>>> with regard to Unicode versions (a.k.a. "Unicode agility"). No more
>>>>>> being stuck at Unicode 3.2!
>>>>>>
>>>>>> The bad thing is that we have Unicode agility. What if my client (or
>>>>>> your server) has Unicode 5.0 but my server has Unicode 6.0? The parties
>>>>>> might differ in their interpretation of certain code points, causing
>>>>>> problems with authentication, stanza routing, etc.
>>>>>>
>>>>>> We might be able to mitigate these problems if we had a way to discover
>>>>>> which version of Unicode the other side supports.
>>>>>
>>>>> That's the discussion I'm interested in. How can we mitigate Unicode
>>>>> version incompatibility? And also, stringprep and post-stringprep
>>>>> incompatibility? Has there been anything written on this that I can
>>>>> read?
>>>>
>>>> So far, I am not too concerned about incompatibilities between Unicode
>>>> versions. See for example:
>>>>
>>>> https://datatracker.ietf.org/doc/draft-faltstrom-5892bis/
>>>>
>>>> As you can see from that document, only three rather obscure code points
>>>> changed in backward-incompatible ways between Unicode 5.0 and Unicode
>>>> 6.0.
>>>
>>> That's reassuring.
>>>
>>>> Naturally, the changes between Unicode 3.2 (hardcoded into stringprep) and
>>>> Unicode 6.0 were more substantial. Most of those changes were new code
>>>> points, not code points that changed in backward-incompatible ways. During
>>>> the transition from IDNA2003 to IDNA2008, in practice the most troublesome
>>>> code points were:
>>>>
>>>>    00DF (LATIN SMALL LETTER SHARP S)
>>>>    03C2 (GREEK SMALL LETTER FINAL SIGMA)
>>>>
>>>> See http://tools.ietf.org/html/rfc5894#section-7.2 for details (there are
>>>> other troublesome characters, but those were the worst because they were
>>>> more widely deployed). Domain registrars know about those code points and
>>>> probably have special processes for dealing with them.
>>>
>>> I was mainly concerned about things like Jehan's suggestion of
>>> 're-encoding'. I don't think that's somewhere we want to go.
>>
>> Agreed.
>>
>>>>> I don't really see there being a good solution. The best we might
>>>>> reasonably be able to do is handle <jid-malformed/> errors and just
>>>>> accept that either two incompatible entities won't be able to
>>>>> communicate at all,
>>>>
>>>> Unfortunate, but possible in a small number of cases.
>>>>
>>>>> or one entity might see the other entity as
>>>>> multiple JIDs.
>>>>
>>>> How so?
>>>
>>> See below.
>>>
>>>>> A recommendation that servers prep JIDs on all outgoing
>>>>> stanzas might fix the latter.
>>>>
>>>> s/recommendation/requirement/ :)
>>>>
>>>> But yes.
>>>
>>> Note what I mean here is that some servers while verifying JIDs on
>>> outgoing stanzas don't actually replace the to/from unprepped values
>>> with the prepped ones in what gets sent over the wire. So the JID
>>> ABC@example.com is sent as ABC@example.com over the wire, not as
>>> abc@example.com. This will interact badly with IDNA2003 and IDNA2008
>>> having different outputs for a given input.
>>
>> I see your point, and I agree that prepping all outbound JIDs would help to
>> avoid this problem. Paradoxically, prepping inbound JIDs would hurt, not
>> help (see below).
>
> It helps too: I could create the JID fussball@example.com on an
> IDNA2003 server, and send stanzas as both fußball@example.com and
> fussball@example.com. If my server passes the 'from' attribute as-is,
> I can make myself seem like two entities to an IDNA2008 server/client.
> Not too worrying a problem I suppose.

Sorry, I was not clear: I meant prepping on inbound s2s stanzas. The 
principle is that the first server processing a stanza must enforce the 
rules. I think your example of the example.com server allowing a client 
to send stanzas from either fußball@example.com or fussball@example.com 
is wrong, because an XMPP server that implements Nodeprep would have to 
prep the fußball username to fussball when the client sends the stanza.
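
The mapping step described here can be sketched in Python using the
standard library's stringprep module, which exposes the RFC 3454 tables.
This is only a rough illustration, not a complete Nodeprep: the function
name nodeprep_map is hypothetical, and the prohibited-character and bidi
checks that a real profile applies are deliberately left out.

```python
import stringprep
import unicodedata

def nodeprep_map(localpart: str) -> str:
    """Simplified sketch of the Nodeprep mapping step only.

    Applies RFC 3454 Table B.1 (map to nothing) and Table B.2 (case
    folding), then NFKC using the Unicode 3.2 data that stringprep is
    tied to. Prohibited-character and bidi checks are omitted.
    """
    mapped = []
    for ch in localpart:
        if stringprep.in_table_b1(ch):
            # Table B.1: characters that are simply dropped
            # (this includes ZWJ and ZWNJ).
            continue
        # Table B.2: case folding, which maps ß to "ss" and ς to σ.
        mapped.append(stringprep.map_table_b2(ch))
    return unicodedata.ucd_3_2_0.normalize("NFKC", "".join(mapped))

# A server applying this on the client's stanza would rewrite the
# localpart before the stanza ever leaves the first hop.
print(nodeprep_map("fußball"))
print(nodeprep_map("ABC"))
```

Under these assumptions the same function also drops Zero Width Joiner
and Zero Width Non-Joiner and folds Greek final sigma, which covers the
four characters discussed later in this thread.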

>>> Effectively, if given unprepped JID string X, which IDNA2008 preps to
>>> string Y, but IDNA2003 accepts without prepping, and given the above
>>> server behavior, a client could receive stanzas from both X and Y, and
>>> treat them as different JIDs when they are in fact the same
>>> (that's just one example; others, e.g., the reverse, could also be
>>> possible). I haven't verified that this is actually possible, but IIRC
>>> the two specs don't have the same transformations in many cases. How
>>> compatible are the IDNA2003 vs IDNA2008 transformations?
>>
>> As explained in RFC 5894, in fact there are few characters that are
>> interpreted differently in IDNA2008 compared to IDNA2003: Eszett, Greek
>> Final Sigma, Zero Width Joiner, and Zero Width Non-Joiner.
>>
>> So, for instance, in IDNA2003 you could register fussball.de but not
>> fußball.de because ß was mapped to "ss" (see Appendix B of RFC 3454, i.e.
>> "Table B.2" as invoked by Nameprep in RFC 3491). In IDNA2008, ß is a
>> distinct, allowable character, so you can now register fußball.de. Clearly
>> the registrar for .de needs to know this when accepting registrations,
>> because it might want to reserve fußball.de if fussball.de is already
>> registered, automatically assign fußball.de to the registrant for
>> fussball.de, or apply some other policy.
>>
>> Now, the same is true for Nodeprep as used to stringprep the localpart of
>> JIDs -- see Appendix A of RFC 3920. So in the current XMPP network (RFC
>> 6122), you could register an account like fussball@example.com but if you
>> tried to register fußball@example.com it would be stringprepped to
>> fussball@example.com. If we migrate to 6122bis, fußball would be allowed as
>> a username. Therefore a 6122bis-compliant server might allow both accounts
>> to be registered and might route stanzas from both fussball@example.com and
>> fußball@example.com over an s2s link to your server. But if your
>> 6122-compliant server stringpreps JIDs on incoming stanzas then it would
>> consider both of those JIDs to be the same, since it doesn't consider
>> fußball@example.com to be valid (if your server doesn't stringprep JIDs on
>> incoming stanzas then it would return a <jid-malformed/> error instead).
>> Clearly this opens up the possibility of some attacks -- if I know you are
>> subscribed to fussball@example.com for the latest scores, I could register
>> fußball@example.com and send you bogus information.
>
> This makes me a bit nervous.

Me too. But it's unavoidable if we decide to migrate from stringprep to 
PRECIS, and we already have this issue for domain names (it's just that 
we assume the zone administrators will take care of it for us).
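
The fussball/fußball scenario quoted above can be made concrete with a
small sketch. It assumes a receiving server that still uses the old
stringprep-style mapping and silently preps the 'from' address on inbound
s2s stanzas; the helper names old_prep and inbound_key are hypothetical,
and lowercasing the domain stands in for real domain handling.

```python
import stringprep
import unicodedata

def old_prep(localpart: str) -> str:
    """Simplified stringprep-style mapping (Tables B.1 and B.2, then NFKC)."""
    kept = [stringprep.map_table_b2(ch)
            for ch in localpart
            if not stringprep.in_table_b1(ch)]
    return unicodedata.ucd_3_2_0.normalize("NFKC", "".join(kept))

def inbound_key(jid: str) -> str:
    """Key under which a prepping receiver would file a bare JID.

    Assumption: the receiver preps instead of rejecting, and the domain
    is plain ASCII so lowercasing is enough for this illustration.
    """
    local, _, domain = jid.partition("@")
    return f"{old_prep(local)}@{domain.lower()}"

# Two accounts that a 6122bis-style server might treat as distinct.
senders = ["fussball@example.com", "fußball@example.com"]

# The older receiver files both under a single key, so a subscriber to
# the first account cannot tell stanzas from the second one apart.
keys = {inbound_key(jid) for jid in senders}
print(keys)
```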

> I assume this affects more than just XMPP? I'm interested in hearing
> what non-XMPP folks might have to say on the matter.

Well, I think we have more deployment of stringprep than any other 
non-IDN application technology -- I don't hear about many problems with 
non-ASCII characters in the context of IMAP or POP using SASLprep, or of 
LDAPprep, or of the iSCSI stringprep profile. However, this is something 
that definitely needs to be discussed in the PRECIS WG.

>> As mentioned, this applies to four characters that are allowed in IDNA2008
>> and PRECIS (with the caveat that PRECIS isn't done yet!) but that are mapped
>> to other characters (ß mapped to ss, ς mapped to σ) or to nothing (for Zero
>> Width Joiner and Zero Width Non-Joiner) in IDNA2003 and Nodeprep. In these
>> four cases, the post-stringprep technologies are more inclusive and we could
>> have problems of the kind I've outlined above (not "double JIDs" but certain
>> new JIDs registered with 6122bis-compliant servers that would be treated as
>> equivalent to old JIDs by 6122-compliant software). Any migration plan we
>> devise will need to provide guidelines for handling these cases.
>
> +1.
>
>> All of this is a lot easier if existing servers reject JIDs that they
>> consider malformed, instead of prepping them. However, note that RFCs 3920
>> and 6120 don't say that a server MUST reject malformed JIDs, so some
>> existing servers might be liberal in what they accept, which in this case
>> leads to bad consequences.
>
> I think that's what existing servers probably do. We can check/ask
> them. The ones which do stringprep anyway :)

Yes, it would be good to complete a survey of existing server 
implementations on this point.
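
For comparison, the stricter behavior mentioned in the quoted text
(reject rather than prep) might look like the sketch below. It is one
possible policy under stated assumptions, not something RFC 3920 or
RFC 6120 mandates: the function name check_inbound is hypothetical, and
"malformed" is approximated as "the localpart is not already in prepped
form".

```python
import stringprep
import unicodedata

def old_prep(localpart: str) -> str:
    """Simplified stringprep-style mapping (Tables B.1 and B.2, then NFKC)."""
    kept = [stringprep.map_table_b2(ch)
            for ch in localpart
            if not stringprep.in_table_b1(ch)]
    return unicodedata.ucd_3_2_0.normalize("NFKC", "".join(kept))

def check_inbound(jid: str) -> str:
    """Return "ok", or the name of the stanza error a strict server might send.

    Assumption: any localpart that prepping would change is rejected
    instead of being silently rewritten.
    """
    local, _, _domain = jid.partition("@")
    if old_prep(local) != local:
        return "jid-malformed"
    return "ok"

print(check_inbound("fussball@example.com"))
print(check_inbound("fußball@example.com"))
```

With this policy the second address produces an error instead of being
folded into the first, which is the property that makes the migration
easier to reason about. Note that it also rejects unprepped but harmless
forms such as ABC@example.com, so it relies on sending servers prepping
outbound JIDs.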

Peter

-- 
Peter Saint-Andre
https://stpeter.im/