Re: [xmpp] review of draft-ietf-xmpp-6122bis-12

Peter Saint-Andre <stpeter@stpeter.im> Fri, 29 August 2014 17:30 UTC

Message-ID: <5400B894.8050007@stpeter.im>
Date: Fri, 29 Aug 2014 11:29:56 -0600
From: Peter Saint-Andre <stpeter@stpeter.im>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>, "xmpp@ietf.org" <xmpp@ietf.org>, "precis@ietf.org" <precis@ietf.org>
References: <CFFEBEEE.575AE%jhildebr@cisco.com>
In-Reply-To: <CFFEBEEE.575AE%jhildebr@cisco.com>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/xmpp/zmYJLlFs3qakFr9gGmAAtYvb_8A
Subject: Re: [xmpp] review of draft-ietf-xmpp-6122bis-12

Hi Joe, thanks for the review and my apologies for taking a month to reply.

On 7/30/14, 3:25 PM, Joe Hildebrand (jhildebr) wrote:
> The reason the precis group got a spate of questions from me today was that I
> was prepping to do this review.  There are a couple of issues that the
> precis folk should pay more attention to.
>
>   > 1.  Introduction
> ...
>
>   >    Instead, this document builds upon the
>   >    internationalization framework defined by the IETF's PRECIS Working
>   >    Group [I-D.ietf-precis-framework], while attempting to ensure that
>   >    the characters allowed in Jabber IDs under stringprep are still
>   >    allowed and handled in the same way under PRECIS.
>
> "the same way" means more backward-compatibility to me than I think we
> intend here.

Yes, that is a bit vague, even though it does say "attempting". Here is 
one possible approach...

OLD

    Instead, this document builds upon the
    internationalization framework defined by the IETF's PRECIS Working
    Group [I-D.ietf-precis-framework], while attempting to ensure that
    the characters allowed in Jabber IDs under stringprep are still
    allowed and handled in the same way under PRECIS.

NEW

    Instead, this document builds upon the
    internationalization framework defined by the IETF's PRECIS Working
    Group [I-D.ietf-precis-framework].  Although every attempt has been
    made to ensure that the characters allowed in Jabber IDs under
    stringprep are still allowed and handled in the same way under
    PRECIS, there is no guarantee of strict backward compatibility
    because of changes in Unicode and the fact that PRECIS handling is
    based on Unicode properties, not a hardcoded table of characters.

>   > 3.1.  Fundamentals
>   >
>   >       jid           = [ localpart "@" ] domainpart [ "/" resourcepart ]
>   >       localpart     = 1*1023(localpoint)
>   >                       ;
>   >                       ; a "localpoint" is a UTF-8 encoded
>   >                       ; Unicode code point that conforms to
>   >                       ; the "JIDlocalIdentifierClass" profile
>   >                       ; of the PRECIS IdentifierClass
>   >                       ;
>
> This implies 1023 codepoints, not 1023 bytes to me. Same issue for ifqdn
> and resourcepart.  6122 just had 1*; I think going back to that would be
> fine since we have a rule below that captures the max size.

Your proposal seems fine to me, too. It's hard to capture these nuances 
in ABNF, at times. Although, from later in the thread, "localbyte" would 
work for me.
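
To illustrate why the distinction between code points and octets
matters for the 1023 limit, here is a minimal Python sketch. The
function names are hypothetical, and the limit is checked on the UTF-8
encoded form, as the length rules later in the draft describe:

```python
def utf8_octets(part: str) -> int:
    """Return the length of a JID part in octets once encoded as UTF-8."""
    return len(part.encode("utf-8"))


def within_length_limits(part: str, max_octets: int = 1023) -> bool:
    """Check the 1..1023 rule against octets, not against code points."""
    return 1 <= utf8_octets(part) <= max_octets


# 600 code points, each of which needs two octets in UTF-8.
localpart = "\u00e9" * 600
print(len(localpart), utf8_octets(localpart), within_length_limits(localpart))
```

A string can therefore satisfy a "1*1023" repetition counted in code
points and still exceed the octet limit.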

>   > 3.2.  Domainpart
>   >
>   >    The domainpart of a JID is that portion after the '@' character (if
>   >    any) and before the '/' character (if any); it is the primary
>
> I think it's often surprising to people that foo/@bar is a valid JID with
> "foo" as the domainpart and "@bar" as the resourcepart.  The text above,
> although pulled from 6122, might be better as:
>
> The domainpart of a JID is that portion after the first '@' character (if
> any) and before the first '/' character (if any);

That's acceptable to me.

> and possibly adding the example.

Examples are good. I'll add a few to Table 1.
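
The "first '@' before the first '/'" reading can be sketched in Python
as follows. This is only an illustration of the splitting order (the
function name is made up, and no validation or length checking is
done):

```python
from typing import Optional, Tuple


def split_jid(jid: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split a JID into (localpart, domainpart, resourcepart).

    The first '/' ends the bare JID, so any '@' after it belongs to
    the resourcepart. Only the first '@' before that '/' separates
    the localpart from the domainpart.
    """
    bare, slash, resource = jid.partition("/")
    local, at, domain = bare.partition("@")
    if not at:
        # No '@' before the first '/': the whole bare part is the domainpart.
        return None, bare, (resource if slash else None)
    return local, domain, (resource if slash else None)


for example in ("juliet@example.com/balcony", "example.com", "foo/@bar"):
    print(example, "->", split_jid(example))
```

With this order of operations, "foo/@bar" yields a domainpart of "foo"
and a resourcepart of "@bar", with no localpart.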

>   >    In general, the content of a domainpart is an Internationalized
>   >    Domain Name ("IDN") as described in the specifications for
>   >    Internationalized Domain Names in Applications (commonly called
>   >    "IDNA2008"), and a domainpart is an "IDNA-aware domain name slot" as
>   >    defined in [RFC5890].  The following rules apply to a domainpart that
>   >    consists of a fully-qualified domain name and MUST be applied in the
>   >    following order:
>
> When do these rules need to be applied? Only before comparison or routing?

That is a very good question.

This might be a difference between the "preparation" and "comparison" of 
the PRECIS acronym.

You'll notice that the PRECIS nickname spec draws a sharper distinction 
between preparation and comparison than the others:

http://www.ietf.org/archive/id/draft-ietf-precis-nickname-09.txt

Section 2 there says in part:

    For preparation purposes (most commonly, when a chatroom client
    generates a nickname from user input for inclusion as a protocol
    element that represents a "nickname slot"), an application MUST at a
    minimum ensure that the string conforms to the "FreeformClass" string
    class defined in [I-D.ietf-precis-framework]; however, it MAY in
    addition perform the normalization and mapping operations specified
    below for comparison purposes.

    For comparison purposes (e.g., when a chatroom server determines if
    two nicknames are in conflict during the authorization process), an
    application MUST treat a nickname as specified below (these rules
    constitute the "NicknameFreeformClass" profile).  The operations
    specified MUST be completed in the order shown (in particular,
    normalization MUST be performed after the other mapping steps and
    before validity-checking against the definition of the PRECIS
    "FreeformClass", consistent with [I-D.ietf-precis-framework]).

    [various rules elided]

I wonder if we want to say, in general, that there is something of a 
lower bar for preparation than for comparison. For example, for an XMPP 
localpart we might say that an entity doing preparation just needs to 
ensure that it doesn't include any characters outside of the PRECIS 
IdentifierClass, whereas an entity doing comparison needs to apply the 
normalization and mapping rules. The primary reason we might do this is 
that it could ease the burden on XMPP clients or servers during certain 
operations, whereas at those times when comparison is truly needed 
(e.g., when user authentication or authorization are being made) the 
full set of rules would be applied.

Although I'm not entirely comfortable with this approach, pragmatically 
it might be more acceptable than saying that all entities must apply all 
of the rules all of the time.

This is related to text in Section 4:

    Enforcement of the XMPP address format rules is the responsibility of
    XMPP servers.  Although XMPP clients SHOULD prepare complete JIDs and
    parts of JIDs in accordance with this document before including them
    in protocol slots within XML streams (such that JIDs and parts of
    JIDs are in conformance), XMPP servers MUST enforce the rules
    wherever possible and reject stanzas and other XML elements that
    violate the rules (for stanzas, by returning a <jid-malformed/> error
    to the sender as described in Section 8.3.3.8 of [RFC6120]).

That text seems to imply the same principle: clients prepare and servers 
enforce (by means of comparison?). But I think we could be clearer about 
the whole matter by explicitly saying that enforcement includes 
application of all the rules (just as comparison does - it's just that 
comparison involves applying all of the rules to two strings in order to 
determine if they are "equivalent", whereas enforcement involves applying 
the rules to a single string).

>   >    1.  The domainpart MUST contain only NR-LDH labels and U-labels as
>   >        defined in [RFC5890] and MUST consist only of Unicode code points
>   >        that conform to the rules specified in [RFC5892] (which includes
>   >        Unicode normalization).  This implies that the domainpart MUST
>   >        NOT include A-labels as defined in [RFC5890]; each A-label MUST
>   >        be converted to a U-label during preparation of a domainpart, and
>   >        comparison MUST be performed using U-labels, not A-labels.
>
> This seems like an always rule, including for dumb clients.

Things are a bit more clear-cut with regard to rules that are based on 
PRECIS, not IDNA, because the models are slightly different. In PRECIS 
we have base string classes (IdentifierClass and FreeformClass), so it 
might make sense to say that preparation involves ensuring that the 
preparing entity doesn't allow in any code points that are disallowed 
for that base string class. We don't have base string classes in IDNA. 
Although the foregoing rule is similar to the base string class idea, it 
goes beyond by including normalization. I'd almost prefer that we figure 
this out very clearly first for PRECIS-based identifiers (in XMPP, the 
localpart and resourcepart) and then see how the resulting text can be 
ported over to our use of IDNA-based identifiers (in XMPP, the domainpart).
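
For the A-label part of rule 1 quoted above, the conversion step itself
can be sketched with Python's built-in "punycode" codec. This is a
rough illustration only: the function names are hypothetical, and the
sketch does not perform any of the IDNA2008 validity checks from
[RFC5892] that a real implementation would need.

```python
def to_u_label(label: str) -> str:
    """Decode a label carrying the ACE prefix "xn--"; leave others as-is."""
    if label[:4].lower() == "xn--":
        return label[4:].encode("ascii").decode("punycode")
    return label


def domainpart_with_u_labels(domainpart: str) -> str:
    """Apply to_u_label to each dot-separated label of a domainpart."""
    return ".".join(to_u_label(label) for label in domainpart.split("."))


print(domainpart_with_u_labels("xn--mnchen-3ya.example"))
```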

>   >    2.  All uppercase and titlecase code points within the domainpart
>   >        MUST be mapped to their lowercase equivalents, preferably using
>   >        Unicode Default Case Folding as defined in Chapter 3 of the
>   >        Unicode Standard [UNICODE].
>
> Dumb clients might get away with this and the system would still work.
>
>   >    3.  Fullwidth and halfwidth characters within the domainpart MUST be
>   >        mapped to their decomposition mappings.
>
> Dumb clients have no shot at this one.

Right - in the emerging approach we're exploring here, the latter two 
rules would be a matter of enforcement and comparison only, not of 
preparation.

>   >       Implementation Note: The foregoing order is different from the
>   >       order for localparts and resourceparts as described below, to
>   >       maintain consistency with the IDNA methods in both [RFC5892] and
>   >       [RFC5895].
>   >
>   >    After any and all normalization, conversion, and mapping of code
>   >    points,
>
> as well as conversion to UTF-8.

True, although we kind of assume that in the XMPP world because all data 
sent over an XMPP stream is required to be UTF-8. Mentioning it seems 
useful, though.

>   >    a domainpart MUST NOT be zero octets in length and MUST NOT
>   >    be more than 1023 octets in length.  (Naturally, the length limits of
>   >    [RFC1034] apply, and nothing in this document is to be interpreted as
>   >    overriding those more fundamental limits.)
>   >
>   > 3.3.  Localpart
>   >
>   >    The localpart of a JID is an optional identifier placed before the
>   >    domainpart and separated from the latter by the '@' character.
>   >    Typically a localpart uniquely identifies the entity requesting and
>   >    using network access provided by a server (i.e., a local account),
>   >    although it can also represent other kinds of entities (e.g., a chat
>   >    room associated with a multi-user chat service [XEP-0045]).  The
>   >    entity represented by an XMPP localpart is addressed within the
>   >    context of a specific domain (i.e., <localpart@domainpart>).
>   >
>   >    A localpart MUST NOT be zero octets in length and MUST NOT be more
>   >    than 1023 octets in length.  This rule is to be enforced after any
>   >    normalization and mapping of code points.
>
> and conversion to UTF-8.

As above.

>   >    A localpart MUST consist only of Unicode code points that conform to
>   >    the "JIDlocalIdentifierClass" profile of the "IdentifierClass" base
>   >    string class defined in [I-D.ietf-precis-framework].  The
>   >    JIDlocalIdentifierClass profile includes all code points allowed by
>   >    the IdentifierClass base class, with the exception of the following
>   >    characters that are explicitly disallowed in XMPP localparts:
>
> (special precis focus)
> I would have expected this to be phrased more similarly to step 2 of
> http://tools.ietf.org/html/draft-ietf-precis-framework-17#section-5, or
> for section 5 to just have a step about codepoints forbidden in a given
> usage of the selected precis class.

Good point - I agree that more internal harmony would be helpful here 
between the framework and the various profiles.

>   >    The normalization and mapping rules for the JIDlocalIdentifierClass
>   >    are as follows, where the operations specified MUST be completed in
>   >    the order shown:
>
> Again, I think we need language about when these rules are applied.  The
> rest of the section is about what is allowed, not about how to compare.

As discussed above, I think we need to more clearly delineate what's 
required for preparation, what's required for enforcement, and what's 
required for comparison. And, as mentioned, it seems to me right now that the 
same rules are involved in enforcement and comparison, except that 
applying those rules during enforcement is a way to determine if a 
single string conforms, whereas applying those rules during comparison 
is a way to determine if two strings are "equivalent". That said, your 
use of the phrase "about what is allowed, not about how to compare" 
might suggest that more is involved in comparison than in enforcement.

To choose a simple example, is the JID <StPeter@jabber.org> "allowed" if 
the jabber.org server enforces all the rules for a localpart? It seems 
to me not. We're saying that a client could send that (since both "S" 
and "P" are allowed by the category "Lu - Uppercase_Letter" and thus 
would pass the preparation test), but that a server which is enforcing 
the rules would map "S" to "s" and "P" to "p". However, the rules for 
comparison are the same as for enforcement: "StPeter" and "stpeter" 
would compare as equivalent.
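
A minimal Python sketch of that three-way split, under loud
assumptions: DISALLOWED is a tiny hypothetical stand-in for "code
points outside the PRECIS IdentifierClass" (the real class is derived
from Unicode properties), and case folding is the only mapping rule
applied. The function names are also made up for illustration.

```python
# Hypothetical stand-in; NOT the real PRECIS IdentifierClass exclusions.
DISALLOWED = {"@", "/", " "}


def passes_preparation(localpart: str) -> bool:
    """Lower bar: only check that no disallowed character is present."""
    return not any(ch in DISALLOWED for ch in localpart)


def enforce(localpart: str) -> str:
    """Apply the rules to a single string and return the mapped form."""
    if not passes_preparation(localpart):
        raise ValueError("localpart contains a disallowed character")
    return localpart.casefold()


def equivalent(a: str, b: str) -> bool:
    """Apply the same rules to two strings and compare the results."""
    return enforce(a) == enforce(b)


print(passes_preparation("StPeter"), enforce("StPeter"),
      equivalent("StPeter", "stpeter"))
```

In this sketch comparison is nothing more than enforcement applied
twice, which is the relationship described above.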

>   >    1.  Fullwidth and halfwidth characters MUST be mapped to their
>   >        decomposition mappings.
>   >
>   >    2.  Uppercase and titlecase characters MUST be mapped to their
>   >        lowercase equivalents, preferably using Unicode Default Case
>   >        Folding as defined in Chapter 3 of the Unicode Standard
>   >        [UNICODE].
>
> Nothing about SpecialCasing?

That's a question for the WG. :-)

The PRECIS framework states:

    If case mapping is desired (instead of case preservation), it is
    RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
    3 of the Unicode Standard [Unicode6.3].

       Note: Unicode Default Case Folding is not designed to handle
       various localization issues (such as so-called "dotless i" in
       several Turkic languages).  The PRECIS mappings document
       [I-D.ietf-precis-mappings] describes these issues in greater
       detail and defines a "local case mapping" method that handles some
       locale-dependent and context-dependent mappings.

Given the discussions in recent PRECIS WG meetings, I would shy away 
from applying locale-dependent and context-dependent mappings in XMPP 
localparts. However, I'm open to argument.
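
To make the "dotless i" issue concrete, here is a small Python sketch
using str.casefold(), which is documented as following the Unicode
case folding algorithm and takes no locale into account. It is only an
illustration of locale-independent folding, not of the "local case
mapping" method in the mappings document.

```python
samples = {
    "I": "LATIN CAPITAL LETTER I",
    "\u0130": "LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "\u0131": "LATIN SMALL LETTER DOTLESS I",
}

for ch, name in samples.items():
    folded = ch.casefold()
    # Show the folded result as a list of code points.
    print(name, "->", [f"U+{ord(c):04X}" for c in folded])

# A Turkic-aware mapping would pair "I" with dotless "i" (U+0131);
# the locale-independent folding does not.
print("I".casefold() == "\u0131".casefold())
```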

>   >    A resourcepart MUST NOT be zero octets in length and MUST NOT be more
>   >    than 1023 octets in length.  This rule is to be enforced after any
>   >    normalization and mapping of code points.
>   >
>   >    A resourcepart MUST consist only of Unicode code points that conform
>   >    to the "JIDresourceFreeformClass" profile of the "FreeformClass" base
>   >    string class defined in [I-D.ietf-precis-framework].
>   >
>   >    The normalization and mapping rules for the resourcepart of a JID are
>   >    as follows, where the operations specified MUST be completed in the
>   >    order shown:
>
> Again, when are the rules applied?

See above.

>   >    1.  Fullwidth and halfwidth characters MAY be mapped to their
>   >        decomposition mappings.
>
> (precis)
> I need a hint as to when to do this.  "MAY" isn't nearly enough.

Do you mean "when" as "in what contexts is it smart to do width mapping 
on resourceparts" or as something else (e.g., "when" could mean "by 
which entities" such as clients, servers, and XMPP "components")?

Later in this thread, you and Florian Zeitz seem to think that MUST NOT 
perform width mapping is the right approach.

However, resourceparts are used in multiple contexts (we could say that 
there are multiple "resourcepart slots").

For the JIDs of connected resources (user@domain/foo), I tend to agree.

For the JIDs of chatroom participants, the precis-nickname spec says to 
use NFKC, which handles width mapping as part of normalization (and thus 
might be taken to violate the proposed MUST NOT approach).

I haven't yet taken the time to find and think about other resourcepart 
slots in various XMPP extensions, but I hesitate to make a categorical 
statement in 6122bis since the applicability of width mapping might 
depend on the context in which a resourcepart is used.
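
The interaction between NFKC and width mapping can be seen with
Python's unicodedata module. A short sketch (the sample string is
arbitrary):

```python
import unicodedata

# FULLWIDTH LATIN SMALL LETTER F, O, O
fullwidth = "\uff46\uff4f\uff4f"

nfc = unicodedata.normalize("NFC", fullwidth)
nfkc = unicodedata.normalize("NFKC", fullwidth)

# NFC leaves the fullwidth forms alone; NFKC folds them to ASCII.
print(nfc == fullwidth, nfkc == "foo")
```

So a slot that normalizes with NFKC performs width mapping implicitly,
whether or not a separate width-mapping step is specified.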

>   >    2.  Map any instances of non-ASCII space to ASCII space (U+0020).
>
> (precis)
> I was hoping either the framework doc or the mappings doc would tell me
> more about which characters to map here.  RFC 3454 had table C.1.2, but I
> don't see any hints about what I'm supposed to do now.

Good catch.

> Is the rule "has a
> compatibility mapping to U+0020"?

BTW I count at least three kinds of compatibility mapping to 0020: 
<compat> (as in U+0384 GREEK TONOS), <noBreak> (as in U+2007 FIGURE 
SPACE), and <wide> (as in U+3000 IDEOGRAPHIC SPACE).

> That doesn't hit U+200B which is in
> C.1.2,

Right. I am not sure whether ZERO WIDTH SPACE really ought to be mapped 
to U+0020. See Florian's comment later in this thread.

> nor does "has category Zs".

IMHO that is insufficient.

My intuition is that by "non-ASCII space" we mean anything that has a 
compatibility mapping of any kind to U+0020, since that seems safest (it 
casts a wider net) and is something we can apply in a programmatic way. 
However, my intuitions are not always correct, and applying this rule 
would result in a larger table than what we find in Appendix C.1.2 
of RFC 3454.
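
The candidate rules can be compared programmatically. The following
Python sketch uses the unicodedata module; the helper names are
hypothetical, and the answers depend on the Unicode version shipped
with the Python build, so it is a way to explore the question rather
than a definition of the table.

```python
import unicodedata


def decomposition_tag(ch: str) -> str:
    """Return the compatibility tag (e.g. "<wide>") of a character, or ""."""
    fields = unicodedata.decomposition(ch).split()
    return fields[0] if fields and fields[0].startswith("<") else ""


def maps_to_space(ch: str) -> bool:
    """True if full compatibility decomposition (NFKD) is exactly U+0020."""
    return ch != " " and unicodedata.normalize("NFKD", ch) == " "


def begins_with_space(ch: str) -> bool:
    """True if NFKD yields U+0020, possibly followed by other code points."""
    return ch != " " and unicodedata.normalize("NFKD", ch).startswith(" ")


for cp in (0x00A0, 0x0384, 0x1680, 0x2007, 0x200B, 0x3000):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch),
          decomposition_tag(ch) or "-",
          maps_to_space(ch), begins_with_space(ch))
```

Whether "any kind" of compatibility mapping means "decomposes to
exactly U+0020" or "decomposes to something beginning with U+0020"
changes the result for characters such as U+0384, which is one reason
the rule would need to be stated precisely.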

> draft-ietf-precis-mappings says
> "Therefore, the special mapping table should be based on a well-
>     defined mapping table for each protocol", which although I don't
> particularly like, I can live with - but we need the table here.

Do you feel that we need the table in 6122bis or in the framework? As 
you say, the mappings document implies that each specification that 
defines a rule like "map non-ASCII space to ASCII space" needs to define 
its own table, but that seems like a recipe for trouble. If, say, SASL 
and XMPP and LDAP each defines a different table, authentication might 
become confusing (especially since XMPP uses SASL and authentication 
might be based on an LDAP lookup).

>   >    3.  So-called additional mappings MAY be applied, such as mapping of
>   >        characters that are similar to common delimiters (such as '@',
>   >        ':', '/', '+', '-', and '.', e.g., mapping of IDEOGRAPHIC FULL
>   >        STOP (U+3002) to FULL STOP (U+002E)) and special handling of
>   >        certain characters or classes of characters (e.g., mapping of
>   >        non-ASCII spaces to ASCII space); the PRECIS mappings document
>   >        [I-D.ietf-precis-mappings] describes such mappings in more
>   >        detail.
>   >
>   >    4.  Uppercase and titlecase characters MAY be mapped to their
>   >        lowercase equivalents, preferably using Unicode Default Case
>   >        Folding as defined in Chapter 3 of the Unicode Standard
>   >        [UNICODE].
>
> Again, I need more about the MAY here.
>
>   > 6.  IANA Considerations
>   >
>   >    The following completed templates provide the information necessary
>   >    for the IANA to add 'JIDlocalIdentifierClass' and
>   >    'JIDresourceFreeformClass' to the PRECIS Profiles Registry.
>
> Should we also ask them to mark the status of nodeprep and resourceprep to
> deprecated in the stringprep profiles registry?

Yes.

>   > Appendix A.  Differences from RFC 6122
>   >
>   >    Based on consensus derived from working group discussion,
>   >    implementation and deployment experience, and formal interoperability
>   >    testing, the following substantive modifications were made from RFC
>   >    6122.
>
> I think it might be nice to point out that this may have made
> previously-valid JIDs no longer valid (or vice-versa), and that we suggest
> careful testing before migrating user data.

+1 to at least that text. Ideally we'd perform the kind of analysis that 
Takahiro Nemoto performed for SASLprep vs. SASLprepbis:

http://www.ietf.org/mail-archive/web/precis/current/msg00790.html

I haven't done that yet, though.

Thanks again to you and Florian for your careful reviews.

Peter