Re: [precis] Review of current document draft-ietf-precis-framework-20

Peter Saint-Andre - &yet <peter@andyet.net> Mon, 08 December 2014 03:49 UTC

Message-ID: <54851FAD.4040009@andyet.net>
Date: Sun, 07 Dec 2014 20:49:01 -0700
From: Peter Saint-Andre - &yet <peter@andyet.net>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Patrik Fältström <paf@frobbit.se>, precis@ietf.org
References: <13AD33F4-1741-472E-8423-1B0302734325@frobbit.se>
In-Reply-To: <13AD33F4-1741-472E-8423-1B0302734325@frobbit.se>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: http://mailarchive.ietf.org/arch/msg/precis/t8tzNW0CMVVsoFKF80kzA0hT0OU
Subject: Re: [precis] Review of current document draft-ietf-precis-framework-20

On 11/23/14, 7:58 AM, Patrik Fältström wrote:
> I have on request from the chairs done a review of three Precis
> documents. Here is the review of draft-ietf-precis-framework-20.

Patrik, many thanks for the thorough review!

> The review has focused on the use of Unicode and the relationship
> with IDNA2008.
>
> You see my comments below.
>
> Best, Patrik Fältström
>
>> draft-ietf-precis-framework-20
>>
>> Abstract
>>
>> Application protocols using Unicode characters in protocol strings
>> need to properly handle such strings in order to enforce
>> internationalization rules for strings placed in various protocol
>> slots (such as addresses and identifiers) and to perform valid
>> comparison operations (e.g., for purposes of authentication or
>> authorization).  This document defines a framework enabling
>> application protocols to perform the preparation, enforcement, and
>> comparison of internationalized strings ("PRECIS") in a way that
>> depends on the properties of Unicode characters and thus is agile
>> with respect to versions of Unicode.  As a result, this framework
>> provides a more sustainable approach to the handling of
>> internationalized strings than the previous framework, known as
>> Stringprep (RFC 3454).  This document obsoletes RFC 3454.
>
> Explanation on what my review is concentrating on:
>
> When using a character set like Unicode, where things like
> transformation and comparison are involved, the actual transformation can happen
> in multiple locations in the architecture, and what is needed is for
> applications to understand what format the strings are that are
> received and what format strings are expected to be in when being
> sent. Of course the principle of "be liberal in what you accept, and
> conservative in what you send" is very important.
>
> A simplified sketch of the architecture is as follows:
>
> 1. [A] sends a string to [B] for storage
>
> 2. [C] sends a string to [B] for lookup, which implies a matching
> algorithm is applied
>
> 3. If there is a match, data is sent back to [C]
>
> In this very simple way of looking at the issues the questions
> include:
>
> A. What Unicode code points should [A] accept as input
>
> B. What transformation is [A] expected to do?
>
> C. What transformation is [A] allowed to do?
>
> D. What Unicode code points is [A] allowed to send to [B]?
>
> E. What Unicode code points can [B] expect from [A]?
>
> F. What transformation is [B] expected to do on data sent from [A]
> before data is stored in the database?
>
> : :
>
> I.e. it must be as clear as possible for each one of the parties A, B
> and C what they are expected to do, what they must (and must not)
> do.

For the profiles and application usages that go along with the framework 
(draft-ietf-precis-saslprepbis, draft-ietf-precis-nickname, 
draft-ietf-xmpp-6122bis), we have tried to address the considerations 
you raise. However, I think that in several places we could do a better 
job, such as:

i. Explain not only what entities are expected to do, but also what they 
are allowed to do.

ii. Specify what various using applications need to do (e.g., I don't 
think we've made that fully clear in the saslprepbis and nickname 
documents, although I think it is very clear in 6122bis).

These are matters that we can focus on during WGLC for those specs, I think.

> And of course ultimately the issue/trouble is that the Unicode
> Character Set is created and designed in such a way that there are
> many equivalences that humans in various contexts do expect are to be
> treated as "the same". Of course different in different contexts.

Indeed.

> Now to the review...
>
>> 4.  Enable application protocols to define profiles of the PRECIS
>> string classes if necessary (addressing matters such as width
>> mapping, case mapping, Unicode normalization, and directionality)
>> but strongly discourage the multiplication of profiles beyond
>> necessity in order to avoid violations of the Principle of Least
>> User Astonishment.
>
> It must also be clear who has the responsibility to do whatever
> transformations needed.

Yes. See above and below in this message, other messages in this thread, 
and the forthcoming revised I-D.

>> It is expected that this framework will yield the following
>> benefits:
>>
>> o  Application protocols will be agile with regard to Unicode
>> versions.
>>
>> o  Implementers will be able to share code point tables and
>> software code across application protocols, most likely by means
>> of software libraries.
>>
>> o  End users will be able to acquire more accurate expectations
>> about the characters that are acceptable in various contexts.
>> Given this more uniform set of string classes, it is also expected
>> that copy/paste operations between software implementing different
>> application protocols will be more predictable and coherent.
>
> It must also to everyone involved be clear what is the normative
> authoritative source for what is allowed and not.
>
> For IDNA2008 (for example) it is the _algorithm_ that is normative.
> Not any tables that are derivatives from applying the algorithm on a
> specific version of the Unicode Character Set.

Agreed. That's why the introduction states:

    The character categories and calculation rules defined under
    Section 7 and Section 8 are normative and apply to all Unicode code
    points.  The code point table that results from applying the
    character categories and calculation rules to the latest version of
    Unicode can be found in an IANA registry.

>> When an application applies a profile of a PRECIS string class, it
>> can achieve the following objectives:
>>
>> a.  Determine if a given string conforms to the profile, thus
>> enabling enforcement of the rules (e.g., to determine if a string
>> is allowed for use in the relevant protocol slot specified by an
>> application protocol).
>>
>> b.  Determine if any two given strings are equivalent, thus
>> enabling comparison (e.g., to make an access decision for purposes
>> of authentication or authorization as further described in
>> [RFC6943]).
>
> And of course applying transformation on a string being received
> before it is passed on to the next step of whatever process the
> application is participating in. In this case, the string is in one
> form before the transformation and another form after the
> transformation. It must also be clear to everyone involved that
> transformations applied very seldom (if at all) are reversible.
> Specifically this is the case when doing case folding
> transformations.

Ack.
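
To make the irreversibility point concrete, here is a minimal Python sketch. It uses only the standard library; the sample strings are illustrative assumptions, not examples taken from the framework.

```python
import unicodedata

# Case folding maps several distinct inputs to one output, so the
# original string cannot be recovered from the folded form.
inputs = ["Straße", "STRASSE", "strasse"]
folded = {s.casefold() for s in inputs}

# Compatibility normalization (NFKC) is lossy in the same way:
# the ligature and the two-letter sequence become indistinguishable.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
normalized = unicodedata.normalize("NFKC", ligature)

print(folded)      # a single folded form for all three inputs
print(normalized)  # two ASCII letters
```

Once a string has passed through either step, an entity further down the chain has no way to tell which of the original forms was entered.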

>> 3.  Preparation, Enforcement, and Comparison
>>
>>
>> This document distinguishes between three different actions that
>> an entity can take with regard to a string:
>>
>> o  Enforcement entails applying all of the rules specified for a
>> particular string class or profile thereof to an individual string,
>> for the purpose of determining if the string can be used in a given
>> protocol slot.
>>
>> o  Comparison entails applying all of the rules specified for a
>> particular string class or profile thereof to two separate strings,
>> for the purpose of determining if the two strings are equivalent.
>
> In fact I think the "comparison" entails three steps:
>
> 1. Apply all transformation and enforcement on string A.
>
> 2. Apply all transformation and enforcement on string B.
>
> 3. Compare the strings A and B unicode character by character. Only
> if all characters are the same, a positive match is the result.

Agreed. I'll adjust the text along those lines.
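
The three steps could be sketched roughly as follows in Python. The enforce() function is a hypothetical stand-in for whatever rules a real profile specifies (here, simply case folding followed by NFC); it is not the behavior of any actual PRECIS profile.

```python
import unicodedata


def enforce(s: str) -> str:
    """Hypothetical stand-in for a profile's rules: fold case, then NFC."""
    return unicodedata.normalize("NFC", s.casefold())


def precis_equal(a: str, b: str) -> bool:
    # Steps 1 and 2: transform and enforce each string independently.
    ea, eb = enforce(a), enforce(b)
    # Step 3: compare code point by code point; every position must match.
    return len(ea) == len(eb) and all(x == y for x, y in zip(ea, eb))


# Precomposed vs. decomposed input, differing also in case.
print(precis_equal("F\u00e4ltstr\u00f6m", "FA\u0308LTSTRO\u0308M"))
```

The point of the structure is that the final step is a plain code point comparison; all of the Unicode-specific work happens before it.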

>> o  Preparation entails only ensuring that the characters in an
>> individual string are allowed by the underlying PRECIS string
>> class.
>
> I think the idea with "preparation" is to apply certain
> transformation and to, after transformation, ensure all characters in
> the context they exist, are allowed, so that the final string after
> the preparation step is a valid precis string?

During discussion of draft-ietf-xmpp-6122bis in the XMPP WG, several 
commenters noted that they would prefer to assign more lightweight 
processing to entities that are not capable of full enforcement of all 
the rules for a profile (i.e., not heavy or advanced tasks like Unicode 
normalization). Out of that conversation emerged the concept of 
"preparation" as involving only limits on the character ranges.

Perhaps we could have chosen a less ambiguous term than "preparation"?

> I would recommend explicitly mentioning the fact (destructive)
> transformation might occur in this step.

Destructive transformation doesn't sound good. Do you have examples of 
what that means?

>> In most cases, authoritative entities such as servers are
>> responsible for enforcement, whereas subsidiary entities such as
>> clients are responsible only for preparation.  The rationale for
>> this distinction is that clients might not have the facilities (in
>> terms of device memory and processing power) to enforce all the
>> rules regarding internationalized strings (such as width mapping
>> and Unicode normalization), although they can more easily limit the
>> repertoire of characters they offer to an end user.  By contrast,
>> it is assumed that a server would have more capacity to enforce the
>> rules, and in any case acts as an authority regarding allowable
>> strings in protocol slots such as addresses and endpoint
>> identifiers.  In addition, a client cannot necessarily be trusted
>> to properly generate such strings, especially for
>> security-sensitive contexts such as authentication and
>> authorization.
>
> This paragraph is very vague. I think the protocol need a much
> stricter specification on who is expected to do what. This because
> the protocol itself (that is for example between client and server)
> must be robust enough to carry whatever code points the client is
> using.

Yes, and that's what each application usage or profile needs to specify 
very clearly. We can provide some guidelines for such text in the 
framework, but I don't think we can provide that text for all 
applications here.

>> Valid:  Defines which code points and character categories are
>> treated as valid input to the string.
>
> The term "input" is not clear to me, given transformation might
> occur.

Good point. Will fix.

>> Disallowed:  Defines which code points and character categories
>> need to be excluded from the string.
>
> It is a bit confusing to talk about both categories and code points
> at the same time. I would recommend in this point in time in the
> document talk about what code points are disallowed. Reason for this
> is that you might have a category that is disallowed while the code
> point that is of that category is allowed (based on other rules, like
> exceptions). To make it crystal clear what is disallowed, I recommend
> only use that term for code points.

Yes, we'll change that.

>> 4.2.1.  Valid
>>
>>
>> o  Code points traditionally used as letters and numbers in
>> writing systems, i.e., the LetterDigits ("A") category first
>> defined in [RFC5892] and listed here under Section 8.1.
>>
>> o  Code points in the range U+0021 through U+007E, i.e., the
>> (printable) ASCII7 ("K") rule defined under Section 8.11.  These
>> code points are "grandfathered" into PRECIS and thus are valid even
>> if they would otherwise be disallowed according to the
>> property-based rules specified in the next section.
>>
>> Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
>> category from IDNA2008, the range of characters allowed in the
>> IdentifierClass is wider than the range of characters allowed in
>> IDNA2008.  The main reason is that IDNA2008 applies the Unstable
>> category before the LetterDigits category, thus disallowing
>> uppercase characters, whereas the IdentifierClass does not apply
>> the Unstable category.
>
> You must remove the code points of class ("C") in RFC5892.

The only difference between class "C" (IDNA2008) and class "M" (PRECIS) 
is that in PRECIS we allow white space code points in the FreeformClass. 
Thus we use "M", not "C".

> Or to state things differently. If one look at the Unicode tables,
> the following combination of matches exists for code points that
> matches category "A" and at least one more category, for Unicode
> 7.0.0:
>
> AB ABC ABF AC ACI AD AE AF AI
>
> Several of these combinations are valid under this definition, yet I
> would not say they are recommended for use in identifiers.

OK, I will look at this more closely.
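
As a rough aid for that closer look, the overlap between categories can be computed rather than read from a table. The sketch below covers only three rules: LetterDigits ("A") and Unstable ("B") as defined in RFC 5892, and ASCII7 ("K") as quoted above. The function names are my own, and the results depend on the Unicode version bundled with the Python build (see unicodedata.unidata_version), which is exactly why the rules, not a table, need to be normative.

```python
import unicodedata

# RFC 5892, Section 2.1: General_Category(cp) is in this set.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_letter_digits(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


def is_unstable(ch: str) -> bool:
    # RFC 5892, Section 2.2: toNFKC(toCaseFold(toNFKC(cp))) != cp
    nfkc = unicodedata.normalize("NFKC", ch)
    return unicodedata.normalize("NFKC", nfkc.casefold()) != ch


def is_ascii7(ch: str) -> bool:
    # Printable ASCII, U+0021 through U+007E.
    return 0x21 <= ord(ch) <= 0x7E


def categories(ch: str) -> set:
    """Return the subset of {"A", "B", "K"} that a code point matches."""
    rules = {"A": is_letter_digits, "B": is_unstable, "K": is_ascii7}
    return {name for name, rule in rules.items() if rule(ch)}


for ch in ["A", "a", "-", "\uff21"]:
    print(f"U+{ord(ch):04X}", sorted(categories(ch)))
```

An uppercase letter lands in both "A" and "B", which is the kind of combination the list above is pointing at.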

> Further, regarding not including stable. This implies it is allowed
> to use code points in Precis that are not stable regarding
> normalization and/or case folding. The normalization and/or case
> folding still must be made somewhere before matching is happening.
>
> Let's for example say that "A" and "a" are both valid (which they
> would be). The question is then whether there is case mapping before
> comparison or not, and if there is, it must be ensured that the two
> identities "A" and "a" are not both created in some name space.
>
> I know this is exactly what you have been talking about and
> discussing, but it must be absolutely crystal clear everyone
> understand what this implies. Specifically when case folding (lower
> case) is replaced with NFC or some normalization algorithm.

I think this is handled more directly by the UsernameCaseMapped and 
UsernameCasePreserved profiles of the IdentifierClass specified in 
draft-ietf-precis-saslprepbis.
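
To illustrate the namespace concern, here is a minimal sketch of a registry that stores identifiers under a case-mapped key and refuses a second registration that collides with an existing one. The Registry class and its use of str.casefold() as the case mapping are assumptions for illustration; they are not taken from saslprepbis.

```python
class Registry:
    """Hypothetical account store keyed on the case-mapped identifier."""

    def __init__(self):
        self._by_key = {}

    def register(self, name: str) -> None:
        key = name.casefold()  # assumed case mapping for this sketch
        if key in self._by_key:
            raise ValueError(f"{name!r} collides with {self._by_key[key]!r}")
        self._by_key[key] = name  # the original spelling is kept for display

    def lookup(self, name: str):
        # Lookup applies the same mapping as registration.
        return self._by_key.get(name.casefold())


reg = Registry()
reg.register("Alice")
print(reg.lookup("ALICE"))
```

The essential property is that registration and lookup apply the same mapping, so "A" and "a" can never both exist as separate identities.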

> More about this later.

There is always more to be said about i18n. :-)

>> 4.2.3.  Disallowed
>
> See above.
>
>> Some application technologies need strings that can be used in a
>> free-form way, e.g., as a password in an authentication exchange
>> (see [I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom
>> (see [I-D.ietf-precis-nickname]).  We group such things into a
>> class called "FreeformClass" having the following features.
>>
>> Security Warning: As mentioned, the FreeformClass prioritizes
>> expressiveness over safety; Section 11.3 describes some of the
>> security hazards involved with using or profiling the
>> FreeformClass.
>>
>> Security Warning: Consult Section 11.6 for relevant security
>> considerations when strings conforming to the FreeformClass, or a
>> profile thereof, are used as passwords.
>
> There are very dangerous issues here when using this class for any
> kind of comparison.  Specifically in the case of password and user
> names (or file names)

We actively discourage the use of the FreeformClass (or profiles 
thereof) for user names, file names, and the like.

See however draft-ietf-precis-nickname, where we use a profile of the 
FreeformClass for nicknames (e.g., in a chatroom).

> where it is unclear what kind of normalization
> might happen between "the keyboard" and "the application". I.e. the
> user might really really think they enter a certain code point, but
> in reality what the application see is either NFC(string) or
> NFD(string) and which one might vary on the operating system (or file
> system) in use. Specifically when leaving this undefined.
>
> I am all in favor of leaving this undefined for this class, but then
> it might not be the best to do any kind of matching (including
> searching). Unless some kind of transformation is made for the
> matching/searching.

We might need some more text about these matters in the profiles that 
use the FreeformClass: OpaqueString (in saslprepbis) and Nickname. 
Another topic to fix during WGLC.

> I would recommend the following general rules:
>
> - IdentifierClass is used wherever it is important that the namespace
> includes only globally unique strings, like identifiers for user names
> etc

Agreed.

> - IdentifierClass are also used for passwords

That might not be consistent with the need for entropy in passwords. See 
draft-ietf-precis-saslprepbis on this point.

> and whenever a
> comparison is used,

Ideally, passwords are not stored in the clear, in which case some kind of 
transformation will need to happen before any comparison is done. That's 
why draft-ietf-precis-saslprepbis states:

    In protocols that provide passwords as input to a cryptographic
    algorithm such as a hash function, the client will need to perform
    proper preparation of the password before applying the algorithm,
    since the password is not available to the server in plaintext form.
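
A small sketch of why the client has to prepare before hashing: the same visible password, delivered once precomposed and once decomposed, produces different digests unless it is normalized first. NFC as the preparation step and SHA-256 as the hash are assumptions chosen to keep the example short; neither is meant as a recommendation for password storage.

```python
import hashlib
import unicodedata


def prepare_password(pw: str) -> str:
    # Assumed preparation step for this sketch: NFC only, case preserved.
    return unicodedata.normalize("NFC", pw)


def digest(pw: str) -> str:
    # SHA-256 is a stand-in; real protocols use a proper password hash.
    return hashlib.sha256(pw.encode("utf-8")).hexdigest()


composed = "pa\u00dfw\u00f6rd"     # "ö" as a single code point
decomposed = "pa\u00dfwo\u0308rd"  # "o" followed by COMBINING DIAERESIS

print(digest(composed) == digest(decomposed))
print(digest(prepare_password(composed)) == digest(prepare_password(decomposed)))
```

Because the server only ever sees the digest, it cannot repair a mismatch after the fact.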

> but the transformation should not be
> destructive.

Here again I'm not sure exactly what that means.

> - FreeformClass is used for storage of various things
>
> - Protocols must be stable for FreeformClass in the transport

If you have time, could you expand upon those two points a bit?

>> 4.3.1.  Valid
>
> See comments above regarding combination of A with other categories.
>
>> 5.1.  Profiles Must Not Be Multiplied Beyond Necessity
>>
>>
>> The risk of profile proliferation is significant because having
>> too many profiles will result in different behavior across various
>> applications, thus violating what is known in user interface
>> design as the Principle of Least Astonishment.
>>
>> Indeed, we already have too many profiles.  Ideally we would have
>> at most two or three profiles.  Unfortunately, numerous
>> application protocols exist with their own quirks regarding
>> protocol strings.
>>
>> Domain names, email addresses, instant messaging addresses,
>> chatroom nicknames, filenames, authentication identifiers,
>> passwords, and other strings are already out there in the wild and
>> need to be supported in existing application protocols such as DNS,
>> SMTP, XMPP, IRC, NFS, iSCSI, EAP, and SASL among others.
>>
>> Nevertheless, profiles must not be multiplied beyond necessity.
>>
>> To help prevent profile proliferation, this document recommends
>> sensible defaults for the various options offered to profile
>> creators (such as width mapping and Unicode normalization).  In
>> addition, the guidelines for designated experts provided under
>> Section 9 are meant to encourage a high level of due diligence
>> regarding new profiles.
>
> What are the requirements to create a new Profile?

Rule #1: Don't create new profiles.

Rule #2: Don't ignore Rule #1.

Rule #3: If you push on through the first two rules, provide a strong 
justification for the new profile and answer all the questions under the 
IANA considerations in the framework. Even then, expect a lot of 
pushback from the designated experts.

> Either there are requirements or not. This text above does not add
> much help if there is a conflict in the future regarding request for
> registration of a new Profile. Are you really happy with what is
> above?

I'm happy with the totality of information and requirements in Sections 
5.1, 9, and 10.3 of framework-20.

> The WG must honestly say "yes" to this. If they do, I am happy! :-)

I can't speak for the WG. :-)

>> 5.2.1.  Width Mapping Rule
>>
>>
>> The width mapping rule of a profile specifies whether width
>> mapping is performed on fullwidth and halfwidth characters, and how
>> the mapping is done.  Typically such mapping consists of mapping
>> fullwidth and halfwidth characters, i.e., code points with a
>> Decomposition Type of Wide or Narrow, to their decomposition
>> mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
>> mapped to DIGIT ZERO (U+0030).
>>
>> The normalization form specified by a profile (see below) has an
>> impact on the need for width mapping.  Because width mapping is
>> performed as a part of compatibility decomposition, a profile
>> employing either normalization form KD (NFKD) or normalization
>> form KC (NFKC) does not need to specify width mapping.  However,
>> if Unicode normalization form C (NFC) is used (as is recommended)
>> then the profile needs to specify whether to apply width mapping;
>> in this case, width mapping is in general RECOMMENDED because
>> allowing fullwidth and halfwidth characters to remain unmapped to
>> their compatibility variants would violate the Principle of Least
>> Astonishment.  For more information about the concept of width in
>> East Asian scripts within Unicode, see Unicode Standard Annex #11
>> [UAX11].
>
> Doing this mapping is not easy. I strongly recommend an algorithm is
> presented that either is in use or not by the profile.

We've been trying not to define new algorithms. Is there one we can 
point to elsewhere?
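
For reference while we look for one, the mapping as the quoted text describes it (replace code points whose Decomposition Type is Wide or Narrow with their decomposition mapping) can be sketched in a few lines of Python. This is only an illustration built on the standard unicodedata module, with a function name of my own choosing; it is not an algorithm defined by the framework.

```python
import unicodedata


def width_map(s: str) -> str:
    """Map fullwidth and halfwidth code points to their decomposition mappings."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0030", or ""
        if decomp.startswith(("<wide>", "<narrow>")):
            # Drop the type tag and rebuild the mapped code point(s).
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)  # all other code points pass through unchanged
    return "".join(out)


print(width_map("\uff10"))  # FULLWIDTH DIGIT ZERO
print(width_map("\uff76"))  # HALFWIDTH KATAKANA LETTER KA
```

Unlike NFKC, this touches only the Wide and Narrow decomposition types, so other compatibility characters such as ligatures are left alone.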

>> 5.2.3.  Case Mapping Rule
>>
>>
>> The case mapping rule of a profile specifies whether case mapping
>> is performed (instead of case preservation) on uppercase and
>> titlecase characters, and how the mapping is done (e.g., mapping
>> uppercase and titlecase characters to their lowercase
>> equivalents).
>
> You either apply mapping or not, to _all_ code points. The above make
> it sort of look like if case mapping is sometimes not to be
> performed.

Would we ever apply case mapping to characters other than uppercase and 
titlecase characters? Perhaps you mean only to point out that the rule 
is applied to all code points, but only uppercase and titlecase code 
points are transformed by the rule. If so, the text is easy to adjust.

>> If case mapping is desired (instead of case preservation), it is
>> RECOMMENDED to use Unicode Default Case Folding as defined in
>> Chapter 3 of the Unicode Standard [Unicode7.0].
>>
>> Note: Unicode Default Case Folding is not designed to handle
>> various localization issues (such as so-called "dotless i" in
>> several Turkic languages).  The PRECIS mappings document
>> [I-D.ietf-precis-mappings] describes these issues in greater detail
>> and defines a "local case mapping" method that handles some
>> locale-dependent and context-dependent mappings.
>>
>> In order to maximize entropy and minimize the potential for false
>> positives, it is NOT RECOMMENDED for application protocols to map
>> uppercase and titlecase code points to their lowercase equivalents
>> when strings conforming to the FreeformClass, or a profile
>> thereof, are used in passwords; instead, it is RECOMMENDED to
>> preserve the case of all code points contained in such strings and
>> then perform case-sensitive comparison.  See also the related
>> discussion in [I-D.ietf-precis-saslprepbis].
>
> The above is too complicated.

It is long-winded, but is it more complicated than your terse version below?

> The only realistic way of handling casing is to:
>
> 1. Decide whether there is case insensitive matching to be done or
> not
>
> 2. If it is, case fold to lower case before the matching
>
> 3. If transformation of a string is made, case fold to lower case as
> part of the transformation

If I understand the difference between (2) and (3) correctly, (2) 
applies to case-insensitive matching (e.g., an application might 
preserve case on storing strings but compare two output strings in a 
case-insensitive way), whereas (3) applies to storage of strings in some 
canonical format.

> 4. Do not forget the issues that arise when normalization and case
> folding are both applied to the same string

Yes, I will give more thought to that and see if we can add some 
appropriate text.

>> 5.2.5.  Directionality Rule
>>
>>
>> The directionality rule of a profile specifies how to treat
>> strings containing left-to-right (LTR) and right-to-left (RTL)
>> characters (see Unicode Standard Annex #9 [UAX9]).  A profile
>> usually specifies a directionality rule that restricts strings to
>> be entirely LTR strings or entirely RTL strings and defines the
>> allowable sequences of characters in LTR and RTL strings.  Possible
>> rules include, but are not limited to, (a) considering any string
>> that contains a right-to-left code point to be a right-to-left
>> string, or (b) applying the "Bidi Rule" from [RFC5893].
>
> One can not restrict to only LTR or RTL as some code points are
> neutral regarding directionality.
>
> See RFC5893. This was one of the mistakes in IDNA2003.

I think you're right that the foregoing text is not quite correct.

And in any case, as John Klensin reiterated at the mic in a recent 
PRECIS WG session, if we define something other than the Bidi Rule then 
we'll likely get it wrong. I think that text predates John's comments 
and hasn't been updated.
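
The point about neutral code points is easy to see by listing Bidi classes. The sketch below uses Python's unicodedata.bidirectional(); the helper names and the choice of L, R, and AL as the "strong" set are my own framing for illustration, not the Bidi Rule itself.

```python
import unicodedata

STRONG = {"L", "R", "AL"}  # strongly directional Bidi classes


def bidi_classes(s: str) -> list:
    return [unicodedata.bidirectional(ch) for ch in s]


def has_weak_or_neutral(s: str) -> bool:
    """True if any code point is neither strongly LTR nor strongly RTL."""
    return any(cls not in STRONG for cls in bidi_classes(s))


# Latin letter, Hebrew letter, ASCII digit.
print(bidi_classes("a\u05d01"))
print(has_weak_or_neutral("abc-123"))
```

Even an all-ASCII identifier with digits or punctuation is not "entirely LTR" in this sense, so a rule phrased that way cannot be applied literally.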

>> Mixed-direction strings are not directly supported by the PRECIS
>> framework itself, since there is currently no widely accepted and
>> implemented solution for the safe display of mixed-direction
>> strings.
>
> Define Mixed-Direction strings or else the text will be confusing.

Will do.

>>    username   = userpart *(1*SP userpart)
>>    userpart   = 1*(idbyte)
>>                 ;
>>                 ; an "idbyte" is a byte used to represent a
>>                 ; UTF-8 encoded Unicode code point that can be
>>                 ; contained in a string that conforms to the
>>                 ; PRECIS "IdentifierClass"
>>                 ;
>
> Do not talk about "byte" here but instead "character". So, in the
> grammar, talk about Unicode Code Points. How the Unicode string is
> then encoded (for example UTF-8) is a different issue.

Hmm. All of the new profiles or application usages specify UTF-8. This 
issue of bytes/octets vs. characters caused us problems in XMPP (in part 
because we specify length limits). I'll need to look at it more closely 
before making a change.
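
A quick illustration of why the distinction matters for length limits, using only built-in Python string operations (the sample name is just an example):

```python
name = "F\u00e4ltstr\u00f6m"

code_points = len(name)            # length counted in Unicode code points
octets = len(name.encode("utf-8"))  # length counted in UTF-8 bytes

print(code_points, octets)
```

A limit expressed in octets and a limit expressed in characters admit different sets of strings as soon as any non-ASCII code point appears.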

>> 6.  Order of Operations
>>
>>
>> To ensure proper comparison, the rules specified for a particular
>> string class or profile MUST be applied in the following order:
>
> It must, as I started with saying above, be clear when these
> operations take place. Is it a (destructive) transformation done
> somewhere (application, client side, server side) or just something
> part of a matching algorithm?

Yes, we've tried to address that in profile and application specs.

>> 8.3.  IgnorableProperties (C)
>>
>>
>> This category is defined in Section 2.3 of [RFC5892] but is not
>> used in PRECIS.
>>
>> Note: See the "PrecisIgnorableProperties (M)" category below for a
>> more inclusive category used in PRECIS identifiers.
>
> See comments above.
>
>> 8.7.  BackwardCompatible (G)
>>
>>
>> This category is defined in Section 2.7 of [RFC5892] and is
>> included by reference for use in PRECIS.
>>
>> Note: Because of how the PRECIS string classes are defined, only
>> changes that would result in code points being added to or removed
>> from the LetterDigits ("A") category would result in backward-
>> incompatible modifications to code point assignments.
>
> Are you sure? I am not.

I think we just leave that up to RFC 5892 and remove that note.

>> Therefore, management of this category is handled via the processes
>> specified in [RFC5892].  At the time of this writing (and also at
>> the time that RFC 5892 was published), this category consisted of
>> the empty set; however, that is subject to change as described in
>> RFC 5892.
>
> This is true on the other hand.
>
>> 8.13.  PrecisIgnorableProperties (M)
>>
>>
>> This PRECIS-specific category is used to group code points that
>> are discouraged from use in PRECIS string classes.
>>
>> M: Default_Ignorable_Code_Point(cp) = True or
>> Noncharacter_Code_Point(cp) = True
>>
>> The definition for Default_Ignorable_Code_Point can be found in
>> the DerivedCoreProperties.txt [2] file, and at the time of
>> Unicode 7.0 is as follows:
>>
>>    Other_Default_Ignorable_Code_Point
>>    + Cf (Format characters)
>>    + Variation_Selector
>>    - White_Space
>>    - FFF9..FFFB (Annotation Characters)
>>    - 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters that
>>      should be visible)
>
> I would be very nervous over having explicit code points in a generic
> rule like this. If the code points are to be listed explicitly, add
> them to an exception rule. Otherwise *this* rule have to be changed
> when future changes have to be made that would require otherwise
> exceptions to be added.

That simply repeats what Unicode 7.0 says, but we can remove it to 
prevent developer confusion.
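
By contrast, the other half of rule M, Noncharacter_Code_Point, is a fixed set of 66 code points and can safely be computed directly. A minimal sketch follows (the function name is my own; Python's unicodedata module does not expose this property or Default_Ignorable_Code_Point, so the latter still has to come from DerivedCoreProperties.txt):

```python
def is_noncharacter(cp: int) -> bool:
    """Noncharacter_Code_Point: U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count over the whole code space, U+0000 through U+10FFFF.
total = sum(is_noncharacter(cp) for cp in range(0x110000))
print(total)
```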

>> 8.14.  Spaces (N)
>>
>>
>> This PRECIS-specific category is used to group code points that
>> are space characters.
>>
>> N: General_Category(cp) is in {Zs}
>
> I am still thinking of how "spaces" is handled here. Will check and
> think more when I look at the mapping documents. Specifically when
> you also look at Arabic and other similar scripts. Just do
> destructive transformation to U+0020 is not always working I think.

We have had a lot of discussion about space characters, not really 
captured all that well in the specs. But see Section 5.5:

5.5.  A Note about Spaces

    With regard to the IdentifierClass, the consensus of the PRECIS
    Working Group was that spaces are problematic for many reasons,
    including:

    o  Many Unicode characters are confusable with ASCII space.

    o  Even if non-ASCII space characters are mapped to ASCII space
       (U+0020), space characters are often not rendered in user
       interfaces, leading to the possibility that a human user might
       consider a string containing spaces to be equivalent to the same
       string without spaces.

    o  In some locales, some devices are known to generate a character
       other than ASCII space (such as ZERO WIDTH JOINER, U+200D) when a
       user performs an action like hitting the space bar on a keyboard.

    One consequence of disallowing space characters in the
    IdentifierClass might be to effectively discourage their use within
    identifiers created in newer application protocols; given the
    challenges involved with properly handling space characters
    (especially non-ASCII space characters) in identifiers and other
    protocol strings, the PRECIS Working Group considered this to be a
    feature, not a bug.

    However, the FreeformClass does allow spaces, which enables
    application protocols to define profiles of the FreeformClass that
    are more flexible than any profiles of the IdentifierClass.  In
    addition, as explained in the previous section, application protocols
    can also define application-layer constructs containing spaces.
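
For concreteness, rule N and the "map to U+0020" idea can be sketched as below. The function names are hypothetical, and the mapping is shown only to illustrate the discussion, not as something the framework requires of every profile.

```python
import unicodedata


def is_space(ch: str) -> bool:
    # Rule N: General_Category(cp) is in {Zs}
    return unicodedata.category(ch) == "Zs"


def map_spaces(s: str) -> str:
    """Replace every Zs code point with ASCII space (U+0020)."""
    return "".join(" " if is_space(ch) else ch for ch in s)


print(is_space("\u00a0"))   # NO-BREAK SPACE
print(is_space("\u200d"))   # ZERO WIDTH JOINER
print(map_spaces("a\u3000b") == "a b")
```

Note that ZERO WIDTH JOINER, mentioned in the third bullet, has General_Category Cf rather than Zs, so rule N does not catch it; that case has to be handled by other rules.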

>> 8.17.  HasCompat (Q)
>>
>>
>> This PRECIS-specific category is used to group code points that
>> have compatibility equivalents as explained in Chapter 2 and
>> Chapter 3 of the Unicode Standard [Unicode7.0].
>>
>> Q: toNFKC(cp) != cp
>>
>> The toNFKC() operation returns the code point in normalization
>> form KC.  For more information, see Section 5 of Unicode Standard
>> Annex #15 [UAX15].
>
> Think about implications of this together with case folding (or
> not).

I shall. :-)
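
As a starting point for that, rule Q is straightforward to express with Python's unicodedata module (the function name is my own, and results depend on the Unicode version bundled with the interpreter):

```python
import unicodedata


def has_compat(ch: str) -> bool:
    # Rule Q: toNFKC(cp) != cp
    return unicodedata.normalize("NFKC", ch) != ch


for ch in ["a", "\u00e9", "\uff10", "\ufb01", "\u212b"]:
    print(f"U+{ord(ch):04X}", has_compat(ch))
```

ANGSTROM SIGN (U+212B) is a useful test case for the interaction with case folding: NFKC maps it to U+00C5, while case folding maps it to U+00E5.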

Thanks again for the review. I'll work to update the spec soon.

Peter

-- 
Peter Saint-Andre
CTO @ &yet
https://andyet.com/