Re: [I18ndir] I18ndir last call review of draft-ietf-regext-dnrd-objects-mapping-06

John C Klensin <john-ietf@jck.com> Fri, 06 March 2020 15:45 UTC

Date: Fri, 06 Mar 2020 10:45:30 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Asmus Freytag (c)" <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <19196892ADC7F5919DA7CE7A@PSB>
In-Reply-To: <b10e418c-aa00-669d-68cf-03bb0ef0920b@ix.netcom.com>
References: <158343520135.15044.10991712449156105132@ietfa.amsl.com> <9CD56DEFBC9108D9620ED61E@PSB> <2cb9e78f-32dc-3e2f-ba1a-6ae0218f3ef9@ix.netcom.com> <78B490AE833098E23541E672@PSB> <b10e418c-aa00-669d-68cf-03bb0ef0920b@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/t6fjlSkmKqwtgqEP9dIzWqSpWbw>
Subject: Re: [I18ndir] I18ndir last call review of draft-ietf-regext-dnrd-objects-mapping-06

Asmus,

Sorry -- I read and responded to Marc's note before seeing yours
and Patrik's.   I think we are in complete agreement, but see
comments inline below.

--On Friday, March 6, 2020 00:40 -0800 "Asmus Freytag (c)"
<asmusf@ix.netcom.com> wrote:

> On 3/5/2020 9:46 PM, John C Klensin wrote:
>> Asmus,
>> 
>> --On Thursday, March 5, 2020 16:14 -0800 Asmus Freytag
>> <asmusf@ix.netcom.com> wrote:
>> 
>>> On 3/5/2020 12:47 PM, John C Klensin wrote:
>>> 
>>>> In addition, given the increasing trend (on the web at
>>>> least) to do exactly what TUS says to do, which is to
>>>> normalize only at comparison time rather than trying to
>>>> carry strings around in normalized form, the application of
>>>> that attribute to almost all text-type values may be
>>>> inappropriate. I can find no evidence in the I-D that those
>>>> issues were considered; the document should not progress
>>>> until they are.
>>> Right, there are many other contexts where it makes sense to
>>> keep the data as submitted and not to force normalization.
>>   
>>> However, this does not appear to be one of them.
>> For the many fields that do not appear to be IDN labels, why
>> not?  From skimming the document, I'd assume that the IDN
>> labels should be required to be in NFC but that a
>> normalization requirement is probably inappropriate for most
>> other fields, especially those fields that appear to be free
>> text descriptions.  Which ones probably requires a
>> field-by-field analysis.
>> 
>>> We may need a statement specific to IETF that captures the
>>> suggested policy.
>> Indeed.
>> 
> I did not do a full review; I just checked some of the points
> that others had noted against the original.
> 
> That means I have not looked into the details of all the text
> fields. As a matter of policy, if we were to develop one to
> help guide us in our reviews, we may perhaps partition:

I think not just our reviews, but the advice we give to WGs
developing protocols and data formats.  That, of course, gets
back to the question as to whether (and when) it will be
appropriate to take the next steps beyond statements like "just
use UTF-8" and "just normalize everything" [1] and toward "don't
normalize (or tamper with case) unless you are actually about to
compare or there is a particular protocol-specific reason to do
so" and "don't use or allow anything other than UTF-8 unless
there is good reason to do so and you can document that reason".
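
As a rough illustration of "normalize only when you are about to
compare", here is a minimal Python sketch.  The function name
equal_for_comparison and the default choice of NFC are assumptions
made for the example; nothing in it comes from the I-D.

```python
import unicodedata


def equal_for_comparison(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings under a normalization form.

    Neither input is modified; normalization happens only for the
    duration of the comparison.  (Hypothetical helper, for illustration.)
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The stored value stays exactly as it was submitted.
stored = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
query = "Caf\u00e9"     # precomposed U+00E9

raw_equal = stored == query                          # code point comparison
normalized_equal = equal_for_comparison(stored, query)
```

The two strings differ code point by code point but match once both
sides are normalized, and the stored value is never rewritten.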

> (1) IDN U-labels
> 
> (2) Free text (description, comment)
> 
> (3) Other, protocol relevant
 
> It makes sense to require (1) to be in the format specified by
> IDNA (NFC and lowercase), and likewise it does not make sense
> for (2) to be normalized.

Right.   See also my note to Marc.

To try to keep our language precise, for IDN U-labels, the
format is not only a matter of making sense.  By the definition
-- which was written very carefully and intentionally -- if a
string is not NFC and lowercase (and conformant to most other
IDNA requirements) it is simply not a U-label.  That is the
reason why parts of the IDNA2008 specs are sprinkled with terms
like "putative U-label".
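
To make the "NFC and lowercase" part concrete, here is a minimal
Python sketch of a pre-check for a putative U-label.  It tests only
those two necessary conditions, approximates "lowercase" with
str.lower(), and does not attempt the other IDNA2008 requirements
(code point validity, contextual rules, Bidi, length, and so on).
The function name is hypothetical.

```python
import unicodedata


def passes_nfc_lowercase_precheck(label: str) -> bool:
    """Return True if the label is already in NFC and is unchanged by
    lowercasing.  Passing this check does NOT make a string a U-label;
    failing it means the string cannot be one.
    """
    is_nfc = unicodedata.normalize("NFC", label) == label
    is_lowercase = label.lower() == label
    return is_nfc and is_lowercase


ok = passes_nfc_lowercase_precheck("b\u00fccher")            # NFC, lowercase
upper = passes_nfc_lowercase_precheck("B\u00fccher")         # uppercase 'B'
decomposed = passes_nfc_lowercase_precheck("bu\u0308cher")   # not in NFC
```

A string that fails the pre-check is at best a putative U-label; one
that passes still needs the rest of the IDNA2008 tests.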

But, as far as I can tell from another quick pass through this
document, your case (1) does not exist at all or may be
violated: domain objects, as described in Section 5.1.1.1, use
A-labels and not U-labels; "U-label" does not appear in the
document at all.  However, NNDN elements (there and elsewhere)
are Unicode strings and, after reading the definition of NNDN in
Section 1 a couple of times, I can't tell whether their
components are expected to be IDNA2008-valid U-labels or not.
IMO, it is another deficiency in this I-D that this is not
clear: either U-labels must be required or there must be more
explanation of why not.

> Strings that may be relevant as part of some protocol, but are
> not IDN U-labels, would be RECOMMENDED to be in whatever
> format the protocol sets out for them. (There's a small
> benefit to being able to use non-protocol-aware tools to do
> things like search data sets, and therefore it's preferable to
> have the data in the expected form for easy comparison,
> even if the protocol describes its own preprocessing step.)

Right.  However, there is a separate recommendation for the
protocols themselves.  We should try to minimize the number
of different preferred forms floating around.  For your case (2)
there is no preferred form.  But for the third case, and
especially because most of the people designing protocols and
implementing libraries that are not i18n-specific cannot be
expected to be i18n or Unicode experts and because we have a
great deal of experience with strings "leaking" between protocol
contexts, having many different preferred formats out there is
an invitation to confusion and bugs.  The tradeoffs may need to
be evaluated carefully on a case-by-case basis, but simplicity
in the number of options and formats may trump optimality for
the needs of a particular protocol.

> To give an example: Unicode allows for character and property
> names to elide spaces with or without camel case or to use -
> or _ as separators. However, in the context of the standard
> as much as possible only one of these formats will be used
> (single case with space as separator for character names and
> _ separator for properties).

Hmm.  Thanks, I just learned something because I had not studied
(or had forgotten) that set of rules.  [2]
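
For anyone else who had not looked at those rules, the idea of
"accept several spellings, compare on a folded key" can be sketched
in Python as below.  This is a simplification written for this note:
it ignores case, whitespace, hyphens and underscores only, and leaves
out the extra details of Unicode's actual loose-matching rules in
UAX #44.  The function names are hypothetical.

```python
import re


def loose_key(name: str) -> str:
    """Fold a name to a comparison key by removing whitespace, hyphens
    and underscores and lowercasing.  Simplified; not the full UAX #44
    rule set.
    """
    return re.sub(r"[\s_\-]", "", name).lower()


def same_name(a: str, b: str) -> bool:
    """True if two spellings fold to the same comparison key."""
    return loose_key(a) == loose_key(b)


key = loose_key("General_Category")
match_spaces = same_name("General_Category", "general category")
match_camel = same_name("General_Category", "GeneralCategory")
mismatch = same_name("General_Category", "Script")
```

The text of a standard can still stick to one spelling throughout;
the folding is needed only when comparing what others have written.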

 
> While everybody knows how to compare the values, it helps
> the reader to know that the text of the standard follows a
> uniform convention.
> 
> If a protocol allows multiple forms, as Unicode does for its
> identifiers, it would still be good form to pick a canonical
> representation for such protocol-relevant items and as
> reviewers we should have a policy on that.

Yes.  And I think it is more than "as reviewers".  As discussed
in earlier notes, we may be getting to the point that it is time
for us to recommend, and the IETF to establish, some clear
guidance on these subjects so that we minimize the number of
situations in which WGs develop work and then we come along
during IETF Last Call and tell them they messed up.  Patrik may
be right that we need more reviews and responses before we are
ready for that, but my sense is that we are fairly close.

best,
   john


[1]  I am not familiar enough with the Regext set of documents
to know if "normalizedString" is defined somewhere.  Certainly,
draft-ietf-regext-dnrd-objects-mapping-06 uses it without
defining it locally or providing a specific normative reference
to such definitions.  At minimum, this suggests another problem
with the I-D that may need fixing.  But, if there is not a clear
definition -- if the instruction for what should go into those
fields is essentially "just normalize" -- then that is obviously
bad news.  It is bad news even for IDN U-labels because a string
in, e.g., NFD would not conform to the requirements for
U-labels.  Normal IETF and RFC practice requires a statement,
probably in the Terminology section and probably with a
normative reference, to where definitions of that and other
attributes can be found.

[2] It might be of interest that the reason we don't have "_"
in LDH-style domain names is that it was prohibited in the
original host name syntax.  And it was prohibited
there precisely because of a human factors argument: when
reading the handwriting of others, people cannot reliably
distinguish between "-" and "_", so it is best to allow one and
not the other.  And the hyphen was chosen in part because the
most common grapheme for the BCD character that evolved into "_"
in ASCII (and EBCDIC) was a left-pointing arrow, not a
horizontal bar at the baseline.  So the ARPANET decided to go
with a single form at the codepoint level while the Unicode
decision was to allow different forms and canonicalize later.
Of course, we have a different variation on the Unicode decision
with "FWS" and "CFWS" constructions in some well-known protocols.