Re: [I18ndir] I18ndir last call review of draft-ietf-regext-dnrd-objects-mapping-06

John C Klensin <john-ietf@jck.com> Fri, 06 March 2020 19:21 UTC

Date: Fri, 06 Mar 2020 14:21:07 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Asmus Freytag (c)" <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <91043CCAA53599B4127C94D9@PSB>
In-Reply-To: <2e0fb274-95e6-0ad5-29f2-36df1f0ab5d4@ix.netcom.com>
References: <158343520135.15044.10991712449156105132@ietfa.amsl.com> <9CD56DEFBC9108D9620ED61E@PSB> <2cb9e78f-32dc-3e2f-ba1a-6ae0218f3ef9@ix.netcom.com> <78B490AE833098E23541E672@PSB> <b10e418c-aa00-669d-68cf-03bb0ef0920b@ix.netcom.com> <19196892ADC7F5919DA7CE7A@PSB> <2e0fb274-95e6-0ad5-29f2-36df1f0ab5d4@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/FQudqgB5IZLcWAsK-_lAh2H7k_0>


--On Friday, March 6, 2020 08:42 -0800 "Asmus Freytag (c)"
<asmusf@ix.netcom.com> wrote:

> On 3/6/2020 7:45 AM, John C Klensin wrote:
>> [2] It might be of interest that we don't have "_" in
>> LDH-style domain names because it was prohibited in the
>> original host name syntax.  And it was prohibited there
>> precisely because of a human factors
>> argument: when reading the handwriting of others, people
>> cannot reliably distinguish between "-" and "_", so it is
>> best to allow one and not the other.  And the hyphen was
>> chosen in part because the most common grapheme for the BCD
>> character that evolved into "_" in ASCII (and EBCDIC) was a
>> left-pointing arrow, not a horizontal bar at the baseline.
>> So the ARPANET decided to go with a single form at the
>> codepoint level while the Unicode decision was to allow
>> different forms and canonicalize later. Of course, we have a
>> different variation on the Unicode decision with "FWS" and
>> "CFWS" constructions in some well-known protocols.
> 
> Unicode's "loose matching" rules are designed to help these
> identifiers to be embedded as needed in other protocols, where
> there are native restrictions. Unicode does not provide an
> exhaustive list of what those other protocols might be, but
> it's been clear that they include, for example, programming
> languages.
> 
> By making sure identifiers stay unique without requiring
> literal hyphens, spaces or underlines, this can be accommodated.
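
As a rough illustration of the idea in the quoted text (a sketch
only, not the normative Unicode rules), a loose comparison key can be
formed by ignoring case, spaces, underscores and hyphens; a set of
identifiers is then safe to embed elsewhere as long as no two of them
collapse to the same key.  The function names `loose_key` and
`build_index` are hypothetical, and the matching rules in UAX #44
have further details and exceptions that are not modeled here.

```python
def loose_key(name: str) -> str:
    """Simplified loose-matching key: ignore case, spaces, '_' and '-'."""
    return "".join(ch for ch in name.casefold() if ch not in " _-")


def build_index(names):
    """Map loose keys to names, rejecting names that collide loosely."""
    index = {}
    for name in names:
        key = loose_key(name)
        if key in index and index[key] != name:
            raise ValueError(f"{name!r} collides with {index[key]!r}")
        index[key] = name
    return index


# Spelling variants of the same identifier map to a single key.
index = build_index(["Old_Italic", "Latin"])
print(index[loose_key("old-italic")])
print(index[loose_key("OLD ITALIC")])
```

With identifiers that stay unique under such a key, a protocol that
cannot carry hyphens or spaces can use whichever spelling it permits.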

Interesting and helpful.  Of course, the tradeoff between
defining a canonical form and using and carrying it around as
much as possible (certainly between systems and in archival
formats) versus allowing local forms and converting to canonical
ones only when necessary is an ancient one that won't get
settled today.  In Multics, canonicalization was expected to
occur in the terminal drivers, so it was hard to even enter a
non-canonical form string. 
 
> I was simply speculating that for any 'new' protocol that
> collects data that have a defined format in other protocols,
> it may make sense to not force these data to be normalized
> or canonicalized different from what the external protocol
> expects.
> 
> If external protocols allow multiple formats, then it may make
> sense to pick one canonical form for the 'new' protocol, just
> so data in the new protocol are simplified.
> 
> It depends a bit on the use case. A protocol that's a transient
> container of data transmitted between implementations
> of an external protocol is a different matter from a database
> that collects information from potentially different sources,
> but could be expected to have some internal consistency.

Right.  There is another issue which might be relevant to a
record-keeping (or archival) storage format like this one and
would be different from our "on the wire" expectations.   It
takes us back to the reason the Punycode algorithm was
specified.   If one expects that most or all of the code points
one will have to deal with will come out of the BMP, UTF-16 may
actually make more sense than UTF-8.  Part of the motivation for
the Punycode work is that UTF-8 takes up more space for East
Asian scripts than it does for ASCII.    For a file or stream
that is mostly ASCII but with some other characters, UTF-8 is
usually still a win.  But, for a file that is mostly in, say,
Han characters, UTF-16 will typically be much more compact and
will eliminate most or all of the decoding problems associated
with UTF-8.  I still don't think that justifies "use UTF-8 or
whatever you like", as the I-D appears to say, but it may be a
tad less obvious that UTF-8 is the right answer here than it is
for data that are expected to be transferred regularly over the
network.
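
To make the size argument concrete, here is a minimal sketch that
counts the bytes needed for a few strings in each encoding.  The
sample strings and the helper name `encoded_sizes` are mine, chosen
only for illustration; UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so the 2-byte byte order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {
    "ascii": "example.com",
    "han": "中文域名测试",
    "mixed": "example 中文",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

Each ASCII character takes one byte in UTF-8 and two in UTF-16, while
each BMP Han character takes three bytes in UTF-8 and two in UTF-16,
so which encoding is smaller depends on the mix.  Code points outside
the BMP take four bytes in both.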

> ps: if/when we get around to writing this down, the language
> will need to become much more precise than these musings.

Of course.  Always.

best,
  john