Re: [I18ndir] I-D on filesystem I18N

Nico Williams <nico@cryptonector.com> Wed, 08 July 2020 01:29 UTC

Date: Tue, 07 Jul 2020 20:29:45 -0500
From: Nico Williams <nico@cryptonector.com>
To: Asmus Freytag <asmusf@ix.netcom.com>
Cc: i18ndir@ietf.org
Message-ID: <20200708012944.GY3100@localhost>
References: <20200706225139.GJ3100@localhost> <90740541-ab72-ffaf-ff3e-5a27b5805eae@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/QxR6sUlDu8d0QZOk0rSxj0JQGOM>

On Tue, Jul 07, 2020 at 04:43:42PM -0700, Asmus Freytag wrote:
> On 7/6/2020 3:51 PM, Nico Williams wrote:
> > I've submitted draft-williams-filesystem-18n-00.
> 
> Here are my comments:

Thanks!

> >     This document describes requirements for internationalization (I18N)
> >     of filesystems specifically in the context of Internet protocols, the
> >     architecture for filesystems in most currently popular general
> >     purpose operating systems, and their implications for filesystem
> >     I18N.  From the I18N requirements for filesystems and the
> 
> The first sentence doesn't scan - the constructions joined by "and"
> seemingly are not parallel enough.

I think the "their" in "and their implications" is unclear as to
antecedent.  I'll wordsmith it.

> >     [TBD: Add references galore.  How to reference Unicode?  How to
> 
> If you go on the Unicode site and look for the page on the latest version
> (currently https://www.unicode.org/versions/Unicode13.0.0/), you can
> navigate to suggested ways to reference. I would cite the latest version as
> of the time of drafting of the I-D and then also write: "the latest version
> is available at" and give the URL as http://www.unicode.org/versions/latest/.

The question was about bibxml specifically, though I should have made
that clear.  I failed to update that comment -- I did figure out how to
reference Unicode, and did reference it.

I do still need to add a number of other references.  And for the two
books I referenced I should add section numbers.

> If you need to reference specifically the properties, the character database
> is at https://www.unicode.org/Public/13.0.0/ or 
> https://www.unicode.org/Public/UCD/latest/ -- You could also link to UAX#44,
> which is the overview of the Unicode Character Database. The "latest" is at:
> https://www.unicode.org/reports/tr44/ and the current one, as of today, is
> at https://www.unicode.org/reports/tr44/tr44-26.html (see the "This version"
> link for the latest).

In XML terms... you're saying I should use <eref>, I think.

> >   To deal with the equivalence problem, Unicode defines Normal Forms
> Unicode calls these "Normalization Forms", see Section 3.11 in TUS (The
> Unicode Standard) so that is what should be used when capitalized. Ditto for
> the formal names for NFC and NFD.
> 
> I think you need to mention already here that NFC/NFD represent a semantic
> identity (and generally identical appearance) while NFKC and NFKD abstract
> away rather noticeable differences in appearance that may in some cases
> imply strong semantic differences to some users (e.g. math alphabets).

I chose to do so one page later because I wanted to keep the intro
brief:

|   Unicode compatibility equivalence allows equivalence between
|   different representations of the same abstract character that may
|   nonetheless have different visual appearance of behavior.  There are
|   two canonical forms that support compatibility equivalence: NFKC and
|   NFKD.  Using NoCL with NFKC or NFKD may be surprising to users in a
|   visual way.  While form-insensitivity with NFKC or NFKD may surprise
|   users who might consider two file names distinct even when Unicode
|   considers them equivalent under compatibility equivalence.  The
|   latter seems less likely and less surprising, though that is an
|   entirely subjective judgement.

Here "NoCL" == normalize-on-CREATE-and-LOOKUP, defined just above the
quoted text.

Does that work for you?

> >   Unicode compatibility equivalence allows equivalence between
> >     different representations of the same abstract character that may
> >     nonetheless have different visual appearance of behavior.  There are
> >     two canonical forms that support compatibility equivalence: NFKC and
        ^^^^^^^^^^^^^
> >     NFKD.  Using NoCL with NFKC or NFKD may be surprising to users in a
> >     visual way.  While form-insensitivity with NFKC or NFKD may surprise
> >     users who might consider two file names distinct even when Unicode
> >     considers them equivalent under compatibility equivalence.  The
> >     latter seems less likely and less surprising, though that is an
> >     entirely subjective judgement.
>
> You really MUST not use the term "canonical" with NFKC and NFKD because for
> Unicode, the two forms NFC and NFD are considered "canonical" and that term
> is used contrastively to "compatibility".

The word 'canonical' was preceded by 'two' -- I think that's pretty
clear!  :)

> There's an "of" that should be an "or", but it's not just behavior - it's
> also meaning. Mathematicians would object strenuously to having their math
> alphabets "normalized" to standard A-Z.

Ay, yes, thanks!
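
To make the canonical-versus-compatibility distinction concrete, here is
a minimal sketch using Python's standard `unicodedata` module.  The
sample strings are my own choice for illustration, not taken from the
draft.

```python
import unicodedata

# U+1D400 MATHEMATICAL BOLD CAPITAL A: compatibility-equivalent to "A".
math_bold_a = "\U0001D400"
# "e" + U+0301 COMBINING ACUTE ACCENT: canonically equivalent to U+00E9.
decomposed_e_acute = "e\u0301"

# Canonical normalization unifies the two spellings of e-acute...
nfc_e = unicodedata.normalize("NFC", decomposed_e_acute)
# ...but leaves the math letter untouched.
nfc_math = unicodedata.normalize("NFC", math_bold_a)

# Compatibility normalization additionally folds the math letter to "A",
# which is the change a mathematician would object to.
nfkc_math = unicodedata.normalize("NFKC", math_bold_a)

print(nfc_e == "\u00e9", nfc_math == math_bold_a, nfkc_math == "A")
```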

> NFK(C/D) is useful in a different way: if it is used to disallow code points
> that aren't stable under these normalization forms, then one sidesteps the
> issue of whether the distinct appearance etc. is meaningful, but without
> ever changing a filename (which would happen if it was normalized when
> stored). (There may be no file-systems that take this approach, nevertheless
> it's worth discussing as it is used in other naming schemes).

ZFS (in Solaris and most ports, but not OS X) supports form-insensitivity,
and lets the user choose which form to use for it.

See https://docs.oracle.com/cd/E71909_01/html/E71919/gpssl.html which
shows that NFC, NFD, NFKC, and NFKD are all available.

It doesn't make much sense to offer NFC or NFKC for this given that the
behavior is form-insensitive and form-preserving (NFC is specified as
requiring canonical decomposition as a first step, which means that NFC
is very likely slower than NFD, and the same applies to NFKC and NFKD).
But whatever.
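
Roughly, form-insensitive and form-preserving behavior can be sketched
like this.  It is a toy in-memory model in Python, not how ZFS actually
implements it; the `Directory` class and `form_key` helper are
hypothetical names of mine.

```python
import unicodedata


def form_key(name, form="NFD"):
    """Comparison key: the name normalized to the chosen form."""
    return unicodedata.normalize(form, name)


class Directory:
    """Toy directory: form-insensitive on lookup, form-preserving on listing."""

    def __init__(self, form="NFD"):
        self.form = form
        self._entries = {}  # normalized key -> name exactly as created

    def create(self, name):
        key = form_key(name, self.form)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name  # keep the caller's form unchanged

    def lookup(self, name):
        return self._entries[form_key(name, self.form)]

    def listing(self):
        return list(self._entries.values())


d = Directory()
d.create("caf\u00e9")            # created from precomposed (NFC-ish) input
found = d.lookup("cafe\u0301")   # looked up with decomposed (NFD) input
```

Normalization is only ever applied to the comparison key, so the choice
of form affects cost but not which names are visible to the user.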

In at least the original port to OS X, before Apple abandoned it, ZFS
was made to normalize-on-CREATE (and LOOKUP) instead of being
form-preserving.  This was probably done to match HFS+'s behavior.  I
remember a post by Linus Torvalds complaining bitterly about this and
blaming ZFS in general rather than Apple, when some user complained that
Git did not handle this case correctly.  The issue was that Git looked
for the file using the name it knew from the repository, and that was in
NFC because of course the input mode used when the file was committed
produced something NFC-ish, but on HFS+ this got converted to NFD, and
Git then failed to find the file by memcmp() in the directory listing.
Ironically, that very possibility had been my motivation for pushing for
form-insensitive and form-preserving behavior in ZFS in Solaris to begin
with, but Apple engineers insisted on the HFS+ behavior in their port.
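
The failure mode is easy to reproduce in miniature.  This is a sketch
only: plain NFD stands in for whatever HFS+ actually does, the file name
is made up, and `same_name` is a hypothetical helper, not Git's code.

```python
import unicodedata

# Name as recorded in the repository: precomposed, i.e. NFC.
committed_name = "caf\u00e9.txt"

# What a normalize-on-CREATE (to NFD) filesystem hands back in readdir().
on_disk_name = unicodedata.normalize("NFD", committed_name)
directory_listing = [on_disk_name.encode("utf-8")]

# memcmp()-style search: exact byte equality, which misses the file.
found_bytewise = committed_name.encode("utf-8") in directory_listing


def same_name(a, b):
    """Compare two UTF-8 names after normalizing both to one form."""
    na = unicodedata.normalize("NFC", a.decode("utf-8"))
    nb = unicodedata.normalize("NFC", b.decode("utf-8"))
    return na == nb


# Normalization-aware search: finds the file.
wanted = committed_name.encode("utf-8")
found_normalized = any(same_name(wanted, entry) for entry in directory_listing)

print(found_bytewise, found_normalized)
```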

Apple's choice of NFD makes sense for performance reasons since it's
faster to normalize to form D than to form C (see above).  But
normalizing on CREATE to NFD conflicts with input modes that tend to
produce pre-composed character codepoints.  Form-insensitivity is the
obvious compromise.  Until the Git incident, I had no evidence of that
conflict actually causing problems.

Git nowadays contains normalization code to deal with this, whereas it
would not have needed any if we'd had consensus on form-insensitivity a
few years before we added it to ZFS.  Bummer.  Still, form-insensitivity is
clearly the better behavior, though I'm not proposing that we mandate
it.

> >     foldings are defined by Unicode.  Generally, case-insensitive
> >     filesystems preserve original case just form-insensitive filesystems
> >     preserve original form.
>
> There's an "as" missing.

Thanks :)
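
The case analogue of form-preservation looks much the same.  A minimal
sketch, assuming Python's built-in `str.casefold()` as the folding (the
draft deliberately does not pick one); `case_key`, `create`, and `lookup`
are hypothetical names.

```python
def case_key(name):
    """Comparison key using Python's built-in case folding."""
    return name.casefold()


entries = {}  # folded key -> name exactly as first created


def create(name):
    key = case_key(name)
    if key in entries:
        raise FileExistsError(name)
    entries[key] = name  # original case is preserved


def lookup(name):
    return entries[case_key(name)]


create("ReadMe.txt")
stored = lookup("README.TXT")  # matches, and returns the original case

# Folding is not the same as lower-casing: these match only under casefold().
fold_match = case_key("Stra\u00dfe") == case_key("STRASSE")
lower_match = "Stra\u00dfe".lower() == "STRASSE".lower()
```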

> However, early case-insensitive file systems did not preserve case. Not sure
> how rare this has become.

So.. sentence-case?  I refer to that later:

|   The only way to implement I18N behaviors in the VFS layer rather than
|   at the filesystem is to abandon form- and case-preserving behaviors.
|   For case-insensitivity this would require using sentence-case, or all
|   lower-case, perhaps, and all such choices would surely be surprising
|   to users.  At any rate, that approach would also render much running
|   code "non-compliant" with any Internet filesystem protocol I18N
|   specification.

> >     listings that work the same way as on the server.  We do not specify
> >     any case foldings here.  Instead we will either create a registry of
>
> the "here" is unclear. Does it refer to recommendation in this ID relevant
> to caching clients? If so, link to section.

'here' == "this document".  I'll clarify.

> >    "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE  <https://tools.ietf.org/html/draft-williams-filesystem-18n-00#ref-UNICODE>]), with no
> >     attempt at normalization or case folding done anywhere in between.
>
> In Unicode parlance: "just use strings of code units".

"just-use-8" means strings of 8-bit words that could be... any 8-bit
codeset, or UTF-8, or whatever.

"just-use-16" is almost certainly "as in UCS-2" or "as in UTF-16",
though for all I know there may be some non-Unicode 16-bit wchar_t
codesets in use as "just-use-16".

> Specifically for UTF-8, this would imply that there are also no guarantees
> of well-formedness of the UTF-8 strings (likewise for surrogate pairs in
> UTF-16).

Well, yes.  But that wasn't what I was alluding to.  I was referring to,
e.g., using Linux (or *BSD, or Solaris/Illumos) in, say, an ISO-8859-1
locale.  In that case you might have an application call `creat(2)` with
a `pathname` that contains non-ASCII that is also not UTF-8.

The C library system call stub will *not* convert this to UTF-8 or
anything else, nor will the kernel-land system call know anything about
the locale in use in user-land.  The given `pathname` string will just
be a pointer to a zero-terminated array of `char` in user-land, and will
be copied into the kernel as such by the kernel-side of the system call,
then it will be passed to the VFS.

The VFS will interpret just two byte values specially: 0x00 (ASCII NUL)
and 0x2F (ASCII '/').  The filesystems will generally only interpret
0x00 specially, though some (e.g., ZFS) may reject strings that are not
valid UTF-8.
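
As a rough model of the above (a Python sketch, not kernel code;
`split_path` and `is_valid_utf8` are hypothetical helpers and the path is
made up):

```python
def split_path(path):
    """Split a byte-string path, treating only 0x00 and 0x2F as special."""
    if b"\x00" in path:
        # In C the string would simply end at the first NUL byte.
        raise ValueError("embedded NUL byte")
    return [component for component in path.split(b"/") if component]


def is_valid_utf8(component):
    """The extra check a UTF-8-only filesystem might apply to a name."""
    try:
        component.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A name typed in an ISO-8859-1 locale: e-acute is the single byte 0xE9.
latin1_path = "/tmp/caf\u00e9".encode("iso-8859-1")
components = split_path(latin1_path)

# Every other byte passes through untouched; only the UTF-8 check objects.
validity = [is_valid_utf8(component) for component in components]
```

With just-use-8 the second component is stored as given; a filesystem
that insists on UTF-8 would reject it instead.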

We call this "just-use-8" in some circles (e.g., in the KITTEN WG).  I
don't recall who coined that.

I expect something similar happens in Windows for both the *A()
(just-use-8?) and *W() (just-use-16?) functions.  But I'm not
sufficiently experienced a Windows programmer to really know.

"just-use-16" is my coinage as an obvious variation of "just-use-8".

> -- reached end of section 1 and out of time slot --

Thanks so much!

Nico
--