Re: [I18ndir] I-D on filesystem I18N

Nico Williams <nico@cryptonector.com> Wed, 08 July 2020 03:09 UTC

Date: Tue, 07 Jul 2020 22:09:33 -0500
From: Nico Williams <nico@cryptonector.com>
To: Asmus Freytag <asmusf@ix.netcom.com>
Cc: i18ndir@ietf.org
Message-ID: <20200708030932.GD3100@localhost>
References: <20200706225139.GJ3100@localhost> <90740541-ab72-ffaf-ff3e-5a27b5805eae@ix.netcom.com> <20200708012944.GY3100@localhost> <de2ff6a1-7437-96f1-3281-24b41522192b@ix.netcom.com>
In-Reply-To: <de2ff6a1-7437-96f1-3281-24b41522192b@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/VLv7Z16GgKSgOFW0D7x6RHzL_0I>
Subject: Re: [I18ndir] I-D on filesystem I18N

On Tue, Jul 07, 2020 at 07:16:01PM -0700, Asmus Freytag wrote:
> On 7/7/2020 6:29 PM, Nico Williams wrote:
> > On Tue, Jul 07, 2020 at 04:43:42PM -0700, Asmus Freytag wrote:
> > > On 7/6/2020 3:51 PM, Nico Williams wrote:

...

> We agree that forcing normalization using either NFKx can surprise users in
> a bad way (although for some subset of the equivalences, some subset of
> users might like form-insensitivity: full/half-width character variations
> are often treated very similarly to case variation for East Asian users).

Yes.

> Should a file system allow a form-sensitive "no break hyphen" ? That one is
> worth discussing as a downside to the "just-8" type approach. It would be
> annoying for users having to guess the invisible no-break nature. (This
> would belong in some section on i18n pitfalls).

"just-use-8" is a reality of running code, not something I'm proposing.

String preparation on CREATE and LOOKUP, and form-insensitive/preserving
behaviors are both ways to deal with the "just-use-8" reality.

Mapping NON-BREAKING HYPHEN U+2011 to HYPHEN-MINUS U+002D is certainly
an option.

If ZFS and/or HFS+ and/or some other filesystems don't already do this
(I believe they don't), it will be difficult to make them start doing
it, since U+2011 is not considered equivalent to U+002D in any NF, and
changes like this generally can only be enabled in new filesystems.
However, it should be possible to make ZFS map it in _new_ filesystems
as an option (possibly enabled by default).

(Note: in ZFS terminology, a filesystem is known as a 'dataset'.)

That's not the only such mapping worth considering, of course.  There
are plenty of others, such as various whitespace and period characters.
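
As a rough sketch of what such a mapping step could look like when
preparing a name for CREATE or LOOKUP (the table entries and the name
`prepare_name` are illustrative assumptions on my part, not anything ZFS
or HFS+ does today):

```python
import unicodedata

# Hypothetical mapping table: entries are examples only, not a proposal.
EXTRA_MAPPINGS = {
    0x2011: 0x002D,  # NON-BREAKING HYPHEN -> HYPHEN-MINUS
    0x00A0: 0x0020,  # NO-BREAK SPACE -> SPACE
    0x2024: 0x002E,  # ONE DOT LEADER -> FULL STOP
}


def prepare_name(name: str) -> str:
    """Return a comparison key: apply the extra mappings, then NFC."""
    return unicodedata.normalize("NFC", name.translate(EXTRA_MAPPINGS))
```

The extra table is needed because NFC alone leaves U+2011 untouched,
which is exactly why existing filesystems that only normalize would not
treat it as equal to U+002D.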

> > > You really MUST not use the term "canonical" with NFKC and NFKD because for
> > > Unicode, the two forms NFC and NFD are considered "canonical" and that term
> > > is used contrastively to "compatibility".
> > The word 'canonical' was preceded by 'two' -- I think that's pretty
> > clear!  :)
>
> No. You MUST not use that term for any NFKx because the two NFKx are NOT
> "canonical" normalization forms as that term is defined in Unicode. Using
> Unicode-defined terms in ways that conflict with Unicode-defined
> terminology is a sure recipe for confusion.

My quick search in Unicode 13.0, chapters 2 and 3, finds examples that I
think support permitting my above formulation.  Still, if you insist,
I'll change that instance of 'canonical' to 'normalization'.

> > [...]
>
> Can't speak to performance issues - that's a separate one from the logical
> distinction between K and not-K.

I _suspect_ that Apple engineers chose NFD (or something very close to
NFD) because it would perform better than NFC.  I don't have it at hand,
but I seem to remember a blog post from almost twenty years ago about
that.

> > [...]
>
> I think NFD is a bad choice, because it rarely matches "raw" data. Most data
> is either unnormalized or (almost) in NFC.

Because ZFS is form-preserving, the choice of NFC or NFD here does not
affect anything other than performance and bits-on-disk.  In particular,
applications cannot tell the difference between a ZFS dataset that uses
NFC vs one that uses NFD.

For HFS+, however, NFD does have negative effects and it was a bad
choice.

So I agree with you, and in fact, that is the reason I pushed for
form-insensitive/preserving behavior in ZFS.
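
To illustrate what I mean by form-insensitive and form-preserving, here
is a toy sketch.  It is not how ZFS is actually implemented; the class
and method names are made up for illustration.  Names are stored exactly
as created, but matched using a normalized key:

```python
import unicodedata


class Directory:
    """Toy directory: names are stored as given, matched by normalized key."""

    def __init__(self, form: str = "NFD"):
        self.form = form   # internal normalization form, never shown to callers
        self.entries = {}  # normalized key -> (name as created, payload)

    def _key(self, name: str) -> str:
        return unicodedata.normalize(self.form, name)

    def create(self, name: str, payload) -> None:
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = (name, payload)

    def lookup(self, name: str):
        return self.entries[self._key(name)][1]

    def listdir(self) -> list:
        # Return names exactly as the creator supplied them.
        return [stored for stored, _ in self.entries.values()]
```

Constructing the toy directory with "NFC" instead of "NFD" changes only
the internal keys.  Callers see the same `lookup` and `listdir` results
either way.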

> > Apple's choice of NFD makes sense for performance reasons since it's
> > faster to normalize to form D than to form C (see above).  But
> > normalizing on CREATE to NFD conflicts with input modes that tend to
> > produce pre-composed character codepoints.  Form-insensitivity is the
> > obvious compromise.  Until the Git incident, I had no evidence of that
> > actually causing problems.
>
> In theory it may be faster, because C logically requires performing D first.

Right.

> But as most data is entered in C, all you need to do is verify that, which
> is faster than expanding a string to do D.

I'm not sure that's always easy to do, but I imagine it can be done.

However, Apple engineers may have thought NFD would perform better.  I
suspect so.
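
For what it's worth, the verify-first approach suggested above can be
sketched in a few lines.  This assumes Python 3.8 or later, where
`unicodedata.is_normalized` exists; the function name `to_nfc` is mine:

```python
import unicodedata


def to_nfc(name: str) -> str:
    """Normalize to NFC, skipping the conversion when the input already is NFC."""
    if unicodedata.is_normalized("NFC", name):
        return name  # common case for typed input: nothing to do
    return unicodedata.normalize("NFC", name)
```

Whether the check is actually cheaper than unconditional normalization
depends on the implementation and on the data, so this shows the shape
of the idea rather than a performance claim.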

> > > However, early case-insensitive file systems did not preserve case. Not sure
> > > how rare this has become.
> 
> Windows is now case-preserving, but its predecessor (DOS) was not.
> Or, more precisely, the difference is the file systems - FAT, vs FAT32, NTFS
> etc.

Ah yes, more evidence that in running code, I18N choices are made by the
_filesystems_.

> > So.. sentence-case?  I refer to that later:
> > 
> > |   The only way to implement I18N behaviors in the VFS layer rather than
> > |   at the filesystem is to abandon form- and case-preserving behaviors.
> > |   For case-insensitivity this would require using sentence-case, or all
> > |   lower-case, perhaps, and all such choices would surely be surprising
> > |   to users.  At any rate, that approach would also render much running
> > |   code "non-compliant" with any Internet filesystem protocol I18N
> > |   specification.
> 
> Why would you need to do sentence case?

I wouldn't.  That text is in a section discussing where to place I18N
behaviors.  I argue that placing them anywhere else than the filesystem
proper causes problems.  In particular, putting case-insensitivity
anywhere other than the filesystem means that in practice one cannot get
case-preserving behavior without caching entire directory listings at
whatever layer is implementing case-insensitivity.
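
A small sketch of why that is.  Assume a layer sitting above a
case-sensitive, case-preserving filesystem; the function name is
hypothetical.  To find the stored spelling of a name, that layer has no
choice but to enumerate the directory (or keep a cached copy of the
listing):

```python
import os


def lookup_case_insensitive(directory: str, wanted: str):
    """Find the on-disk spelling of `wanted`, ignoring case.

    Returns the stored name, or None if nothing matches.
    """
    target = wanted.casefold()
    for stored in os.listdir(directory):  # full scan, or a cached copy of it
        if stored.casefold() == target:
            return stored
    return None
```

The filesystem itself can avoid the scan by indexing on the folded name
while storing the original, which is the point of putting the behavior
there.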

> > 'here' == "this document".  I'll clarify.
> Will "this document" create a registry? (Haven't read far enough, and
> the "either" makes this a statement of intent rather than a
> description of what "this document" does).

I would rather NOT create a registry but use CLDR if it can be
appropriate.

I would also settle for no site- or locale-specific case-folding
tailorings.

And I might settle for letting clients fetch the set of tailorings.

There are choices to be made here.
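
To make the tailoring question concrete, here is a sketch using the
Turkic special case from Unicode's CaseFolding.txt (the "T" status
lines).  The function name and the `turkic` flag are illustrative
assumptions:

```python
# Turkic tailoring: I -> dotless i (U+0131), I WITH DOT ABOVE (U+0130) -> i
TURKIC = {0x0049: 0x0131, 0x0130: 0x0069}


def fold(name: str, turkic: bool = False) -> str:
    """Case-fold a name, optionally applying the Turkic tailoring first."""
    if turkic:
        name = name.translate(TURKIC)
    return name.casefold()
```

With default folding, "FILE" and "file" fold to the same key.  With the
tailoring enabled they do not, so a client and a server that disagree on
tailorings will disagree on which names collide.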

> > > >     "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no
> > > >      attempt at normalization or case folding done anywhere in between.
> > > In Unicode parlance: "just use strings of code units".
> > "just-use-8" means strings of 8-bit words that could be... any 8-bit
> > codeset, or UTF-8, or whatever.
> > 
> > "just-use-16" is almost certainly "as in UCS-2" or "as in UTF-16",
> > though for all I know there may be some non-Unicode 16-bit wchar_t
> > codesets in use as "just-use-16".
> 
> Unicode goes to great length to provide clear definitions of code points
> (things that are interpreted as assigned to some character) and code units
> (things that are the building blocks of sequences that are matched by UTF-8
> or UTF-16 to code points).
> 
> Check UTR#17 "Character Encoding Model".

Does Unicode have terminology for "a string of code units from an unknown
codeset and encoding"?

> This model works even for other character encodings, whether ISO 2022 or
> DBCS.

Right, but in the context in question there's no knowledge of the
codesets/encodings in use.  I believe it's important to make that clear.

> Now, if you want to use these less formal terms, that might be fine, but you
> should anchor them by relating them to the more formal definitions. "Just-8"
> treats filenames as strings of code units (uninterpreted bytes that at some
> other point are converted to code points that then can be interpreted as
> characters and subjected to transformations like case-folding and
> normalization).

I might add a glossary for them.  Good idea.

> "Just-16" is the same for the code units of UTF-16 (that being the only pure
> 16-bit format I can think of right now, but if others exist, the same would
> apply for them), but which could be big-endian or little-endian.
> Interestingly that distinction is on a level below the code unit; it belongs
> to the "serialized code unit".
> 
> So, what you are saying is that some parts of the architecture deal in raw
> strings of "serialized code units" (if that part is "in memory" only, then
> the distinction is moot and we are back at strings of code units). (See
> Section 5.1 in UTR#17.)

Correct, this is in memory, but with no knowledge of the codeset/
encoding in use.

For Internet filesystem protocols, of course, we do and will insist on
Unicode on the wire.  My thesis is that dealing with normalization and
case insensitivity belongs in the filesystem, not in the server (as in
NFSv4 server).  Though it _also_ belongs in the client when it is a
_caching_ client.  (Not every NFSv4 client will cache directory
contents.)

> > > Specifically for UTF-8, this would imply that there are also no guarantees
> > > of well-formedness of the UTF-8 strings (likewise for surrogate pairs in
> > > UTF-16).
> > 
> > Well, yes.  But that wasn't what I was alluding to.  I was referring to,
> > e.g., using Linux (or *BSD, or Solaris/Illumos) in, say, an ISO-8859-1
> > locale.  In that case you might have an application call `creat(2)` with
> > a `pathname` that contains non-ASCII that is also not UTF-8.
> 
> Yes, that's implied in "8-bit code unit" - it's not something that's
> specific to UTF-8. You have both: raw string of code units that may
> represent UTF-8 (well-formed or not) or that may represent any other
> encoding scheme with 8-bit code units.

Right.  I really want to denote "8-bit code units, codeset/encoding
unknown".
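
A tiny illustration of the problem, using a made-up filename: the same
bytes are a perfectly good name in one 8-bit codeset, a different name
in another, and not valid UTF-8 at all.

```python
# "café.txt" as written by an application running in an ISO-8859-1 locale.
raw = b"caf\xe9.txt"


def is_valid_utf8(name: bytes) -> bool:
    """Report whether the bytes happen to be well-formed UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


as_latin1 = raw.decode("iso-8859-1")  # what the creating application meant
as_koi8r = raw.decode("koi8-r")       # what a KOI8-R user would see instead
```

Nothing in the bytes themselves says which reading is right, which is
what "codeset/encoding unknown" has to capture.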

> > The C library system call stub will *not* convert this to UTF-8 or
> > anything else, nor will the kernel-land system call know anything about
> > the locale in use in user-land.  The given `pathname` string will just
> > be a pointer to a zero-terminated array of `char` in user-land, and will
> > be copied into the kernel as such by the kernel-side of the system call,
> > then it will be passed to the VFS.
> > 
> > The VFS will interpret just two byte values specially: 0x00 (ASCII NUL)
> > and 0x2F (ASCII '/').  The filesystems will generally only interpret
> > 0x00 specially, though some (e.g., ZFS) may reject strings that are not
> > valid UTF-8.
> > 
> > We call this "just-use-8" in some circles (e.g., in the KITTEN WG).  I
> > don't recall who coined that.
> 
> It's a cute name, but it needs to be defined carefully in terms like those
> of the Character Encoding Model. We need to strive to remove confusion, not
> add to it.

That's fair.  I'll add a glossary.
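
In the meantime, a sketch of the "just-use-8" path handling quoted above
may help pin the term down.  It is a simplification, not any particular
kernel's code, and the function name is hypothetical.  Only 0x00 and
0x2F get any interpretation; every other byte passes through untouched:

```python
def split_path(path: bytes) -> list:
    """Split a path the way a byte-oriented VFS would."""
    # 0x00 terminates the string, as it does for a C `char *`.
    path = path.split(b"\x00", 1)[0]
    # 0x2F separates components; empty components are dropped.
    return [component for component in path.split(b"/") if component]
```

For example, `split_path(b"/tmp/caf\xe9/x")` yields the components
`b"tmp"`, `b"caf\xe9"` and `b"x"`, with no attempt to decode or validate
the middle one.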

Nico
--