Re: [I18ndir] I-D on filesystem I18N

Asmus Freytag <asmusf@ix.netcom.com> Wed, 08 July 2020 02:16 UTC

To: i18ndir@ietf.org
References: <20200706225139.GJ3100@localhost> <90740541-ab72-ffaf-ff3e-5a27b5805eae@ix.netcom.com> <20200708012944.GY3100@localhost>
From: Asmus Freytag <asmusf@ix.netcom.com>
Message-ID: <de2ff6a1-7437-96f1-3281-24b41522192b@ix.netcom.com>
Date: Tue, 07 Jul 2020 19:16:01 -0700
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
MIME-Version: 1.0
In-Reply-To: <20200708012944.GY3100@localhost>
Content-Type: multipart/alternative; boundary="------------6C5AC9C56B3678E9B679DFCF"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/sDA1fnN7MqrVYtey-W3LOGG70mA>
Subject: Re: [I18ndir] I-D on filesystem I18N
Precedence: list

On 7/7/2020 6:29 PM, Nico Williams wrote:
> On Tue, Jul 07, 2020 at 04:43:42PM -0700, Asmus Freytag wrote:
>> On 7/6/2020 3:51 PM, Nico Williams wrote:
>>> I've submitted draft-williams-filesystem-18n-00.
>> Here are my comments:
> Thanks!
>
>>>      This document describes requirements for internationalization (I18N)
>>>      of filesystems specifically in the context of Internet protocols, the
>>>      architecture for filesystems in most currently popular general
>>>      purpose operating systems, and their implications for filesystem
>>>      I18N.  From the I18N requirements for filesystems and the
>> The first sentence doesn't scan - the constructions joined by "and"
>> seemingly are not parallel enough.
> I think the "their" in "and their implications" is unclear as to
> antecedent.  I'll wordsmith it.
>
>>>      [TBD: Add references galore.  How to reference Unicode?  How to
>> If you go on the Unicode site and look for the page on the latest version
>> (currently https://www.unicode.org/versions/Unicode13.0.0/), you can
>> navigate to suggested ways to reference. I would cite the latest version as
>> of the time of drafting of the I-D and then also write: "the latest version
>> is available at" and give the URL as http://www.unicode.org/versions/latest/.
> The question was about bibxml specifically, though I should have made
> that clear.  I failed to update that comment -- I did figure out how to
> reference Unicode, and did reference it.
>
> I do still need to add a number of other references.  And for the two
> books I referenced I should add section numbers.
>
>> If you need to reference specifically the properties, the character database
>> is at https://www.unicode.org/Public/13.0.0/ or
>> https://www.unicode.org/Public/UCD/latest/ -- You could also link to UAX#44,
>> which is the overview of the Unicode Character Database. The "latest" is at:
>> https://www.unicode.org/reports/tr44/ and the current one, as of today, is
>> at https://www.unicode.org/reports/tr44/tr44-26.html (see the "This version"
>> link for the latest).
> In XML terms... you're saying I should use <eref>, I think.
>
>>>    To deal with the equivalence problem, Unicode defines Normal Forms
>> Unicode calls these "Normalization Forms", see Section 3.11 in TUS (The
>> Unicode Standard) so that is what should be used when capitalized. Ditto for
>> the formal names for NFC and NFD.
>>
>> I think you need to mention already here that NFC/NFD represent a semantic
>> identity (and generally identical appearance) while NFKC and NFKD abstract
>> away rather noticeable differences in appearance that may in some cases
>> imply strong semantic differences to some users (e.g. math alphabets).
> I chose to do so one page later because I wanted to keep the intro
> brief:
>
> |   Unicode compatibility equivalence allows equivalence between
> |   different representations of the same abstract character that may
> |   nonetheless have different visual appearance of behavior.  There are
> |   two canonical forms that support compatibility equivalence: NFKC and
> |   NFKD.  Using NoCL with NFKC or NFKD may be surprising to users in a
> |   visual way.  While form-insensitivity with NFKC or NFKD may surprise
> |   users who might consider two file names distinct even when Unicode
> |   considers them equivalent under compatibility equivalence.  The
> |   latter seems less likely and less surprising, though that is an
> |   entirely subjective judgement.
>
> Here "NoCL" == normalize-on-CREATE-and-LOOKUP, defined just above the
> quoted text.
>
> Does that work for you?
The text continues to have the same problems I mentioned. It also has a
sentence that starts with "While" but consists only of a subordinate
clause, with no main clause.

We agree that forcing normalization using either NFKx can surprise users 
in a bad way (although for some subset of the equivalences, some subset 
of users might like form-insensitivity: full/half-width character
variations are often treated very similarly to case variation for East
Asian users).

Should a file system allow a form-sensitive "no break hyphen"? That one
is worth discussing as a downside to the "just-8" type approach. It 
would be annoying for users having to guess the invisible no-break 
nature. (This would belong in some section on i18n pitfalls).
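
To make the canonical vs. compatibility distinction concrete, here is a
minimal sketch using Python's standard unicodedata module. The sample
characters are illustrative picks, not taken from the draft. It shows that
NFC only unifies canonically equivalent spellings, while NFKC also folds
away math alphabets, width variants and the no-break property of U+2011.

```python
import unicodedata

def forms(s):
    """Return the four normalization forms of s as a dict."""
    return {f: unicodedata.normalize(f, s)
            for f in ("NFC", "NFD", "NFKC", "NFKD")}

precomposed = "\u00e9"        # e-acute as one code point
decomposed = "e\u0301"        # e + COMBINING ACUTE ACCENT
math_bold_a = "\U0001d400"    # MATHEMATICAL BOLD CAPITAL A
fullwidth_a = "\uff21"        # FULLWIDTH LATIN CAPITAL LETTER A
nb_hyphen = "\u2011"          # NON-BREAKING HYPHEN

samples = (math_bold_a, fullwidth_a, nb_hyphen)

# Canonical equivalence: both spellings of e-acute meet in NFC.
same_nfc = forms(precomposed)["NFC"] == forms(decomposed)["NFC"]

# Compatibility equivalence: NFC leaves these alone, NFKC rewrites them.
kept_by_nfc = [forms(c)["NFC"] == c for c in samples]
folded_by_nfkc = [forms(c)["NFKC"] for c in samples]
```

Here folded_by_nfkc comes out as plain "A", plain "A" and U+2010 HYPHEN,
which is exactly the kind of change a mathematician, or a user hunting for
an invisible no-break hyphen, would notice.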

>
>>>    Unicode compatibility equivalence allows equivalence between
>>>      different representations of the same abstract character that may
>>>      nonetheless have different visual appearance of behavior.  There are
>>>      two canonical forms that support compatibility equivalence: NFKC and
>          ^^^^^^^^^^^^^
>>>      NFKD.  Using NoCL with NFKC or NFKD may be surprising to users in a
>>>      visual way.  While form-insensitivity with NFKC or NFKD may surprise
>>>      users who might consider two file names distinct even when Unicode
>>>      considers them equivalent under compatibility equivalence.  The
>>>      latter seems less likely and less surprising, though that is an
>>>      entirely subjective judgement.
>> You really MUST not use the term "canonical" with NFKC and NFKD because for
>> Unicode, the two forms NFC and NFD are considered "canonical" and that term
>> is used contrastively to "compatibility".
> The word 'canonical' was preceded by 'two' -- I think that's pretty
> clear!  :)
No. You MUST not use that term for any NFKx because the two NFKx are NOT 
"canonical" normalization forms as that term is defined in Unicode. 
Using Unicode-defined terms in ways that conflict with Unicode-defined
terminology is a sure recipe for confusion.
>> There's an "of" that should be an "or", but it's not just behavior - it's
>> also meaning. Mathematicians would object strenuously to having their math
>> alphabets "normalized" to standard A-Z.
> Ay, yes, thanks!
>
>> NFK(C/D) is useful in a different way: if it is used to disallow code points
>> that aren't stable under these normalization forms, then one sidesteps the
>> issue of whether the distinct appearance etc. is meaningful, but without
>> ever changing a filename (which would happen if it was normalized when
>> stored). (There may be no file-systems that take this approach, nevertheless
>> it's worth discussing as it is used in other naming schemes).
> ZFS (in Solaris and most ports, but not OS X) supports form-insensitivity,
> and lets the user choose a form to use for this.
>
> See https://docs.oracle.com/cd/E71909_01/html/E71919/gpssl.html which
> shows that NFC, NFD, NFKC, and NFKD are offered.
>
> It doesn't make much sense to offer NFC or NFKC for this given that the
> behavior is form-insensitive and form-preserving (NFC is specified as
> requiring canonical decomposition as a first step, which means that NFC
> is very likely slower than NFD, and the same applies to NFKC and NFKD).
> But whatever.
Can't speak to performance issues - that's a separate matter from the logical
distinction between K and not-K.
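
The "disallow rather than normalize" use of NFKC mentioned in the quoted
comment can be sketched as follows. This is a simplification under stated
assumptions: it tests stability of the whole string rather than of
individual code points, and check_new_name is a hypothetical helper, not
part of any real filesystem API.

```python
import unicodedata

def is_nfkc_stable(name: str) -> bool:
    """True if NFKC normalization would leave name unchanged."""
    return unicodedata.normalize("NFKC", name) == name

def check_new_name(name: str) -> str:
    """Hypothetical CREATE-time check: reject, never rewrite."""
    if not is_nfkc_stable(name):
        raise ValueError("name is not stable under NFKC")
    return name  # stored exactly as the caller supplied it

def accepted(name: str) -> bool:
    """Convenience wrapper: report whether check_new_name accepts name."""
    try:
        check_new_name(name)
    except ValueError:
        return False
    return True
```

With this check, "report.txt" and a precomposed e-acute are accepted,
while a name containing U+1D400 or a decomposed e + U+0301 is refused. No
stored name is ever altered, so the question of whether the distinct
appearance was meaningful never has to be answered.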
>
> In at least the original port to OS X, before Apple abandoned it, ZFS
> was made to normalize-on-CREATE (and LOOKUP) instead of being
> form-preserving.  This was probably done to match HFS+'s behavior.  I
> remember a post by Linus Torvalds complaining bitterly about this and
> blaming ZFS in general rather than Apple, when some user complained that
> Git did not handle this case correctly.  The issue was that Git looked
> for the file using the name it knew from the repository, and that was in
> NFC because of course the input mode used when the file was committed
> produced something NFC-ish, but on HFS+ this got converted to NFD, and
> Git then failed to find the file by memcmp() in the directory listing.
> Ironically, that very possibility had been my motivation for pushing for
> form-insensitive and form-preserving behavior in ZFS in Solaris to begin
> with, but Apple engineers insisted on the HFS+ behavior in their port.
I think NFD is a bad choice, because it rarely matches "raw" data. Most 
data is either unnormalized or (almost) in NFC.
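
The mismatch described in the quoted text is easy to reproduce in
miniature. The sketch below assumes a directory listing that has been
forced to NFD and a caller that holds the NFC spelling. Both lookup
functions are hypothetical illustrations, not Git or ZFS code.

```python
import unicodedata

def lookup_bytewise(listing, name):
    """memcmp()-style lookup: the exact code point sequence must match."""
    return name if name in listing else None

def lookup_form_insensitive(listing, name):
    """Compare under NFC, but return the name exactly as stored."""
    key = unicodedata.normalize("NFC", name)
    for stored in listing:
        if unicodedata.normalize("NFC", stored) == key:
            return stored
    return None

# Name as committed (NFC-ish input) and as a normalize-on-CREATE
# filesystem using NFD would hand it back in a directory listing.
committed = "r\u00e9sum\u00e9.txt"
listing = [unicodedata.normalize("NFD", committed)]

found_bytewise = lookup_bytewise(listing, committed)
found_insensitive = lookup_form_insensitive(listing, committed)
```

The exact-match lookup finds nothing, while the form-insensitive one
finds the entry and reports it in its stored form.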
>
> Apple's choice of NFD makes sense for performance reasons since it's
> faster to normalize to form D than to form C (see above).  But
> normalizing on CREATE to NFD conflicts with input modes that tend to
> produce pre-composed character codepoints.  Form-insensitivity is the
> obvious compromise.  Until the Git incident, I had no evidence of that
> actually causing problems.
In theory it may be faster, because C logically requires performing D 
first. But as most data is entered in C, all you need to do is verify 
that, which is faster than expanding a string to do D.
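
The "verify first" strategy can be sketched with Python's
unicodedata.is_normalized (available from Python 3.8). Whether the check
is actually cheaper than normalizing depends on the implementation and
the data, so treat this as an illustration of the control flow, not a
benchmark.

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Normalize to NFC, skipping the rewrite when s is already NFC."""
    if unicodedata.is_normalized("NFC", s):
        return s                      # common case: nothing to do
    return unicodedata.normalize("NFC", s)

already_nfc = to_nfc("caf\u00e9")     # input already in NFC
was_nfd = to_nfc("cafe\u0301")        # decomposed input gets composed
```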
>
> Git nowadays contains normalization code to deal with this, whereas it
> could have not had any if we'd had consensus on form-insensitivity a few
> years before we added it to ZFS.  Bummer.  Still, form-insensitivity is
> clearly the better behavior, though I'm not proposing that we mandate
> it.
>
>>>      foldings are defined by Unicode.  Generally, case-insensitive
>>>      filesystems preserve original case just form-insensitive filesystems
>>>      preserve original form.
>> There's an "as" missing.
> Thanks :)
>
>> However, early case-insensitive file systems did not preserve case. Not sure
>> how rare this has become.

Windows is now case-preserving, but its predecessor (DOS) was not.
Or, more precisely, the difference is in the file systems: FAT vs. FAT32,
NTFS etc.

> So.. sentence-case?  I refer to that later:
>
> |   The only way to implement I18N behaviors in the VFS layer rather than
> |   at the filesystem is to abandon form- and case-preserving behaviors.
> |   For case-insensitivity this would require using sentence-case, or all
> |   lower-case, perhaps, and all such choices would surely be surprising
> |   to users.  At any rate, that approach would also render much running
> |   code "non-compliant" with any Internet filesystem protocol I18N
> |   specification.

Why would you need to do sentence case?


>
>>>      listings that work the same way as on the server.  We do not specify
>>>      any case foldings here.  Instead we will either create a registry of
>> the "here" is unclear. Does it refer to a recommendation in this I-D
>> relevant to caching clients? If so, link to section.
> 'here' == "this document".  I'll clarify.
Will "this document" create a registry? (Haven't read far enough, and
the "either" makes this a statement of intent rather than a description
of what "this document" does).
>
>>>     "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no
>>>      attempt at normalization or case folding done anywhere in between.
>> In Unicode parlance: "just use strings of code units".
> "just-use-8" means strings of 8-bit words that could be... any 8-bit
> codeset, or UTF-8, or whatever.
>
> "just-use-16" is almost certainly "as in UCS-2" or "as in UTF-16",
> though for all I know there may be some non-Unicode 16-bit wchar_t
> codesets in use as "just-use-16".

Unicode goes to great lengths to provide clear definitions of code points
(things that are interpreted as assigned to some character) and code
units (things that are the building blocks of sequences that are mapped
by UTF-8 or UTF-16 to code points).

Check UTR#17 "Character Encoding Model".

This model works even for other character encodings, whether ISO 2022 or 
DBCS.

Now, if you want to use these less formal terms, that might be fine, but 
you should anchor them by relating them to the more formal definitions. 
"Just-8" treats filenames as strings of code units (uninterpreted bytes 
that at some other point are converted to code points that then can be 
interpreted as characters and subjected to transformations like 
case-folding and normalization).

"Just-16" is the same for the code units of UTF-16 (that being the only 
pure 16-bit format I can think of right now, but if others exist, the 
same would apply for them), but which could be big-endian or 
little-endian. Interestingly that distinction is on a level below the 
code unit; it belongs to the "serialized code unit".

So, what you are saying is that some parts of the architecture deal in
raw strings of "serialized code units" (if that part is "in memory"
only, then the distinction is moot and we are back at strings of code
units). (See Section 5.1 in UTR#17.)
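
The three levels (code points, code units, serialized code units) can be
made visible with a short Python sketch. The helper utf16_code_units and
the sample character are illustrative assumptions. Only standard codecs
are used.

```python
import struct

def utf16_code_units(s: str):
    """16-bit code units of s, independent of any byte order."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

s = "\U0001d400"                       # MATHEMATICAL BOLD CAPITAL A

code_points = [ord(c) for c in s]      # one code point: 0x1D400
utf8_units = list(s.encode("utf-8"))   # four 8-bit code units
utf16_units = utf16_code_units(s)      # two 16-bit code units (surrogates)

# Byte order only appears once 16-bit code units are serialized.
serialized_be = s.encode("utf-16-be")
serialized_le = s.encode("utf-16-le")
```

The UTF-16 code units are the same list in both cases. Only the
serialized byte strings differ, which is the sense in which endianness
sits below the code unit level.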


>
>> Specifically for UTF-8, this would imply that there are also no guarantees
>> of well-formedness of the UTF-8 strings (likewise for surrogate pairs in
>> UTF-16).
> Well, yes.  But that wasn't what I was alluding to.  I was referring to,
> e.g., using Linux (or *BSD, or Solaris/Illumos) in, say, an ISO-8859-1
> locale.  In that case you might have an application call `creat(2)` with
> a `pathname` that contains non-ASCII that is also not UTF-8.

Yes, that's implied in "8-bit code unit" - it's not something that's 
specific to UTF-8. You have both: raw strings of code units that may
represent UTF-8 (well-formed or not) or that may represent any other 
encoding scheme with 8-bit code units.
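
As a small illustration, the same visible name can arrive as two
different strings of 8-bit code units, only one of which is well-formed
UTF-8. The classify helper below is hypothetical. A just-8 filesystem
would not run any such check.

```python
def classify(name: bytes) -> str:
    """Report whether a raw 8-bit name happens to be well-formed UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not well-formed UTF-8"
    return "well-formed UTF-8"

# The same four characters as produced in two different locales.
latin1_name = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'
utf8_name = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'

latin1_result = classify(latin1_name)
utf8_result = classify(utf8_name)
```

Note that passing the check does not prove the creator intended UTF-8.
It only shows that the bytes could be interpreted that way.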

>
> The C library system call stub will *not* convert this to UTF-8 or
> anything else, nor will the kernel-land system call know anything about
> the locale in use in user-land.  The given `pathname` string will just
> be a pointer to a zero-terminated array of `char` in user-land, and will
> be copied into the kernel as such by the kernel-side of the system call,
> then it will be passed to the VFS.
>
> The VFS will interpret just two byte values specially: 0x00 (ASCII NUL)
> and 0x2F (ASCII '/').  The filesystems will generally only interpret
> 0x00 specially, though some (e.g., ZFS) may reject strings that are not
> valid UTF-8.
>
> We call this "just-use-8" in some circles (e.g., in the KITTEN WG).  I
> don't recall who coined that.

It's a cute name, but it needs to be defined carefully in terms like 
those of the Character Encoding Model. We need to strive to remove 
confusion, not add to it.
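
In Character Encoding Model terms, the behavior in the quoted text is
path handling on uninterpreted 8-bit code units. A minimal sketch,
assuming a hypothetical split_path helper and treating an embedded 0x00
as an error (in C it would simply terminate the string):

```python
def split_path(path: bytes):
    """Split a raw pathname into components.

    Only 0x00 and 0x2F are treated specially. Every other byte value is
    opaque, so no encoding is assumed and nothing is normalized.
    """
    if b"\x00" in path:
        raise ValueError("embedded NUL in pathname")
    return [comp for comp in path.split(b"/") if comp]

# A Latin-1 component passes through untouched.
components = split_path(b"/tmp/caf\xe9/notes.txt")
```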

> I expect something similar happens in Windows for both, the *A()
> (just-use-8?) and *W() functions (just-use-16?).  But I'm not
> sufficiently experienced a Windows programmer to really know.
>
> "just-use-16" is my coinage as an obvious variation of "just-use-8".
>
>> -- reached end of section 1 and out of time slot --
> Thanks so much!
>
> Nico
