Re: [I18ndir] I-D on filesystem I18N

"Asmus Freytag (c)" <asmusf@ix.netcom.com> Wed, 08 July 2020 04:25 UTC

To: Nico Williams <nico@cryptonector.com>
Cc: i18ndir@ietf.org
References: <20200706225139.GJ3100@localhost> <90740541-ab72-ffaf-ff3e-5a27b5805eae@ix.netcom.com> <20200708012944.GY3100@localhost> <de2ff6a1-7437-96f1-3281-24b41522192b@ix.netcom.com> <20200708030932.GD3100@localhost>
From: "Asmus Freytag (c)" <asmusf@ix.netcom.com>
Message-ID: <b4cc2c06-7454-6766-84e3-b46459a2168d@ix.netcom.com>
Date: Tue, 07 Jul 2020 21:25:38 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/aXb9ncDnREyJxehdeYOolXeDNcM>

On 7/7/2020 8:09 PM, Nico Williams wrote:
> On Tue, Jul 07, 2020 at 07:16:01PM -0700, Asmus Freytag wrote:
>> On 7/7/2020 6:29 PM, Nico Williams wrote:
>>> On Tue, Jul 07, 2020 at 04:43:42PM -0700, Asmus Freytag wrote:
>>>> On 7/6/2020 3:51 PM, Nico Williams wrote:
> ...
>
>> We agree that forcing normalization using either NFKx can surprise users in
>> a bad way (although for some subset of the equivalences, some subset of
>> users might like form-insensitivity: full/half-width character variations
>> are often treated very similarly to case variation for East Asian users).
> Yes.
(we agree on a lot - even if I heckle you on some wording issues :) )
>
>> Should a file system allow a form-sensitive "no break hyphen" ? That one is
>> worth discussing as a downside to the "just-8" type approach. It would be
>> annoying for users having to guess the invisible no-break nature. (This
>> would belong in some section on i18n pitfalls).
> "just-use-8" is a reality of running code, not something I'm proposing.
>
> String preparation on CREATE and LOOKUP, and form-insensitive/preserving
> behaviors are both ways to deal with the "just-use-8" reality.
>
> Mapping NON-BREAKING HYPHEN U+2011 to HYPHEN-MINUS U+002D is certainly
> an option.
>
> If ZFS and/or HFS+ and/or some other filesystems don't already do this
> (I believe they don't), it will be difficult to make them start doing
> it, since U+2011 is not considered equivalent to U+002D in any NF, and
> changes like this generally can only be enabled in new filesystems.
> However, it should be possible to make ZFS map it in _new_ filesystems
> as an option (possibly enabled by default).
>
> (Note: in ZFS terminology, a filesystem is known as a 'dataset'.)
>
> That's not the only such mapping worth considering, of course.  There's
> plenty of others, such as various whitespace and period characters.

I get you on running code, but pointing out small but crucial
improvements like that is probably worthwhile, provided interoperation
with other filesystems isn't made worse. Some of these behaviors are so
broken, like treating U+2011 as distinct from U+002D, that it would be
interesting to know whether anyone ever created two filenames that
relied on that distinction.
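
To make the U+2011 point concrete, here is a minimal Python sketch using
the standard unicodedata module. It shows that none of the four
normalization forms maps NON-BREAKING HYPHEN to HYPHEN-MINUS (the
compatibility forms only take it as far as U+2010 HYPHEN), so a
filesystem would need an extra mapping step on top of normalization.
The helper names forms() and fold_hyphens() are invented for this
illustration and do not describe what ZFS, HFS+ or any other filesystem
actually does.

```python
import unicodedata

NB_HYPHEN = "\u2011"     # NON-BREAKING HYPHEN
HYPHEN = "\u2010"        # HYPHEN
HYPHEN_MINUS = "\u002d"  # HYPHEN-MINUS

def forms(name):
    """Return the name under each of the four Unicode normalization forms."""
    return {f: unicodedata.normalize(f, name)
            for f in ("NFC", "NFD", "NFKC", "NFKD")}

def fold_hyphens(name):
    """Hypothetical extra mapping a filesystem could apply on CREATE/LOOKUP."""
    return name.replace(NB_HYPHEN, HYPHEN_MINUS).replace(HYPHEN, HYPHEN_MINUS)

with_nb_hyphen = "read\u2011me.txt"
with_hyphen_minus = "read-me.txt"

results = forms(with_nb_hyphen)
for form, value in results.items():
    # No normalization form makes the two names compare equal.
    print(form, ascii(value), value == with_hyphen_minus)

# Only the extra, non-standard mapping makes them collide.
print(fold_hyphens(with_nb_hyphen) == with_hyphen_minus)
```

Whether such a mapping can be applied to existing filesystems is a
separate question; the sketch only shows that normalization alone does
not provide it.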

I need to read the remainder of your I-D - but probably not tonight, and 
not only because I'm still discussing this batch of feedback.


>
>>>> You really MUST not use the term "canonical" with NFKC and NFKD because for
>>>> Unicode, the two forms NFC and NFD are considered "canonical" and that term
>>>> is used contrastively to "compatibility".
>>> The word 'canonical' was preceded by 'two' -- I think that's pretty
>>> clear!  :)
>> No. You MUST not use that term for any NFKx because the two NFKx are NOT
>> "canonical" normalization forms as that term is defined in Unicode. Using
>> Unicode-defined terms in ways that conflict with Unicode-defined
>> terminology is a sure recipe for confusion.
> My quick search in Unicode 13.0, chapters 2 and 3, finds examples that I
> think support permitting my above formulation.  Still, if you insist,
> I'll change that instance of 'canonical' to 'normalization'.

The terms "canonical equivalence" and "canonical de/composition" are
used contrastively to "compatibility equivalence" and "compatibility
decomposition". That is why NFC/NFD are commonly known as the
"canonical normalization" forms, even if that term is not formally
defined that way.

NFKC and NFKD are, by the way, in my opinion design mistakes. Instead, 
there should have been a series of compatibility foldings. Water under 
the bridge.

>
>>> [...]
>> Can't speak to performance issues - that's a separate one from the logical
>> distinction between K and not-K.
> I _suspect_ that Apple engineers chose NFD (or something very close to
> NFD) because it would perform better than NFC.  I don't have it at hand,
> but I seem to remember a blog post from almost twenty years ago about
> that.

Probably an entirely different reasoning:

For a while, it looked like additional composed characters could be 
added to Unicode, which would have made only NFD stable. That got fixed.

Also, NFD appeals to purists, because all diacritical marks are treated 
the same way: as explicit characters.

>
>>> [...]
>> I think NFD is a bad choice, because it rarely matches "raw" data. Most data
>> is either unnormalized or (almost) in NFC.
> Because ZFS is form-preserving, the choice of NFC or NFD here does not
> affect anything other than performance and bits-on-disk.  In particular,
> applications cannot tell the difference between a ZFS dataset that uses
> NFC vs one that uses NFD.
>
> For HFS+, however, NFD does have negative effects and it was a bad
> choice.
>
> So I agree with you, and in fact, that is the reason I pushed for
> form-insensitive/preserving behavior in ZFS.
>
>>> Apple's choice of NFD makes sense for performance reasons since it's
>>> faster to normalize to form D than to form C (see above).  But
>>> normalizing on CREATE to NFD conflicts with input modes that tend to
>>> produce pre-composed character codepoints.  Form-insensitivity is the
>>> obvious compromise.  Until the Git incident, I had no evidence of that
>>> actually causing problems.
>> In theory it may be faster, because C logically requires performing D first.
> Right.
>
>> But as most data is entered in C, all you need to do is verify that, which
>> is faster than expanding a string to do D.
> I'm not sure that's always easy to do, but I imagine it can be done.
Yes, there are even data tables that allow you to do that "quick check".

For the Latin script, the number of characters that require a sequence
to encode in NFC is amazingly small (and those are required by some
African writing systems). Chances are, therefore, that 100% of your
filenames are in NFC in a form-preserving system (unless you have
machine-generated random file names including diacritics).

Some South Asian scripts tend to have sequences that are unnormalized, 
rather than NFC or NFD. Often because diacritics are ordered the way 
fonts like them best. For those, both NFD and NFC take the hit from 
reordering (the number of 'compositions' tends to be small-ish, so the 
performance penalty for NFC cannot be that big).
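
A minimal Python sketch of the "verify first, normalize only if needed"
idea, assuming Python 3.8 or later, where the standard library offers
unicodedata.is_normalized(). The function name to_nfc() and the sample
strings are made up for this illustration; it is not a claim about how
any particular filesystem implements the check.

```python
import unicodedata

def to_nfc(name):
    """Return name in NFC, skipping the rewrite when it is already NFC."""
    # is_normalized() can often answer from quick-check data alone,
    # without building a normalized copy of the string.
    if unicodedata.is_normalized("NFC", name):
        return name
    return unicodedata.normalize("NFC", name)

precomposed = "caf\u00e9"        # typical keyboard input, already NFC
decomposed = "cafe\u0301"        # e + COMBINING ACUTE ACCENT
no_precomposed = "\u025b\u0303"  # open e + tilde: a sequence even in NFC
misordered = "a\u0301\u0323"     # acute (ccc 230) before dot below (ccc 220)

print(to_nfc(precomposed) is precomposed)               # fast path, no copy
print(to_nfc(decomposed) == precomposed)                # composed on demand
print(unicodedata.is_normalized("NFC", no_precomposed))
print(ascii(unicodedata.normalize("NFD", misordered)))  # marks get reordered
print(ascii(to_nfc(misordered)))
```

The last two lines illustrate the reordering cost mentioned above: an
unnormalized mark order has to be fixed for NFD and NFC alike.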
>
> However, Apple engineers may have thought NFD would perform better.  I
> suspect so.
>
>>>> However, early case-insensitive file systems did not preserve case. Not sure
>>>> how rare this has become.
>> Windows is now case-preserving, but its predecessor (DOS) was not.
>> Or, more precisely, the difference is the file systems - FAT, vs FAT32, NTFS
>> etc.
> Ah yes, more evidence that in running code, I18N choices are made by the
> _filesystems_.
I think you've made that point, and well.
>
>>> So.. sentence-case?  I refer to that later:
>>>
>>> |   The only way to implement I18N behaviors in the VFS layer rather than
>>> |   at the filesystem is to abandon form- and case-preserving behaviors.
>>> |   For case-insensitivity this would require using sentence-case, or all
>>> |   lower-case, perhaps, and all such choices would surely be surprising
>>> |   to users.  At any rate, that approach would also render much running
>>> |   code "non-compliant" with any Internet filesystem protocol I18N
>>> |   specification.
>> Why would you need to do sentence case?
> I wouldn't.  That text is in a section discussing where to place I18N
> behaviors.  I argue that placing them anywhere else than the filesystem
> proper causes problems.  In particular, putting case-insensitivity
> anywhere other than the filesystem means that in practice one cannot get
> case-preserving behavior without caching entire directory listings at
> whatever layer is implementing case-insensitivity.
Right. I think I've seen some apps present Windows filenames using
sentence casing, and it has driven me bananas, even if it doesn't
matter in terms of accessing the file.
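
The case-insensitive but case-preserving behavior under discussion can
be sketched in a few lines of Python. This is only an illustration: the
Directory class and lookup_key() are hypothetical names, and the key
used here, NFD(casefold(NFD(name))), is an assumed choice with no
locale-specific tailoring, not the rule of any existing filesystem.

```python
import unicodedata

def lookup_key(name):
    """Assumed matching key: insensitive to case and to normalization form."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())

class Directory:
    """Toy directory: matches on the folded key, stores the name as created."""

    def __init__(self):
        self._entries = {}  # folded key -> name exactly as given on CREATE

    def create(self, name):
        key = lookup_key(name)
        if key in self._entries:
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name):
        # Returns the preserved name; raises KeyError if there is no match.
        return self._entries[lookup_key(name)]

    def listing(self):
        return list(self._entries.values())

d = Directory()
created = "Re\u0301sume\u0301.TXT"          # decomposed, mixed case
d.create(created)
print(d.lookup("r\u00e9sum\u00e9.txt") == created)  # found despite case/form
print(d.listing() == [created])             # listed exactly as created
```

Because the preserved name lives next to the key, a layer that only
sees lookups cannot reproduce this without holding the whole listing,
which is the caching problem described in the quoted text.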
>
>>> 'here' == "this document".  I'll clarify.
>> Will "this document" create a registry? (Haven't read far enough, and
>> the "either" makes this a statement of intent rather than a
>> description of what "this document" does).
> I would rather NOT create a registry but use CLDR if it can be
> appropriate.
>
> I would also settle for no site- or locale-specific case-folding
> tailorings.
>
> And I might settle for letting clients fetch the set of tailorings.
>
> There are choices to be made here.

I'm not deep enough into your document yet to know the details.

>
>>>>>      "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE <https://tools.ietf.org/html/draft-williams-filesystem-18n-00#ref-UNICODE>]), with no
>>>>>       attempt at normalization or case folding done anywhere in between.
>>>> In Unicode parlance: "just use strings of code units".
>>> "just-use-8" means strings of 8-bit words that could be... any 8-bit
>>> codeset, or UTF-8, or whatever.
>>>
>>> "just-use-16" is almost certainly "as in UCS-2" or "as in UTF-16",
>>> though for all I know there may be some non-Unicode 16-bit wchar_t
>>> codesets in use as "just-use-16".
>> Unicode goes to great lengths to provide clear definitions of code points
>> (things that are interpreted as assigned to some character) and code units
>> (things that are the building blocks of sequences that are matched by UTF-8
>> or UTF-16 to code points).
>>
>> Check UTR#17 "Character Encoding Model".
> Does Unicode have terminology for "a string of code units from an unknown
> codeset and encoding"?

A "code unit" is, at first, just that. A UTF-8 code unit would be
something more specific. Go have a look at UTR#17 - it was meant to give
a model that would cover not only Unicode. But what is wrong with using
your quoted phrase in a sentence like:

     We will use the term "just-8" for a string of 8-bit code units
     from an unknown codeset and encoding, with the proviso that the
     xxxx system can recognize "/", "\" and "\0"

(or whatever else it needs to recognize).

Such a sentence, corrected for what systems/protocols do need to know, 
would nicely link to the Unicode terminology.
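
As a hedged illustration of that definition, the following Python
sketch treats a pathname purely as a sequence of 8-bit code units. It
recognizes only the byte values 0x2F and 0x00 and never decodes the
components. The function names split_path() and is_well_formed_utf8()
are invented here, and the optional UTF-8 check stands in for the kind
of validation a filesystem might choose to add; it is not a description
of any real VFS.

```python
SLASH = 0x2F  # the only separator the sketch recognizes
NUL = 0x00    # terminator; must not occur inside a pathname

def split_path(raw: bytes):
    """Split a just-8 pathname into components without interpreting them."""
    if NUL in raw:
        raise ValueError("embedded NUL byte")
    return [part for part in raw.split(bytes([SLASH])) if part]

def is_well_formed_utf8(component: bytes) -> bool:
    """Optional policy check: does this component happen to be valid UTF-8?"""
    try:
        component.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same visible name, produced in an ISO-8859-1 locale and a UTF-8 one.
latin1_path = "/home/caf\u00e9".encode("latin-1")
utf8_path = "/home/caf\u00e9".encode("utf-8")

print(split_path(latin1_path))  # components are still just bytes
print(is_well_formed_utf8(split_path(latin1_path)[-1]))
print(is_well_formed_utf8(split_path(utf8_path)[-1]))
print(split_path(latin1_path)[-1] == split_path(utf8_path)[-1])
```

The two final components differ as code unit sequences even though a
user would read them as the same name, which is exactly what "unknown
codeset and encoding" leaves unresolved.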

>
>> This model works even for other character encodings, whether ISO 2022 or
>> DBCS.
> Right, but in the context in question there's no knowledge of the
> codesets/encodings in use.  I believe it's important to make that clear.
No argument.
>
>> Now, if you want to use these less formal terms, that might be fine, but you
>> should anchor them by relating them to the more formal definitions. "Just-8"
>> treats filenames as strings of code units (uninterpreted bytes that at some
>> other point are converted to code points that then can be interpreted as
>> characters and subjected to transformations like case-folding and
>> normalization).
> I might add a glossary for them.  Good idea.
Glossaries are always good, but having glosses at first mention is
important.
>
>> "Just-16" is the same for the code units of UTF-16 (that being the only pure
>> 16-bit format I can think of right now, but if others exist, the same would
>> apply for them), but which could be big-endian or little-endian.
>> Interestingly that distinction is on a level below the code unit; it belongs
>> to the "serialized code unit".
>>
>> So, what you are saying is that some parts of the architecture deal in raw
>> strings of "serialized code units" (if that part is "in memory" only, then
>> the distinction is moot and we are back at strings of code units; see
>> Section 5.1 in UTR#17).
> Correct, this is in memory, but with no knowledge of the codeset/
> encoding in use.
>
> For Internet filesystem protocols, of course, we do and will insist on
> Unicode on the wire.  My thesis is that dealing with normalization and
> case insensitivity belongs in the filesystem, not in the server (as in
> NFSv4 server).  Though it _also_ belongs in the client when it is a
> _caching_ client.  (Not every NFSv4 client will cache directory
> contents.)

You know those ramifications better. If byte order is something to worry
about, you need to address whether just-16 means 16-bit code units or
"serialized" 16-bit code units.


>
>>>> Specifically for UTF-8, this would imply that there are also no guarantees
>>>> of well-formedness of the UTF-8 strings (likewise for surrogate pairs in
>>>> UTF-16).
>>> Well, yes.  But that wasn't what I was alluding to.  I was referring to,
>>> e.g., using Linux (or *BSD, or Solaris/Illumos) in, say, an ISO-8859-1
>>> locale.  In that case you might have an application call `creat(2)` with
>>> a `pathname` that contains non-ASCII that is also not UTF-8.
>> Yes, that's implied in "8-bit code unit" - it's not something that's
>> specific to UTF-8. You have both: raw string of code units that may
>> represent UTF-8 (well-formed or not) or that may represent any other
>> encoding scheme with 8-bit code units.
> Right.  I really want to denote "8-bit code units, codeset/encoding
> unknown".

Code unit is defined in the Unicode Standard (glossary) as:

    /Code Unit <https://unicode.org/glossary/#code_unit>/. The minimal
    bit combination that can represent a unit of encoded text for
    processing or interchange. The Unicode Standard uses 8-bit code
    units in the UTF-8 encoding form, 16-bit code units in the UTF-16
    encoding form, and 32-bit code units in the UTF-32 encoding form.
    (See definition D77 in Section 3.9, Unicode Encoding Forms
    <http://www.unicode.org/versions/latest/ch03.pdf#G7404>.)

If you accept that this definition is meant more generically (and it is 
used in UTR#17 that way) then your definition of filenames is covered by 
"code unit sequence" for an "unspecified/unknown" character encoding.

And "just-8" is a code unit sequence of 8-bit code units for an unknown 
character encoding. Also, if you use the defined term "character 
encoding" by reference then you don't have to use alternate 
constructions like codeset/encoding.

Unicode defines terms that were needed because of the introduction of 
wide character code units:

  * CES (character encoding scheme) is the mapping from code units to
    bytes (think: serialization). Trivial for UTF-8 or ASCII, not so for
    UTF-16.
  * CEF (character encoding form) is the mapping from code points to
    code units. Again, trivial for ASCII, but in Unicode you get all the
    UTFs.
  * And "character encoding" is the umbrella term that says how the code
    points relate to characters (i.e. Unicode, JIS, whatever).

The need for all those definitions was that nobody, prior to Unicode,
had to deal with CEFs, and the serialization to bytes was more or less
"implicit". By all means, if someone created a parallel set of IETF
definitions for the same concepts, use those, but unless there's an RFC
that covers the same ground as exhaustively, I'd recommend citing
UTR#17. (If necessary, you can use the glossary to relate to other
industry terms.)
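
The three layers can be seen side by side in a small Python sketch. It
takes one code point, shows its code units under two encoding forms
(CEF), and then the bytes under two encoding schemes (CES). The helper
names utf8_code_units() and utf16_code_units() are made up for this
illustration; the last lines show what happens when byte order is
guessed wrongly, which is why "serialized" 16-bit code units need to be
distinguished from plain ones.

```python
import struct

def utf8_code_units(text):
    """CEF for UTF-8: each code unit is one 8-bit value."""
    return list(text.encode("utf-8"))

def utf16_code_units(text):
    """CEF for UTF-16: 16-bit code units, independent of byte order."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

ch = "\U0001F600"
print(hex(ord(ch)))                             # the code point
print([hex(u) for u in utf8_code_units(ch)])    # four 8-bit code units
print([hex(u) for u in utf16_code_units(ch)])   # a surrogate pair

# CES: the same UTF-16 code units serialized in two byte orders.
print(ch.encode("utf-16-be").hex())
print(ch.encode("utf-16-le").hex())

# Reading big-endian bytes as little-endian yields different code units.
name_be = "a\u00e9".encode("utf-16-be")
print([hex(u) for u in struct.unpack("<2H", name_be)])
print(ascii(name_be.decode("utf-16-le")))
```

For UTF-8 the serialization step is trivial, since each code unit is
already one byte; for UTF-16 it is not, as the two hex dumps show.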

>>> The C library system call stub will *not* convert this to UTF-8 or
>>> anything else, nor will the kernel-land system call know anything about
>>> the locale in use in user-land.  The given `pathname` string will just
>>> be a pointer to a zero-terminated array of `char` in user-land, and will
>>> be copied into the kernel as such by the kernel-side of the system call,
>>> then it will be passed to the VFS.
>>>
>>> The VFS will interpret just two byte values specially: 0x00 (ASCII NUL)
>>> and 0x2F (ASCII '/').  The filesystems will generally only interpret
>>> 0x00 specially, though some (e.g., ZFS) may reject strings that are not
>>> valid UTF-8.
>>>
>>> We call this "just-use-8" in some circles (e.g., in the KITTEN WG).  I
>>> don't recall who coined that.
>> It's a cute name, but it needs to be defined carefully in terms like those
>> of the Character Encoding Model. We need to strive to remove confusion, not
>> add to it.
> That's fair.  I'll add a glossary.
>
> Nico