[nfsv4] Early review of draft-dnoveck-nfsv4-internationalization-01

Nico Williams <nico@cryptonector.com> Tue, 02 June 2020 06:07 UTC

Date: Tue, 02 Jun 2020 01:07:10 -0500
From: Nico Williams <nico@cryptonector.com>
To: i18ndir@ietf.org
Cc: davenoveck@gmail.com, nfsv4@ietf.org
Message-ID: <20200602060709.GQ18021@localhost>
Reply-To: i18ndir@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/aX8TaTjCCdusaKLrnp6DwWS5Xx8>

I have reviewed draft-dnoveck-nfsv4-internationalization.

In my opinion, this draft is extremely important to the Internet
community and beyond, and should progress.  This being an early review,
perhaps I should stop there.

However, there is an important, long-running, low-volume debate to
finally settle here, and it has to be settled in the I18N community.

The architectures and realities of the relevant operating systems make
it impossible for us to practicably put the onus for I18N on the
filesystem _protocols_.  No, that onus can _only_ live in the
_filesystems_.  I cannot stress this enough.

If you stop reading here, you can take just the above paragraph with you
and consider it carefully.  If you continue reading, please forgive me
for the length of this post.

The document at hand is almost entirely dedicated to convincing the
present audience of the above premise and fact.  Most of the first ten
pages are non-normative text, and when it gets to what happens in
reality... it's essentially still informative rather than normative
text.  The I-D even modifies the meaning of RFC2119 so it can pretend to
be normative while not really being normative, all so it can continue
the fiction that I18N belongs in NFSv4 (and what about WebDAV? and SFTP?
and ...?) and not in the filesystem.

These assertions may cause friction.  Therefore I seek to convince you,
as the author tries as well, but I want to go further: I want to stop
pretending that the filesystem _protocol_ can be responsible for I18N.
Even if this viewpoint ends up on the rough side of consensus, the
running code can. not. change.  Anyone who wishes to argue that we can
only target the protocols and not the filesystems needs to consider this
fact.

The architecture of that running code has been as it is for many decades
-- almost as many decades as there has been an Internet community!

The author gets to the nub of it in section 3, which on pages 5 and 6
says (with marked elisions):

   During the period from the publication of RFC3010 [14] until now, two
   different perspectives with regard to internationalization have been
   held and represented, to varying degrees, in specifications for NFSv4
   minor versions.

   o  The perspective held by NFSv4 implementers treated most aspects of
      internationalization as basically outside the scope of what NFSv4
      client and server implementers could deal with.  This was because
      the POSIX interface treated filenames as uninterpreted strings of
      bytes, ...

   o  Within the IETF in general and in the IESG, there was a feeling
      that new protocols, such as NFSv4, could not avoid dealing with
      internationalization issues, ...

The time has now come to finally settle this debate, these 'different
perspectives'.

The essential detail that we cannot alter is the architecture of most
every general purpose operating system such as Unix, Unix-like
derivatives (e.g., BSD and derivatives), Unix-like non-derivatives
(e.g., Linux), and even Windows, as well as others.  Specifically:

 - there is a pluggable filesystem API -- the virtual filesystem
   switch (VFS);

 - filesystem protocol clients are plugins for the VFS;

 - filesystem protocol servers operate above the VFS;

 - the VFS API, and the SPI that plugins implement, are in the main
   I18N-unaware -- they are just-use-8 (BSD, Linux, Unix) or
   just-use-UTF-16 (I believe Win32 also leaves I18N to the filesystems,
   though I may be wrong about this);

 - the VFS and below are utterly unaware of the locale or even codeset
   used by application clients of that API.

Indeed, on Unix and Unix-like systems, the C library system call stubs,
the system calls themselves, and the entirety of the VFS, treat
filenames and paths as mostly-binary blobs with just two special byte
values: NUL (because these are C strings) and 0x2F (ASCII '/', because
it's the filesystem component separator as there is no array-of-
components representation of paths in the various system calls), and a
few special names in ASCII (e.g., ".", "..").

The kernel side of all of this is even less aware of user-level locale
selection (not. at. all.) than it is of user-level codeset selection
(NUL and / being special and ASCII, so only ASCII and superset codesets
need apply).
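
To illustrate how little of a name that layer inspects, here is a
minimal sketch in Python (for illustration only; real kernels do this in
C, and the function name split_path is hypothetical).  The only bytes
given any meaning are NUL and 0x2F; everything else passes through
untouched, whether or not it is valid UTF-8 and whatever its
normalization form.

```python
def split_path(path: bytes) -> list:
    """Split a path the way a locale-unaware layer would: on 0x2F only."""
    if b"\x00" in path:
        raise ValueError("NUL terminates a C string; it cannot appear in a path")
    # Empty components (from "//" or a leading "/") are dropped.
    return [component for component in path.split(b"/") if component]


# The same visible name in three different byte encodings.
nfc = "caf\u00e9".encode("utf-8")        # precomposed e-acute, UTF-8
nfd = "cafe\u0301".encode("utf-8")       # e + combining acute, UTF-8
latin1 = "caf\u00e9".encode("latin-1")   # not valid UTF-8 at all

# All three are accepted as distinct, opaque components.
components = split_path(b"/srv/" + nfc + b"/" + nfd + b"/" + latin1)
```

Nothing in this function knows or cares which codeset or locale the
caller had in mind, which is the point.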

That this set of facts is common to such diverse operating systems
should be indicative of how natural this architecture is.  It's really
quite standard to have pluggable interfaces for this sort of
functionality, and it's not at all surprising that software
architectures that evolved in the 1980s didn't account for I18N.

To be sure, there are special-purpose fileservers, of course, and those
might not have a VFS -- who knows what they do.  But that hardly matters
because it suffices that we have a decades-long history of VFS
architectures in widespread present use.  That is running code, much,
much running code.

The fact that filesystem protocol servers operate _above_ the VFS
essentially rules out implementation in, e.g., NFSv4 servers, of I18N
behaviors such as:

 - normalize on CREATE

   Sure, NFSv4 servers could, but what about POSIX and WIN32
   applications running on the same server?  What about other
   filesystem protocol servers on the same system?  They sure don't and
   won't, and we can't make them do it.

 - preserve form on CREATE and do form-insensitive matching on LOOKUP

   This could be implemented, but conflicts can't be avoided because...
   well, what about POSIX and WIN32 applications running on the same
   server?  ... (Ditto.)

 - reject non-Unicode (non-UTF-8 in the case of NFSv4)

   Sure, NFSv4 servers could, but what about POSIX and WIN32
   applications running on the same server?  ... (Ditto.)

   Should NFSv4 servers filter out non-UTF-8 filenames in READDIR??

 - apply specific mappings in case-insensitive filesystems

   (Ditto.)

There's almost no major I18N best practice that an NFSv4 fileserver can
reliably implement on a general-purpose operating system!

Just about the only I18N best practice an NFSv4 fileserver can apply is
to refuse to CREATE new non-UTF-8 filenames.
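
That check is small.  A sketch in Python, with a hypothetical helper
name (is_utf8_name); where a real server would hook this into its CREATE
path, and which error it would return, are left out on purpose.  Note
that this only tests for well-formed UTF-8; it says nothing about
normalization or unassigned code points.

```python
def is_utf8_name(name: bytes) -> bool:
    """Return True if the proposed filename is well-formed UTF-8."""
    try:
        # Strict decoding rejects stray bytes, overlong forms and
        # encoded surrogates.
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


acceptable = is_utf8_name("caf\u00e9".encode("utf-8"))
rejected = is_utf8_name("caf\u00e9".encode("latin-1"))
```

A local POSIX application on the same host can still create the
Latin-1 name directly, so the server will still see such names when
listing directories.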

So why should we have an I18N burden on NFSv4 at all?

If the above is not enough to convince the reader, then what about the
other Internet filesystem protocols, WebDAV and SFTP?

If multiple Internet filesystem protocols can (and they do) co-exist on
the same servers as NFSv4, sharing the same content, how can they have
different I18N requirements and recommendations?  The answer is obvious:
they can't.

And what about non-Internet filesystem protocols, such as:

 - Lustre
 - OpenAFS
 - Auristor
 - CIFS/SMB
 - ...

that also co-exist with Internet filesystem protocols?

Can we not advise their designers and implementors, and can we not look
to them to learn from their I18N choices?  Well, we can't impose I18N
requirements on them, no, except by proxy via the Internet filesystem
protocols they also implement (or allow), but again, that just doesn't
work.

And that brings up third-party implementations of Internet filesystem
protocols on general-purpose operating systems.  Those can't possibly
force _our_ I18N values on the platform's native non-Internet filesystem
protocols.  E.g., an SFTP server on Windows co-existing with SMB.

What a mess, no?

But there is a saving grace.

There is one unifying thread: the VFS architecture.  That I18N-unaware
layer above the actual filesystems.  It turns out that this is the key
to the puzzle.

This blissful lack of awareness of I18N at the VFS layer means we can
push I18N all the way down to the filesystem and get good results.  Some
of us reached this conclusion almost twenty years ago, when ZFS
implemented I18N in the filesystem.  Even before that, engineers at
Apple seem to have reached similar conclusions.

In fact, all the problems of filesystem I18N are relatively easy to
address if we push them into the filesystem.  Yes, different filesystem
specifications and implementations may well make different I18N choices
-- they already do anyways, and we can't exactly force them to change.

There are only a few I18N problems to address in the filesystem.  I'll
focus here only on filenames (and pathnames).  We can describe them and
specify solutions as a BCP or even Standard and hopefully those
filesystems that don't yet implement any of these I18N behaviors can get
the hint and start doing so.  These problems are:

 - Unicode equivalence

   There are two approaches in the wild:

    - normalize on CREATE (and typically also LOOKUP)

      HFS+, for example, does this.

      HFS+ normalizes to something close to NFD, while input methods
      generally produce sequences closer to NFC, at least for Latin
      scripts anyways.  Other filesystems could well go for NFC, which
      serves to illustrate that there is a variety of I18N behavior in
      the wild.

    - form-preserving on CREATE, form-insensitive on LOOKUP

      ZFS, for example, does this.  Again, diverse I18N behaviors in the
      wild.

   A third and unsatisfying approach is to do nothing.  Naturally we
   would not endorse that approach -- we might not even mention it.

 - Case mappings

   These are only relevant to case-insensitive filesystems.  It is not
   uncommon to have a single server sharing multiple different
   filesystems some of which are case-sensitive, and some of which are
   case-insensitive.

   Here the main problem is that there can be only a single set of
   mappings per-filesystem, and this set of mappings may vary by locale.
   Ergo, each case-insensitive filesystem needs to specify a locale or
   default to a sensible one.

   Note that knowing the locale of user application processes does not
   help here because it is just not possible to have different case
   mappings in the same case-insensitive filesystem for different
   users.

 - What to do about non-Unicode file names

   This is a matter of legacy.  We, the IETF, can say that Internet
   filesystem protocol servers MUST NOT allow the creation of new such
   names, but forbidding such names in the results of listing
   directories is harder.  We can even pretend legacy filesystem content
   does not exist.

   Still, there are only two sensible policies a filesystem might
   implement:

    - forbid non-Unicode;
    - allow non-Unicode, making no attempt to deal with equivalence.
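
To make the first two items concrete, here is a rough sketch of a
directory that is form-preserving on CREATE and form-insensitive on
LOOKUP, with optional case-insensitivity tied to a single per-filesystem
locale.  It is Python, for illustration only.  The class Directory, its
parameters, and the tiny Turkic tailoring table are all hypothetical and
are not taken from ZFS, HFS+ or any other real filesystem, and the
choice of NFD for the comparison key is arbitrary (NFC would serve as
well).  It approximates, rather than fully implements, Unicode caseless
matching.

```python
import unicodedata

# Hypothetical tailoring: Turkish and Azeri pair I with dotless i
# (U+0131) and dotted capital I (U+0130) with i.
TURKIC_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


class Directory:
    def __init__(self, case_insensitive=False, locale="und"):
        self.case_insensitive = case_insensitive
        self.locale = locale      # fixed per filesystem, never per user
        self.entries = {}         # comparison key -> name as created

    def _key(self, name):
        # Compose first so a decomposed dotted capital I is tailored too.
        key = unicodedata.normalize("NFC", name)
        if self.case_insensitive:
            if self.locale in ("tr", "az"):
                key = key.translate(TURKIC_LOWER)
            key = key.casefold()
        return unicodedata.normalize("NFD", key)

    def create(self, name):
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = name  # stored exactly as the caller sent it

    def lookup(self, name):
        return self.entries[self._key(name)]


# Form-preserving, form-insensitive: created precomposed, found decomposed.
plain = Directory()
plain.create("caf\u00e9")
found = plain.lookup("cafe\u0301")

# One set of case mappings per filesystem, chosen by its locale.
turkish = Directory(case_insensitive=True, locale="tr")
turkish.create("FILE")
found_tr = turkish.lookup("f\u0131le")   # matches; "file" would not
```

A normalize-on-CREATE filesystem would differ only in storing the
normalized name instead of the name as sent.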

A document that explains all of the above and correctly addresses I18N
requirements mainly at filesystems can be shorter than the document I
just reviewed, and can avoid the uncomfortable attempt at providing
alternate definitions of RFC2119 terms.  Let us do that.  I volunteer to
author or edit such a document if need be.

All that said, there is one way in which I18N does apply specifically to
NFSv4: in non-filename Unicode strings, such as the name@domain
representation of users and groups in access control lists (ACLs).
Fortunately there is no controversy about that, or the choices made in
NFSv4 regarding those, and nothing more need be said about that.

Nico
--