Re: [apps-discuss] I-D Action: draft-ietf-appsawg-file-scheme-09.txt

Mark Nottingham <> Fri, 20 May 2016 08:55 UTC

From: Mark Nottingham <>
Date: Fri, 20 May 2016 18:55:15 +1000
To: Matthew Kerwin <>
Cc: IETF Apps Discuss <>
Subject: Re: [apps-discuss] I-D Action: draft-ietf-appsawg-file-scheme-09.txt

> On 20 May 2016, at 4:37 PM, Matthew Kerwin <> wrote:
>> >    Without other encoding information, percent-encoded octets in a file
>> >    URI ([RFC3986], Section 2.1) MAY be interpreted according to the
>> >    preferred or configured encoding of the system on which the URI is
>> >    being interpreted.
>> Do the current implementations of file:// do this -- i.e., use the filesystem's encoding for the URI?
> Apparently. I don't have a spare drive lying around where I can reformat a partition to test it for myself, though. A discussion I had with Dave Thaler back at the very start of this draft revolved around the fact that percent-encoded URIs are ambiguous (apparently a real issue for Windows), which was why for a very long time the draft contained advice to use an IRI instead, or at the least normalize.

VMs are good for testing.

It appears that both Windows and OS X have used Unicode for file names, rather than a locale-dependent encoding, for some time (since NT for the former, 10.0 for the latter, AIUI). See: <>

Linux uses whatever locale is set. However, it appears that both GNOME and Firefox (at least) try to be 'smart' and will recognise UTF-8 even if ISO-8859-1 is set as the locale. Having said that, it's not too smart; if I try to open a file with a UTF-8 encoded name, Firefox can't find it when the locale isn't UTF-8 (although the file chooser *does* see it).

This is an important point; the advice above that they "MAY be interpreted according to the preferred or configured encoding of the system on which the URI is being interpreted" doesn't account for the fact that a single filesystem might have several users who have different encodings set*.
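
To make the ambiguity concrete, here is a minimal Python sketch (the path is made up for illustration): the same percent-encoded octets give a different file name depending on which encoding the interpreting user happens to have configured.

```python
from urllib.parse import unquote

# Hypothetical URI path; "%C3%A9" is the UTF-8 encoding of "é".
path = "/home/user/caf%C3%A9.txt"

# Interpreted by a user whose configured encoding is UTF-8.
as_utf8 = unquote(path, encoding="utf-8")

# Interpreted by a user whose configured encoding is ISO-8859-1.
as_latin1 = unquote(path, encoding="iso-8859-1")

print(as_utf8)    # /home/user/café.txt
print(as_latin1)  # /home/user/cafÃ©.txt
```

Both results are "valid" under the MAY in the draft text, but only one of them names the file that is actually on disk.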

Other encodings seem to just be percent-encoded straight into the file URI, and Firefox doesn't make any attempt to display them as IRIs. 

(This seems to mirror how most browsers handle non-ASCII characters in HTTP headers, e.g., Location; they just percent-encode them, since that's an encoding of bytes, not characters.)

Can we say something more like this?

When a file URI is produced, characters not allowed by the ABNF MUST be encoded as UTF-8 and the resulting octets percent-encoded, as per RFC3986 Section 2.5.

However, encoding information for file and/or directory names might not be available. In these cases, implementations MAY use heuristics to determine the encoding. If that fails, they SHOULD percent-encode the raw bytes of the label directly.
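
As a rough illustration of how a producer might follow that text, here is a Python sketch. The helper name `name_to_uri_segment` and the "just try UTF-8" heuristic are my assumptions, not anything in the draft, and it conservatively percent-encodes everything outside the unreserved set, which is more than the ABNF strictly requires.

```python
from typing import Optional
from urllib.parse import quote


def name_to_uri_segment(raw: bytes, encoding: Optional[str] = None) -> str:
    """Hypothetical helper: turn one file or directory name into a path segment.

    raw      -- the name as stored by the filesystem
    encoding -- the name's encoding, if known; None if unavailable
    """
    # Assumed heuristic: if no encoding is known, guess UTF-8.
    candidates = [encoding] if encoding else ["utf-8"]
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        # Characters are known (or guessed): encode as UTF-8, then
        # percent-encode the resulting octets.
        return quote(text, safe="", encoding="utf-8")
    # Encoding unknown and the heuristic failed: percent-encode raw bytes.
    return quote(raw, safe="")


# Name stored as ISO-8859-1, with the encoding known.
print(name_to_uri_segment("café.txt".encode("iso-8859-1"), "iso-8859-1"))
# Name stored as UTF-8, encoding not known; the heuristic succeeds.
print(name_to_uri_segment("café.txt".encode("utf-8")))
# Name stored as ISO-8859-1, encoding not known; the heuristic fails.
print(name_to_uri_segment(b"caf\xe9.txt"))
```

The first two calls give `caf%C3%A9.txt`; the last falls through to the raw bytes and gives `caf%E9.txt`.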


* Possible but unlikely, since most people are going to be using UTF-8. Still...

Mark Nottingham