Re: [pkix] [apps-discuss] character repertoire for fragment identifiers, was: Fwd: FW: New Version Notification for draft-kerwin-file-scheme-13.txt

Sean Leonard <dev+ietf@seantek.com> Sun, 11 January 2015 22:11 UTC

Message-ID: <54B2F4C3.5020008@seantek.com>
Date: Sun, 11 Jan 2015 14:10:11 -0800
From: Sean Leonard <dev+ietf@seantek.com>
To: Sam Ruby <rubys@intertwingly.net>, Julian Reschke <julian.reschke@gmx.de>, Mark Nottingham <mnot@mnot.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/pkix/BeH7og4MCcR4bILcuCdnEx3EPpo>
Cc: "pkix@ietf.org" <pkix@ietf.org>, apps-discuss@ietf.org

[Adding pkix@]
On the two intertwined points:

On 1/11/2015 8:28 AM, Julian Reschke wrote:
> On 2015-01-11 17:19, Sam Ruby wrote:
>> ...
>>> Now suffering from information overflow.
>>>
>>> We were discussing RFC 3986. Which *ASCII* characters that are
>>> currently forbidden in fragment identifiers do you want to allow?
>>
>> We seem to be in a loop.
>>
>> To you, RFC 3986 (including potential errata and/or bis work) implies
>> ASCII.
>>
>> I point out that that restriction does not seem to make sense for
>> fragments.
>
> Actually, you did not. Instead you pointed to lots of material 
> somewhere else that I didn't want to parse just to find out what 
> you're proposing.
>
>> You return back to asking about ASCII.
>
> Because that's what RFC 3986 is concerned with.
>
>> To help break this loop, permit me to turn this question around.
>> Restricting the scope of the conversation to fragments, why do you
>> believe it makes sense to limit such to ASCII only?  What problems do
>> non-ASCII characters in fragments cause?
>
> I fail to see how fragments are special here compared to, say, the path.
>
> I fully agree that work on what we used to call IRIs is something that 
> needs to be done. However right now I'm trying to figure out what's 
> wrong with RFC *3986*.
>
> I hear you saying that URIs should allow non-ASCII characters in 
> certain places. This may break code where these characters are put on 
> the wire (such as HTTP) or stored in places that do not allow 
> non-ASCII characters (say, a database column).
>
> The fact that it's hard to extend the URI character repertoire beyond 
> ASCII is why the IETF attempted to do it in a separate spec/construct. 
> I'm not convinced that the situation has changed sufficiently to 
> invalidate that approach.

On 1/11/2015 8:45 AM, Sam Ruby wrote:
> On 01/11/2015 11:29 AM, Julian Reschke wrote:
>> On 2015-01-11 17:25, Mark Nottingham wrote:
>>> I and others have brought that up. What’s interesting is that they say
>>> it’s reasonably interoperable with deployed implementations.
>>>
>>> Cheers,
>>> ...
>>
>> Let me guess: "deployed implementations" == "what current browsers do".
>
> Your sarcasm is not appreciated.
>
> I encourage you to actually look at test results:
>
> https://url.spec.whatwg.org/interop/test-results/7b83ef3682

***

I fully agree with Julian on the matter of US-ASCII for URIs. URIs (RFC 
3986) are only made of US-ASCII characters.

If someone wishes to extend URIs (as opposed to IRIs or whatever) to 
include non-US-ASCII characters, that's a problem for web browsers and 
all other Internet software alike. This goes exactly to my point about 
protocol slots.

Certificates, CRLs, and other security objects are just as fundamentally 
a part of the Web (and web browser) infrastructure as HTML. In 
X.509/PKIX security objects, the GeneralName uniformResourceIdentifier 
construct is US-ASCII only (IA5String). If you extend "URIs" to be 
beyond US-ASCII, RFC 5280 has to be updated...and all the security 
libraries that depend upon it. Just because HTML(5) can be served as 
UTF-8 or use &amp; encoding or whatever, doesn't make the problem go away.

Do the URL interop test results explicitly test certificates? I 
suggest attempting to put some non-US-ASCII characters in a GeneralName 
protocol slot (say, for revocation) and see what happens.
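
To make the protocol-slot constraint concrete, here is a minimal sketch in Python. It assumes that "fits in an IA5String" can be approximated as "encodes cleanly as 7-bit US-ASCII"; the function name and both example URIs are hypothetical, and no real certificate library is involved.

```python
def fits_ia5string(value: str) -> bool:
    """Return True if every character is 7-bit US-ASCII.

    Assumption: this approximates what an IA5String slot, such as the
    GeneralName uniformResourceIdentifier, is able to carry.
    """
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical CRL distribution point URIs, for illustration only.
ascii_uri = "http://crl.example.com/ca.crl"
unicode_uri = "http://crl.example.com/巴泰勒米.crl"

print(fits_ia5string(ascii_uri))    # plain RFC 3986 URI
print(fits_ia5string(unicode_uri))  # contains non-US-ASCII characters
```

A percent-encoded form of the second URI would pass the same check, which is the point: the slot can carry an RFC 3986 URI, but not the raw Unicode string.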

HTML 4.01 is at least consistent in saying (for its time) that hrefs and 
other things are URIs. For interoperable behavior, use US-ASCII 
characters only and stick with % encoding.

The security angle brings up another problem: the interoperable 
transcription of URIs across systems. The ASCII range is a limited 
repertoire, so it is easy to write it out unambiguously on paper, 
display it on a TV screen, say it over the radio or a public service 
announcement, or memorize it on your smartphone, in order to type it 
into your web browser, the command-line, or any other system of choice.

If you allow the enormous (and ever-expanding) range of Unicode 
characters in "URIs", all of those use cases become fundamentally 
ambiguous, inviting homograph attacks. Which smiley face out of nearly a 
hundred smiley emoji do you mean when you say "http://foo.com/😋"? How 
about a URI containing "ῗ" (U+1FD7 GREEK SMALL LETTER IOTA WITH 
DIALYTIKA AND PERISPOMENI)--which normalization form, composed or 
decomposed? What if the combining accent mark code points are in a 
different order?
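
The normalization question can be illustrated with a short Python sketch using only the standard library. It assumes UTF-8 percent-encoding as the way such a character would end up in a URI; the variable names are mine.

```python
import unicodedata
from urllib.parse import quote

ch = "\u1fd7"  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI

nfc = unicodedata.normalize("NFC", ch)  # precomposed form
nfd = unicodedata.normalize("NFD", ch)  # base letter plus combining marks

# Same visible character, two different code point sequences.
print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])

# Percent-encoding the two forms yields two different ASCII strings,
# so a byte-wise URI comparison treats them as different identifiers.
print(quote(nfc))
print(quote(nfd))

# Same base letter, but with the two combining marks in the other order.
swapped = "\u03b9\u0342\u0308"
print(unicodedata.normalize("NFC", swapped) == nfc)
```

Someone transcribing the character from paper or a screen has no way to know which of these sequences the original URI used.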

***
I have empathy for what Sam/the W3C wants, since the HTML protocol slots 
basically beg to be filled with Unicode strings like <a 
href="http://zh.wikipedia.org/wiki/巴泰勒米·波岡達"> (instead of <a 
href="http://zh.wikipedia.org/wiki/%E5%B7%B4%E6%B3%B0%E5%8B%92%E7%B1%B3%C2%B7%E6%B3%A2%E5%B2%A1%E9%81%94">).

But maybe the more interoperable approach is to define a format and 
mechanism (e.g., IRIs, or something like IRIs v2) to map /from/ the 
Unicode-capable protocol slots, /to/ the well-standardized RFC 3986 URI 
format.
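
As a rough sketch of what such a mapping might look like, the Python below percent-encodes the UTF-8 bytes of the path, query, and fragment. The function name iri_to_uri is hypothetical, the host is assumed to be US-ASCII already (internationalized host names would need separate handling), and this is not a full implementation of any IRI specification.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: RFC 3986 sub-delims, ":" and "@", plus "%"
# so that existing percent-escapes are not encoded a second time.
_PATH_SAFE = "/:@!$&'()*+,;=%"
_QUERY_FRAGMENT_SAFE = _PATH_SAFE + "?"


def iri_to_uri(iri: str) -> str:
    """Map a Unicode-bearing identifier to a US-ASCII URI (sketch only)."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumption: host is already US-ASCII
        quote(parts.path, safe=_PATH_SAFE),
        quote(parts.query, safe=_QUERY_FRAGMENT_SAFE),
        quote(parts.fragment, safe=_QUERY_FRAGMENT_SAFE),
    ))


print(iri_to_uri("http://zh.wikipedia.org/wiki/巴泰勒米·波岡達"))
```

Applied to the href above, this should produce the percent-encoded form quoted above, and an input that is already an RFC 3986 URI passes through unchanged.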

My 2¢.

Sean
