Re: [apps-discuss] URL definitions and draft-ruby-url-problem

Sam Ruby <rubys@intertwingly.net> Fri, 19 December 2014 19:47 UTC

Message-ID: <549480BD.6080309@intertwingly.net>
Date: Fri, 19 Dec 2014 14:47:09 -0500
From: Sam Ruby <rubys@intertwingly.net>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
References: <B53877D1-0996-448F-982D-4536805F2B1E@vpnc.org> <00o89a147re95aor21u3l9a7aarrhg0vts@hive.bjoern.hoehrmann.de>
In-Reply-To: <00o89a147re95aor21u3l9a7aarrhg0vts@hive.bjoern.hoehrmann.de>
Archived-At: http://mailarchive.ietf.org/arch/msg/apps-discuss/itrBpLovEnKxe_Rka9WO_vnisJg
Cc: Paul Hoffman <paul.hoffman@vpnc.org>, Apps Discuss <apps-discuss@ietf.org>

On 12/19/2014 01:41 PM, Bjoern Hoehrmann wrote:
> * Paul Hoffman wrote:
>> This seems like an important document for us to look at, and possibly
>> adopt. Section 3 is pretty scary, and section 4 seems like a very
>> reasonable solution.
>
> I have reviewed this document. Sections 1 and 2 seem reasonable to me.

Thanks!

> Section 3 has
>
>     The main problem is conflicting specifications that overlap but don't
>     match each other.
>
>     Additionally, the following are issues that need to be resolved to
>     make URL processing unambiguous and stable.
>
>     o  Nomenclature: over the years, a number of different sets of
>        terminology has been used.  URL / URI / IRI is not the only
>        difference.  [tantek-slice] chronicles a number of differences.
>
> The latter refers to differences among APIs for manipulating resource
> identifiers. I do not think that is a problem and does not need solving.

Can I ask why not?

To be clear, I'm not suggesting that everybody adopt the same terms.  I 
am suggesting that somewhere we document what terms are in use and how 
they map.
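
A minimal sketch of what such a term map might look like, in Python.
The choice of vocabularies (RFC 3986 component names, WHATWG URL API
attribute names, and Python's urllib.parse attribute names) is an
assumption made for illustration and is not taken from [tantek-slice];
TERM_MAP and translate are hypothetical names.

```python
# Hypothetical term map, keyed by RFC 3986 component name.
# Each value is (WHATWG URL API attribute, Python urllib.parse attribute).
# None means that vocabulary has no single equivalent term.
TERM_MAP = {
    "scheme":    ("protocol", "scheme"),    # protocol includes the trailing ":"
    "authority": (None,       "netloc"),
    "host":      ("hostname", "hostname"),
    "port":      ("port",     "port"),
    "path":      ("pathname", "path"),
    "query":     ("search",   "query"),     # search includes the leading "?"
    "fragment":  ("hash",     "fragment"),  # hash includes the leading "#"
}

VOCABULARIES = {"whatwg": 0, "python": 1}


def translate(rfc3986_term, vocabulary):
    """Return the name used by `vocabulary` for an RFC 3986 component."""
    return TERM_MAP[rfc3986_term][VOCABULARIES[vocabulary]]


print(translate("query", "whatwg"))
print(translate("authority", "python"))
```

A flat table like this only records that terms correspond.  It does not
capture the cases where two vocabularies delimit a component
differently, which is why the comments note the extra ":" / "?" / "#".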

>     o  Parameterization: standards in this area need to define such
>        matters as normalization forms and values for parameters such as
>        UseSTD3ASCIIRules.
>
> Where the relevant standards allow implementations to choose options, it
> is usually because there are good reasons to do so. Implementers ought
> to document their choices properly, and it is a good thing when similar
> implementations make the same choices, it might even be useful to have a
> specification saying "Web browsers must use these options: A, B, C", but
> that would just be a matter of doing it. So I am not sure what problem
> needs solving in this regard.

First, it is not just web browsers.  It is also the runtime libraries
that accompany various programming languages, and the code embedded in
word processors and web servers.  I don't believe that having URLs mean
different things depending on the context is in anybody's best interest.

At a minimum, we should consider clearly documenting the set of URLs
that may be interpreted differently in different contexts.  Perhaps
we should identify those URLs as non-conforming.

However, that's not enough.  Consider ICANN-approved non-ASCII domain
names.  Different RFC 3986-compliant libraries handle https URLs with
such hostnames differently.  We can't simply tell people to avoid such
names.
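
To make the kind of divergence concrete, here is a small Python sketch.
It uses only the standard library; the host name is made up, and the
"idna" codec used here implements IDNA 2003 (RFC 3490), which is only
one of several possible choices.

```python
from urllib.parse import urlsplit

# "b\u00fccher.example" is "bücher.example", a made-up non-ASCII host.
url = "https://b\u00fccher.example/"

# urlsplit does a generic split and leaves the host as Unicode text.
raw_host = urlsplit(url).hostname

# Python's built-in "idna" codec converts each label to its ASCII
# (Punycode) form.
ascii_host = raw_host.encode("idna").decode("ascii")

print(raw_host)
print(ascii_host)
```

A consumer that stops after the generic split sees a different host
string than one that also applies the IDNA conversion, even though both
started from the same URL.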

>     o  Interoperability: even after accounting for the above, there is a
>        demonstrable lack of interoperability across popular libraries and
>        browsers.  [whatwg-interop] identifies a number of such
>        differences.
>
> There are different classes of problems in this regard, e.g. there may
> be existing requirements that are widely ignored, there may be ambiguous
> requirements interpreted differently across implementations, there may
> be implementation-defined behavior that varies across implementations in
> a harmful manner, or there may be widely deployed behavior where further
> standardisation might be useful, to mention a few. I think it would be
> helpful to discuss these classes of problems separately.

I agree!  To seed the discussion, I offer the following web page with a 
number of interesting test cases:

https://url.spec.whatwg.org/interop/test-results/

Feel free to propose additional test cases, suggest categorizations,
recommend changes to either the specs or the libraries, or even name
additional libraries that should be included.

>     o  Specific scheme definitions: some UR* scheme definitions are
>        woefully out of date, incomplete, or don't correspond to current
>        practice, but updating their definitions is unclear.  This
>        includes "file:", for which there is a current effort, but there
>        are others which need review (including 'ftp:', 'data').
>
> An open question here seems to be how to separate concerns. Can updating
> specifications for individual schemes be done independently? If so, that
> also would seem to simply require somebody doing it, so I am not sure of
> the problem indicated here.

It is important to recognize that every modern programming language is
going to have a URI or URL parse function, method, subroutine, or
whatever.  While it may make sense to allow new schemes to impose
additional validity requirements or additional semantics specific to
their scheme, it is in everybody's best interest that we define a
standard way to parse URLs that use unknown schemes.

Note: I am *NOT* suggesting that what currently is in the WHATWG URL 
Living Standard is that definition.  I think we need something a bit 
closer to what RFC 3986 defines.

I welcome input on what that behavior should be.  Let's work on it 
together.  Meanwhile, 
https://www.w3.org/Bugs/Public/show_bug.cgi?id=27233 is open on this issue.
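
For reference, a scheme-independent split along RFC 3986 lines can be
sketched in Python with the regular expression given in RFC 3986,
Appendix B.  This is only an illustration of the generic approach, not a
proposal for the final behavior; generic_split is a hypothetical name,
and the sample URL is the one used in RFC 3986, Section 3.

```python
import re
from urllib.parse import urlsplit

# Regular expression from RFC 3986, Appendix B.
RFC3986_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def generic_split(reference):
    """Split a URI reference into components without knowing its scheme."""
    m = RFC3986_SPLIT.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


sample = "foo://example.com:8042/over/there?name=ferret#nose"
parts = generic_split(sample)
print(parts)

# For this input, the standard library's generic parser agrees.
stdlib = urlsplit(sample)
print(stdlib.scheme, stdlib.netloc, stdlib.path, stdlib.query, stdlib.fragment)
```

The "foo" scheme is unknown to both pieces of code, yet the split is
well defined because it depends only on the delimiters.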

> As for the "Outline of Potential Solution" in section 4, I agree that a
> plan should be built. How many specifications, for instance, should we
> have, what would be their scope, and what would be their contents? With-
> out a plan, considering the other points in the section seems premature;
> perhaps some of them are reasonable things to do independently of any
> plan; in that case, they could simply be done.

There is indeed a big chicken and egg problem here.

My best guess at this moment is that RFC 3987 needs to be retired and
work needs to start on an RFC 3986bis, and that probably needs a new
Working Group.  And for that to happen, we need the discussion we are
currently having.

I will say that I'm willing to work with anybody, anywhere.  If work on
an RFC 3986bis starts, I will contribute.

I also am not looking to make this somebody else's problem.  If it means 
I need to step up to be that editor, I will do that too.  And in the 
process, I will actively encourage others to contribute, and by that I 
mean GitHub pull requests.

But I will also say that while the high-level discussion should happen
now, in parallel with the technical discussion of how various classes
of input strings should be parsed, it is the latter (parser) discussion
that deserves the deep dive.

Once we have a starter list of proposed changes, we can circle back
and determine whether a set of errata is sufficient or if the overhead
of an entire Working Group is required.

However, that is just my preference.  Should you be interested in 
suggesting changes to draft-ruby-url-problem, I encourage pull requests 
for that too.  Julian Reschke has already submitted a few.  You could be 
next!

- Sam Ruby