Re: [apps-discuss] Fun with URLs and regex

Sam Ruby <rubys@intertwingly.net> Thu, 08 January 2015 16:55 UTC

Message-ID: <54AEB660.1020701@intertwingly.net>
Date: Thu, 08 Jan 2015 11:54:56 -0500
From: Sam Ruby <rubys@intertwingly.net>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Mark Nottingham <mnot@mnot.net>, Matthew Kerwin <matthew@kerwin.net.au>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net>
In-Reply-To: <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/pLPD4BSGWWmqw52m9dsDrJug0-U>
Cc: Alex Russell <slightlyoff@google.com>, Domenic Denicola <d@domenic.me>, IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex
Precedence: list

On 01/08/2015 10:07 AM, Mark Nottingham wrote:
> Fixed, thanks.
>
> I’m hoping to put this into a proper github project soon and refactor it, to make reading it and making contributions easier.

First, Mark: thanks for doing this!

Would you consider putting it in https://github.com/webspecs/url,
perhaps in the evaluate directory?

Wherever it is placed, I plan to write scripts that make use of this
information.

For context (and in full disclosure, I discussed much of this F2F with
Mark yesterday at the TAG meeting, but am including the background here
for everybody's benefit):

Different people care about different things, and that's all right. As
an example, I was able to surprise Alex Russell and Domenic Denicola
(both Google employees) by showing that the following expression
produces different results on Chrome on Windows than on Chrome on
OS X:

new URL("file:c|foo").pathname

Why does this matter? Well, in the case of web browsers, developers of
those web browsers want content that shows up on the web to behave
interoperably, both across platforms and across competing
implementations. More specifically, if you can construct a web page
(including JavaScript) that produces different results on Internet
Explorer on Windows vs Apple's Safari on OS X, then that's a problem.

In the case of libraries like Node.js, there is a desire to parse URLs
just like web browsers do. Given that Domenic cares about Node.js and
Chrome, I can point him to data such as the following:

https://url.spec.whatwg.org/interop/test-results/?select=nodejs&baseline=chrome

Looks like there is quite a bit of yellow on that page.

Since I care about the URL Standard, I converted its prose into
executable form so that it could be compared. This allows me to very
easily produce a similar page comparing nodejs and (by proxy) the URL
Standard:

https://url.spec.whatwg.org/interop/test-results/?select=nodejs

With this data, we can triangulate and have a discussion, grounded in
real data, as to whether Node.js should change or whether the URL
Standard should change.
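
A comparison of this kind can be sketched in a few lines of Python.
This is only an illustration: the function name `differences` and the
sample dictionaries are hypothetical placeholders, not the code or the
data behind the pages linked above. Each dictionary maps a test input
to the string an implementation produced for it.

```python
def differences(selected, baseline):
    """Return {input: (selected_output, baseline_output)} for every
    input where the two implementations disagree. An input missing
    from one side is reported with None for that side."""
    diffs = {}
    for url in sorted(set(selected) | set(baseline)):
        a = selected.get(url)
        b = baseline.get(url)
        if a != b:
            diffs[url] = (a, b)
    return diffs


# Placeholder data invented for illustration; not real test results.
impl_a = {"input-1": "out-x", "input-2": "out-y"}
impl_b = {"input-1": "out-x", "input-2": "out-z", "input-3": "out-w"}

for url, (a, b) in differences(impl_a, impl_b).items():
    print(url, a, b)
```

Adding a third dictionary for the specification's reference results and
running the same function pairwise gives the triangulation described
above.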

Mark cares about valid URIs. He's certainly not alone in that. What he
has done is express his interests not merely in high-level prose, but in
concrete, executable form. Given that he has done that, I can pose some
interesting questions. For example, if you consider the process of
canonicalizing an href value on an <a> element and stringifying the
result, an implementation like Chrome will produce something that will
be sent across the wire. I've captured the results here:

https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome

Given that data and Mark's script, I can produce a list of outputs that
Mark doesn't consider valid:

"http://f:21/%20b%20?%20d%20# e": false,
"http://example.com/foo%": false,
"http://2001::1]/": false,
"http://[www.google.com]/": false,
"http://f:fifty-two/c": false,
"http://foo:-80/": false,
"http://example.com/foo/.%2": false,
"http://2001::1/": false,
"http://[google.com]/": false,
"http://example.org/foo/bar#\\": false,
"http://example.org/foo/[61:24:74]:98": false,
"gopher://example.com/": false,
"http://example.org/foo/bar#\u03b2": false,
"http://example.com/foo%2%C3%82%C2%A9zbar": false,
"gopher://foo/": false,
"http://example.com/foo%2zbar": false,
"http://f:b/c": false,
"a: foo.com": false,
"http://[1::2]:3:4/": false,
"http://example.org/foo/[61:27]/:foo": false,
"http://f:%2021%20/%20b%20?%20d%20# e": false,
"http://example.com/foo%2": false,
"http://foo/path;a??e#f#g": false,
"data:example.com/": false,
"http://www.google.com/foo?bar=baz# \u00bb": false,
"http://f:%20/c": false,
"http://:www.example.com/": false,
"data:/example.com/": false

With this data, we can have a discussion as to whether Mark's script
should be updated, or Chrome should change, or some spec should change.
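
To make the filtering step concrete, here is a minimal Python sketch.
It is not Mark's script: `passes_basic_checks`, `rejected`, and the
pattern are hypothetical names chosen for this example, and the pattern
tests only two necessary conditions taken from RFC 3986 (every
character is one the RFC allows somewhere in a URI, and every "%" is
followed by two hex digits). A real validator would also check
structure, such as port syntax and bracketed hosts, so this sketch
accepts some strings that the list above rejects.

```python
import re

# unreserved, gen-delims and sub-delims from RFC 3986, plus well-formed
# percent-encoded triplets. Assumption: a character-level check only.
_URI_CHARS = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)


def passes_basic_checks(uri):
    """True if uri uses only RFC 3986 characters and valid %XX escapes."""
    return _URI_CHARS.fullmatch(uri) is not None


def rejected(outputs):
    """Map each output that fails the basic checks to False,
    mirroring the shape of the list above."""
    return {u: False for u in outputs if not passes_basic_checks(u)}


# A few strings taken from the list above, plus one that passes.
sample = [
    "http://example.com/",
    "http://example.com/foo%",
    "http://example.com/foo%2zbar",
    "http://example.org/foo/bar#\u03b2",
]

for uri in rejected(sample):
    print(uri)
```

Swapping `passes_basic_checks` for a stricter validator leaves the rest
of the filter unchanged, which is the point of expressing the check as
code.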

I have taken this a step further. I wrote a script that will convert
Mark's script from Python to JavaScript. This means that he can
maintain his script in Python and I can include an equivalent script on
web pages and use that information to filter out things that aren't
interesting or highlight things that are.

Some people here may have different interests, and that is OK too. I
encourage everybody to find a way to express their interests in the form
of code. While my preference is JavaScript, I'm otherwise pretty
agnostic. XSLT, Perl, PHP, C#: doesn't matter to me.

If you do so, I'll find a way to include that as a filter or as a base
for comparison. Doing so means that as the test suite grows, or
implementation results change(*), you will be able to get instant
results to the questions that interest you.

- Sam Ruby

(*) As an example, I'm trying to get updated results for IE. Current
status:

http://intertwingly.net/blog/2015/01/08/Ununzippable-Modern-IE
