Re: [apps-discuss] Fun with URLs and regex

Sam Ruby <rubys@intertwingly.net> Thu, 08 January 2015 16:55 UTC

Message-ID: <54AEB660.1020701@intertwingly.net>
Date: Thu, 08 Jan 2015 11:54:56 -0500
From: Sam Ruby <rubys@intertwingly.net>
To: Mark Nottingham <mnot@mnot.net>, Matthew Kerwin <matthew@kerwin.net.au>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net>
In-Reply-To: <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/pLPD4BSGWWmqw52m9dsDrJug0-U>
Cc: Alex Russell <slightlyoff@google.com>, Domenic Denicola <d@domenic.me>, IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

On 01/08/2015 10:07 AM, Mark Nottingham wrote:
> Fixed, thanks.
>
> I’m hoping to put this into a proper github project soon and refactor it, to make reading it and making contributions easier.

First, Mark: thanks for doing this!

Would you consider putting it in https://github.com/webspecs/url, 
perhaps in the evaluate directory?

Wherever it is placed, I plan to write scripts that make use of this 
information.

For context (and in full disclosure, I discussed much of this F2F with 
Mark yesterday at the TAG meeting, but am including the background here 
for everybody's benefit):

Different people care about different things, and that's all right.  As
an example, I was able to surprise Alex Russell and Domenic Denicola
(both Google employees) by showing that the following expression
produced different results on Chrome on Windows as compared to Chrome on
OS X:

   new URL("file:c|foo").pathname

Why does this matter?  Well, in the case of web browsers, developers of
those web browsers want content that shows up on the web to behave
interoperably, both across platforms and across competing
implementations.  More specifically, if you can construct a web page
(including JavaScript) that produces different results on Internet
Explorer on Windows vs Apple's Safari on OS X, then that's a problem.
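
To make that kind of check repeatable, the expression above can be
wrapped in a small script and run unchanged on each platform.  This is
only a sketch: probePathnames is a name I made up for illustration, and
it assumes an environment that exposes the URL constructor globally (a
browser console or a recent Node.js).  No expected output is shown,
because the result is exactly what differs between implementations.

```javascript
// Hypothetical helper (not part of any spec or library): parse each input
// with the platform's URL constructor and record the pathname, or the
// name of the error if the input is rejected.
function probePathnames(inputs) {
  const results = {};
  for (const input of inputs) {
    try {
      results[input] = new URL(input).pathname;
    } catch (err) {
      results[input] = `error: ${err.name}`;
    }
  }
  return results;
}

// Run on each platform and compare the printed JSON by hand or by script.
console.log(JSON.stringify(probePathnames(["file:c|foo"]), null, 2));
```

Recording failures instead of letting them throw keeps the output the
same shape everywhere, so two platforms can be compared key by key.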

In the case of libraries like Node.js, there is a desire to parse URLs
just like web browsers do.  Given that Domenic cares about Node.js and
Chrome, I can point him to data such as the following:

https://url.spec.whatwg.org/interop/test-results/?select=nodejs&baseline=chrome

Looks like there is quite a bit of yellow on that page.

Since I care about the URL Standard, I converted the prose into
executable form so that it could be compared.  This allows me to very
easily produce a similar page comparing Node.js and (by proxy) the URL
Standard:

https://url.spec.whatwg.org/interop/test-results/?select=nodejs

With this data, we can triangulate and have a discussion, grounded in
real data, as to whether Node.js should change or whether the URL
Standard should change.
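
The comparison behind those pages can be sketched roughly as follows.
This is not the code that generates the pages; diffResults and the
shape of the result maps (input string mapped to serialized output) are
assumptions I'm making for illustration, and the sample values are
placeholders rather than measured results.

```javascript
// Hypothetical comparison: given two maps from input to serialized output,
// list every input on which the two implementations disagree.
function diffResults(left, right) {
  const differences = [];
  const inputs = new Set([...Object.keys(left), ...Object.keys(right)]);
  for (const input of inputs) {
    if (left[input] !== right[input]) {
      differences.push({ input, left: left[input], right: right[input] });
    }
  }
  return differences;
}

// Placeholder data, NOT real test results.
const implementationA = { "input-1": "same", "input-2": "a-only-form" };
const implementationB = { "input-1": "same", "input-2": "b-only-form" };
console.log(diffResults(implementationA, implementationB));
```

An input that is missing from one side shows up as a difference with an
undefined value, which is usually what you want when one implementation
has not been run against the newest tests yet.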

Mark cares about valid URIs.  He's certainly not alone in that.  What he 
has done is express his interests not merely in high level prose, but in 
concrete, executable form.  Given that he has done that, I can pose some 
interesting questions.  For example, if you consider the process of 
canonicalizing an href value on an <a> element and stringifying the
result, an implementation like Chrome will produce something that will 
be sent across the wire.  I've captured the results here:

https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome

Given that data and Mark's script, I can produce a list of outputs that 
Mark doesn't consider valid:

   "http://f:21/%20b%20?%20d%20# e": false,
   "http://example.com/foo%": false,
   "http://2001::1]/": false,
   "http://[www.google.com]/": false,
   "http://f:fifty-two/c": false,
   "http://foo:-80/": false,
   "http://example.com/foo/.%2": false,
   "http://2001::1/": false,
   "http://[google.com]/": false,
   "http://example.org/foo/bar#\\": false,
   "http://example.org/foo/[61:24:74]:98": false,
   "gopher://example.com/": false,
   "http://example.org/foo/bar#\u03b2": false,
   "http://example.com/foo%2%C3%82%C2%A9zbar": false,
   "gopher://foo/": false,
   "http://example.com/foo%2zbar": false,
   "http://f:b/c": false,
   "a: foo.com": false,
   "http://[1::2]:3:4/": false,
   "http://example.org/foo/[61:27]/:foo": false,
   "http://f:%2021%20/%20b%20?%20d%20# e": false,
   "http://example.com/foo%2": false,
   "http://foo/path;a??e#f#g": false,
   "data:example.com/": false,
   "http://www.google.com/foo?bar=baz# \u00bb": false,
   "http://f:%20/c": false,
   "http://:www.example.com/": false,
   "data:/example.com/": false

With this data, we can have a discussion as to whether Mark's script 
should be updated, or Chrome should change, or some spec should change.
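
The filtering step itself is simple once a validity check exists as
code.  Here is a minimal sketch.  To be clear, looksValid below is NOT
Mark's script: it is a much cruder stand-in that I'm assuming only for
illustration, rejecting strings that contain anything other than
printable, non-space ASCII or a "%" not followed by two hex digits.
rejectedOutputs is likewise a made-up name.  The first two sample
strings come from the list above; the third is added for contrast.

```javascript
// Stand-in validity check (an assumption, far weaker than a real URI
// grammar): only printable non-space ASCII, and every "%" must start a
// two-hex-digit escape.
function looksValid(uri) {
  if (/[^\x21-\x7e]/.test(uri)) return false;
  if (/%(?![0-9A-Fa-f]{2})/.test(uri)) return false;
  return true;
}

// Hypothetical filter: keep only the outputs the checker rejects, in the
// same "output": false shape as the list above.
function rejectedOutputs(outputs, isValid) {
  const report = {};
  for (const output of outputs) {
    if (!isValid(output)) report[output] = false;
  }
  return report;
}

const sample = [
  "http://example.com/foo%",
  "http://example.com/foo%2zbar",
  "http://example.com/foo%20bar", // not from the list; a well-formed escape
];
console.log(rejectedOutputs(sample, looksValid));
```

Because the checker is passed in as a function, swapping in a faithful
port of a real validity script would not require touching the filter.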

I have taken this a step further.  I wrote a script that will convert 
Mark's script from Python to JavaScript.  This means that he can 
maintain his script in Python and I can include an equivalent script on 
web pages and use that information to filter out things that aren't 
interesting or highlight things that are.

Some people here may have different interests, and that is OK too.  I 
encourage everybody to find a way to express their interests in the form 
of code.  While my preference is JavaScript, I'm otherwise pretty 
agnostic.  XSLT, Perl, PHP, C#: doesn't matter to me.

If you do so, I'll find a way to include that as a filter or as a base 
for comparison.  Doing so means that as the test suite grows, or 
implementation results change(*), you will be able to get instant 
results to the questions that interest you.

- Sam Ruby

(*) As an example, I'm trying to get updated results for IE.  Current 
status:

http://intertwingly.net/blog/2015/01/08/Ununzippable-Modern-IE