Re: [apps-discuss] Fun with URLs and regex

Mark Nottingham <mnot@mnot.net> Wed, 28 January 2015 05:36 UTC

From: Mark Nottingham <mnot@mnot.net>
Date: Wed, 28 Jan 2015 16:36:20 +1100
To: Sam Ruby <rubys@intertwingly.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/LlcwghSryUpFJV6qmsOVerZvaQ0>
Cc: IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

> On 28 Jan 2015, at 1:24 pm, Sam Ruby <rubys@intertwingly.net> wrote:
> 
> On 1/27/15 8:53 PM, Mark Nottingham wrote:
>> Hi Sam,
>> 
>>> On 9 Jan 2015, at 3:54 am, Sam Ruby <rubys@intertwingly.net> wrote:
>>> 
>>> Mark cares about valid URIs.  He's certainly not alone in that.  What he has done is express his interests not merely in high level prose, but in concrete, executable form.  Given that he has done that, I can pose some interesting questions.  For example, if you consider the process of canonicalizing a href value on an <a> element and stringifying the result, an implementation like Chrome will produce something that will be sent across the wire.  I've captured the results here:
>>> 
>>> https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome
>>> 
>>> Given that data and Mark's script, I can produce a list of outputs that Mark doesn't consider valid:
>> 
>> [...]
>> 
>>> With this data, we can have a discussion as to whether Mark's script should be updated, or Chrome should change, or some spec should change.
>> 
>> What I was hoping for was an update of the "valid URI" filter to take this into account at <https://url.spec.whatwg.org/interop/test-results/?filter=valid>.
>> 
>> E.g., that list currently includes test case 63, "https:/example.com/", which is valid according to the generic syntax in RFC3986, but not when you consider the scheme-specific constraints for HTTPS in RFC7230.
>> 
>> By filtering out these cases, we can see the places we potentially need to pay attention to in the RFCs.
>> 
>> Is that possible?
> 
> Since that code runs filters in the browser, it would be easiest for me to integrate code written in JavaScript.
> 
> Can you review the generated JavaScript?
> 
> https://url.spec.whatwg.org/reference-implementation/uri-validate.js
> https://url.spec.whatwg.org/reference-implementation/uri-validate.html

It looks like a reasonable transcription. I do notice you copied my error:

return new RegExp("^" + known[scheme] + "(#|$)").test(string)

I think it should just be:

return new RegExp("^" + known[scheme] + "$").test(string)

... which brings up another interesting observation -- only http and https define fragments in their own syntax; the other schemes do not.
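
To make the difference between the two anchors concrete, here is a minimal sketch. The `known` table and its patterns are hypothetical and heavily simplified for illustration; they are not the regexes from uri-validate.js, and the function names `validateLoose` and `validateStrict` are made up here.

```javascript
// Hypothetical, simplified patterns (NOT the real ones from uri-validate.js).
const known = {
  // The fragment is part of this scheme's own syntax.
  http: "http://[a-z0-9.-]+(/[a-z0-9/._-]*)?(#[a-z0-9]*)?",
  // No fragment in this scheme's syntax.
  mailto: "mailto:[a-z0-9.]+@[a-z0-9.-]+",
};

// Original form: stops checking as soon as it sees "#" or end of string.
function validateLoose(scheme, string) {
  return new RegExp("^" + known[scheme] + "(#|$)").test(string);
}

// Proposed form: the scheme's pattern must match the whole string.
function validateStrict(scheme, string) {
  return new RegExp("^" + known[scheme] + "$").test(string);
}

// Anything after "#" goes unchecked in the loose form.
console.log(validateLoose("http", "http://example.com/#frag <junk>"));  // true
console.log(validateStrict("http", "http://example.com/#frag <junk>")); // false
console.log(validateStrict("http", "http://example.com/#frag"));        // true

// A scheme without a fragment in its syntax rejects one in the strict form.
console.log(validateLoose("mailto", "mailto:a@example.com#x y"));       // true
console.log(validateStrict("mailto", "mailto:a@example.com#x"));        // false
```

With "(#|$)" the text following the "#" is never validated, so the strict form only accepts a fragment when the scheme's own pattern allows one.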


> Also, I would like to discuss where this code should live:
> 
> http://www.ietf.org/mail-archive/web/apps-discuss/current/msg13635.html

I want to keep it going in Python, so I'll probably create a separate repo at some point; is it bothersome to just keep the JS in your repo?

If bugs are found in the regexes themselves, I'll make sure to notify you (and would appreciate the same).

Cheers,



--
Mark Nottingham   https://www.mnot.net/