Re: [apps-discuss] Fun with URLs and regex

Mark Nottingham <mnot@mnot.net> Wed, 28 January 2015 01:54 UTC

From: Mark Nottingham <mnot@mnot.net>
In-Reply-To: <54AEB660.1020701@intertwingly.net>
Date: Wed, 28 Jan 2015 12:53:48 +1100
Content-Transfer-Encoding: quoted-printable
Message-Id: <F122ADA8-4A96-4F88-BB9F-3C5C6A544067@mnot.net>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net> <54AEB660.1020701@intertwingly.net>
To: Sam Ruby <rubys@intertwingly.net>
X-Mailer: Apple Mail (2.1993)
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/O6GEU3Oj6bb_7c_n7wKr-o9PC0w>
Cc: Alex Russell <slightlyoff@google.com>, Domenic Denicola <d@domenic.me>, IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

Hi Sam,

> On 9 Jan 2015, at 3:54 am, Sam Ruby <rubys@intertwingly.net> wrote:
> 
> Mark cares about valid URIs.  He's certainly not alone in that.  What he has done is express his interests not merely in high-level prose, but in concrete, executable form.  Given that he has done that, I can pose some interesting questions.  For example, if you consider the process of canonicalizing an href value on an <a> element and stringifying the result, an implementation like Chrome will produce something that will be sent across the wire.  I've captured the results here:
> 
> https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome
> 
> Given that data and Mark's script, I can produce a list of outputs that Mark doesn't consider valid:

[...]

> With this data, we can have a discussion as to whether Mark's script should be updated, or Chrome should change, or some spec should change.

What I was hoping for was an update to the "valid URI" filter at <https://url.spec.whatwg.org/interop/test-results/?filter=valid> that takes this into account.

E.g., that list currently includes test case 63, "https:/example.com/", which is valid according to the generic syntax in RFC 3986, but not once you consider the scheme-specific constraints for HTTPS in RFC 7230.
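
To make that distinction concrete, here is a rough Python sketch. It is not the filter script discussed in this thread; the function names (split_uri, generic_ok, http_scheme_ok) are hypothetical, the generic check is deliberately simplified (a well-formed scheme plus permitted characters only), and the scheme-specific check only covers the rule that an http(s) URI needs "//" followed by a non-empty host.

```python
import re

# Component-splitting regex adapted from RFC 3986, Appendix B.
# It splits a reference into parts; it does not validate it.
URI_PARTS = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.DOTALL,
)
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*$")
# Simplified assumption: only characters from the RFC 3986 repertoire appear.
ALLOWED = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*$")


def split_uri(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    return URI_PARTS.match(uri).groups()


def generic_ok(uri):
    """Rough generic-syntax check: permitted characters and a well-formed scheme."""
    if not ALLOWED.match(uri):
        return False
    scheme = split_uri(uri)[0]
    return scheme is not None and SCHEME.match(scheme) is not None


def http_scheme_ok(uri):
    """Scheme-specific check: http(s) needs '//' and a non-empty host."""
    scheme, authority, _path, _query, _fragment = split_uri(uri)
    if scheme is None or scheme.lower() not in ("http", "https"):
        return True  # no http(s)-specific constraint applies
    if authority is None:
        return False  # "https:/example.com/" lands here: no "//" at all
    host = authority.rsplit("@", 1)[-1]  # drop any userinfo
    if host.startswith("["):
        host = host[: host.find("]") + 1]  # keep a bracketed IP literal whole
    else:
        host = host.split(":", 1)[0]  # drop any port
    return host != ""


for candidate in ("https:/example.com/", "https://example.com/"):
    print(candidate, generic_ok(candidate), http_scheme_ok(candidate))
```

Under these assumptions, "https:/example.com/" passes the generic check (scheme "https", no authority, path "/example.com/") but fails the scheme-specific one, which is the kind of case such a filter would drop.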

By filtering out these cases, we can see the places we potentially need to pay attention to in the RFCs.

Is that possible?

Cheers,


--
Mark Nottingham   https://www.mnot.net/