Re: [apps-discuss] Fun with URLs and regex

Sam Ruby <rubys@intertwingly.net> Wed, 28 January 2015 12:32 UTC

Message-ID: <54C8D6EC.4030306@intertwingly.net>
Date: Wed, 28 Jan 2015 07:32:44 -0500
From: Sam Ruby <rubys@intertwingly.net>
To: Mark Nottingham <mnot@mnot.net>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net> <54AEB660.1020701@intertwingly.net> <F122ADA8-4A96-4F88-BB9F-3C5C6A544067@mnot.net> <54C84872.5040902@intertwingly.net> <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net>
In-Reply-To: <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/Trhz66nTmHxjFes-Yyd1g4ELs-k>
Cc: IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

On 01/28/2015 12:36 AM, Mark Nottingham wrote:
>
>> On 28 Jan 2015, at 1:24 pm, Sam Ruby <rubys@intertwingly.net> wrote:
>>
>> On 1/27/15 8:53 PM, Mark Nottingham wrote:
>>> Hi Sam,
>>>
>>>> On 9 Jan 2015, at 3:54 am, Sam Ruby <rubys@intertwingly.net> wrote:
>>>>
>>>> Mark cares about valid URIs.  He's certainly not alone in that.  What he has done is express his interests not merely in high-level prose, but in concrete, executable form.  Given that he has done that, I can pose some interesting questions.  For example, if you consider the process of canonicalizing an href value on an <a> element and stringifying the result, an implementation like Chrome will produce something that will be sent across the wire.  I've captured the results here:
>>>>
>>>> https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome
>>>>
>>>> Given that data and Mark's script, I can produce a list of outputs that Mark doesn't consider valid:
>>>
>>> [...]
>>>
>>>> With this data, we can have a discussion as to whether Mark's script should be updated, or Chrome should change, or some spec should change.
>>>
>>> What I was hoping for was an update of the "valid URI" filter to take this into account at <https://url.spec.whatwg.org/interop/test-results/?filter=valid>.
>>>
>>> E.g., that list currently includes test case 63, "https:/example.com/", which is valid according to the generic syntax in RFC3986, but not when you consider the scheme-specific constraints for HTTPS in RFC7230.
>>>
>>> By filtering out these cases, we can see the places we potentially need to pay attention to in the RFCs.
>>>
>>> Is that possible?
>>
>> Since that code runs filters on the browser, it would be easiest for me to integrate code written in JavaScript.
>>
>> Can you review the generated JavaScript?
>>
>> https://url.spec.whatwg.org/reference-implementation/uri-validate.js
>> https://url.spec.whatwg.org/reference-implementation/uri-validate.html
>
> It looks like a reasonable transcription. I do notice you copied my error:
>
> return new RegExp("^" + known[scheme] + "(#|$)").test(string)

Not a copy: it's actually different.  My version checks for a match on 
the regular expression followed by either a hash mark or the end of the 
string.
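
A minimal sketch of how the two anchorings differ. The `known` table below and its single, heavily simplified pattern are hypothetical stand-ins for illustration, not the generated regexes in uri-validate.js; the function names are likewise made up.

```javascript
// Hypothetical, simplified scheme table (assumption: the real patterns
// are far longer and are generated from the Python source).
const known = {
  mailto: "mailto:[^#\\s]*"
};

// Accept a match followed by either a hash mark or the end of the string.
function validAllowingHash(scheme, string) {
  return new RegExp("^" + known[scheme] + "(#|$)").test(string);
}

// Accept only a match that runs all the way to the end of the string.
function validToEnd(scheme, string) {
  return new RegExp("^" + known[scheme] + "$").test(string);
}
```

With this table, "mailto:a@example.com#frag" passes the first check but not the second, while "mailto:a@example.com" passes both.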

> I think it should just be:
>
> return new RegExp("^" + known[scheme] + "$").test(string)
>
> ... which brings about another interesting observation -- only http and https define fragments in their syntax; the other schemes do not.

I can make that change.
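
One consequence of anchoring with "$" is that a fragment is accepted only when the scheme's own pattern allows for one. A rough sketch of that behaviour follows; the patterns, the `schemePatterns` map, and the `isValidForScheme` helper are all hypothetical and much simpler than the real generated regexes.

```javascript
// Hypothetical simplified patterns (assumption: only the http pattern
// carries an optional fragment, mirroring Mark's observation above).
const schemePatterns = new Map([
  ["http", "http://[^/?#\\s]+(/[^?#\\s]*)?(\\?[^#\\s]*)?(#\\S*)?"],
  ["mailto", "mailto:[^?#\\s]+(\\?[^#\\s]*)?"]
]);

// Look up the scheme, then require the whole string to match its pattern.
function isValidForScheme(string) {
  const colon = string.indexOf(":");
  if (colon < 0) return false; // no scheme present
  const scheme = string.slice(0, colon).toLowerCase();
  if (!schemePatterns.has(scheme)) return false; // unknown scheme
  return new RegExp("^" + schemePatterns.get(scheme) + "$").test(string);
}
```

Under these assumptions, "http://example.com/a?b#c" is accepted, whereas "mailto:a@example.com#c" is rejected because the mailto pattern has no fragment component.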

>> Also, I would like to discuss where this code should live:
>>
>> http://www.ietf.org/mail-archive/web/apps-discuss/current/msg13635.html
>
> I want to keep it going in Python, so I'll probably create a separate repo at some point; is it bothersome to just keep the JS in your repo?

I'm not proposing changing it from Python.

What I have is a script that does the conversion:

https://gist.github.com/rubys/b9ccaf304f06cf3e2e88

> If bugs are found in the regexes themselves, I'll make sure to notify you (and would appreciate the same).

I'd suggest putting them both in the same repository.

In any case, later today I'll update my filter to replace the validation 
check with yours.

> Cheers,
>
> --
> Mark Nottingham   https://www.mnot.net/

- Sam Ruby