Re: [apps-discuss] Fun with URLs and regex

Julian Reschke <julian.reschke@gmx.de> Wed, 28 January 2015 21:15 UTC

Message-ID: <54C95132.2060402@gmx.de>
Date: Wed, 28 Jan 2015 22:14:26 +0100
From: Julian Reschke <julian.reschke@gmx.de>
To: Mark Nottingham <mnot@mnot.net>, Sam Ruby <rubys@intertwingly.net>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net> <54AEB660.1020701@intertwingly.net> <F122ADA8-4A96-4F88-BB9F-3C5C6A544067@mnot.net> <54C84872.5040902@intertwingly.net> <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net>
In-Reply-To: <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net>
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/ajRlmKAc99n7A2QAmpH3CtTmUvw>
Cc: IETF Apps Discuss <apps-discuss@ietf.org>

On 2015-01-28 06:36, Mark Nottingham wrote:
>
>> On 28 Jan 2015, at 1:24 pm, Sam Ruby <rubys@intertwingly.net> wrote:
>>
>> On 1/27/15 8:53 PM, Mark Nottingham wrote:
>>> Hi Sam,
>>>
>>>> On 9 Jan 2015, at 3:54 am, Sam Ruby <rubys@intertwingly.net> wrote:
>>>>
>>>> Mark cares about valid URIs.  He's certainly not alone in that.  What he has done is express his interests not merely in high level prose, but in concrete, executable form.  Given that he has done that, I can pose some interesting questions.  For example, if you consider the process of canonicalizing a href value on an <a> element and stringifying the result, an implementation like Chrome will produce something that will be sent across the wire.  I've captured the results here:
>>>>
>>>> https://raw.githubusercontent.com/webspecs/url/develop/evaluate/useragent-results/chrome
>>>>
>>>> Given that data and Mark's script, I can produce a list of outputs that Mark doesn't consider valid:
>>>
>>> [...]
>>>
>>>> With this data, we can have a discussion as to whether Mark's script should be updated, or Chrome should change, or some spec should change.
>>>
>>> What I was hoping for was an update of the "valid URI" filter to take this into account at <https://url.spec.whatwg.org/interop/test-results/?filter=valid>.
>>>
>>> E.g., that list currently includes test case 63, "https:/example.com/", which is valid according to the generic syntax in RFC3986, but not when you consider the scheme-specific constraints for HTTPS in RFC7230.
>>>
>>> By filtering out these cases, we can see the places we potentially need to pay attention to in the RFCs.
>>>
>>> Is that possible?
>>
>> Since that code runs filters on the browser, it would be easiest for me to integrate code written in JavaScript.
>>
>> Can you review the generated JavaScript?
>>
>> https://url.spec.whatwg.org/reference-implementation/uri-validate.js
>> https://url.spec.whatwg.org/reference-implementation/uri-validate.html
>
> It looks like a reasonable transcription. I do notice you copied my error:
>
> return new RegExp("^" + known[scheme] + "(#|$)").test(string)
>
> I think it should just be:
>
> return new RegExp("^" + known[scheme] + "$").test(string)
>
> ... which brings about another interesting observation -- only http and https define fragments in their syntax; the other schemes do not.
> ...


It's because you asked for that in 
<https://lists.w3.org/Archives/Public/ietf-http-wg/2013AprJun/0187.html>, and 
apparently were successful in convincing Roy.

(I still disagree with this outcome :-)
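
For reference, the difference between the two anchors quoted above can be sketched roughly as follows. The `known` table here is a deliberately tiny, hypothetical stand-in (the patterns in uri-validate.js are far more detailed), and the `demo` scheme and both function names are made up for illustration; only the two RegExp constructions are taken from the quoted text.

```javascript
// Hypothetical, heavily simplified scheme patterns (assumption: the real
// `known` table in uri-validate.js is much stricter than this).
const known = {
  // A scheme whose syntax has no fragment component.
  demo: "demo:[a-z]+",
  // A scheme whose syntax already includes an optional fragment.
  http: "http://[a-z.]+(/[a-z]*)?(#[a-z]*)?"
};

// The construction quoted as the error: matching stops at the first "#"
// that follows a valid prefix, so whatever comes after it is never checked.
function validateLoose(scheme, string) {
  return new RegExp("^" + known[scheme] + "(#|$)").test(string);
}

// The suggested construction: the whole string must match the scheme's
// own syntax, including any fragment that syntax defines.
function validateStrict(scheme, string) {
  return new RegExp("^" + known[scheme] + "$").test(string);
}

console.log(validateLoose("demo", "demo:abc#<junk>"));   // true
console.log(validateStrict("demo", "demo:abc#<junk>"));  // false
console.log(validateStrict("http", "http://example.com/#top")); // true
```

With the strict form, a fragment is accepted only where the scheme's own pattern allows one, which is why it matters whether a scheme defines fragments in its syntax.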

Best regards, Julian