Re: [apps-discuss] Fun with URLs and regex

Sam Ruby <rubys@intertwingly.net> Wed, 28 January 2015 21:51 UTC

Message-ID: <54C959C9.2090002@intertwingly.net>
Date: Wed, 28 Jan 2015 16:51:05 -0500
From: Sam Ruby <rubys@intertwingly.net>
To: "Roy T. Fielding" <fielding@gbiv.com>, Julian Reschke <julian.reschke@gmx.de>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net> <54AEB660.1020701@intertwingly.net> <F122ADA8-4A96-4F88-BB9F-3C5C6A544067@mnot.net> <54C84872.5040902@intertwingly.net> <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net> <54C95132.2060402@gmx.de> <154ABFBB-AB8C-447A-89A3-D1746EFBF1C6@gbiv.com>
In-Reply-To: <154ABFBB-AB8C-447A-89A3-D1746EFBF1C6@gbiv.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/R5UNb-AXxzD_r-bYqZXSxzyNkPc>
Cc: Mark Nottingham <mnot@mnot.net>, IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

On 01/28/2015 04:40 PM, Roy T. Fielding wrote:
> On Jan 28, 2015, at 1:14 PM, Julian Reschke wrote:
>> On 2015-01-28 06:36, Mark Nottingham wrote:
>>> ... which brings about another interesting observation -- only http and https define fragments in their syntax; the other schemes do not.
>>> ...
>>
>> It's because you asked for that in <https://lists.w3.org/Archives/Public/ietf-http-wg/2013AprJun/0187.html>, and apparently were successful in convincing Roy.
>
> It is a very very very old debate regarding whether the fragment is part
> of a URI or something attached to the end of a URI, but that was resolved
> in RFC3986 (since the only thing that really matters here is that a fragment
> is going to be parsed as such regardless of the scheme).
>
> HTTP was merely updated to reflect what STD66 calls a URI.

Based on this discussion, I gather that the correct way to validate a
URI with a known scheme is as follows:

   return new RegExp("^" + known[scheme] + "($|#" + fragment + ")").test(string)

Anybody care to confirm or deny?
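
For anyone who wants to try the idea without the full script, here is a
minimal self-contained sketch. The entries in `known` are hypothetical,
deliberately simplified placeholders and are not the patterns from
uri-validate.js; the `validate` name and the scheme-extraction step are
likewise assumptions. The `fragment` pattern follows the RFC 3986 ABNF
(`*( pchar / "/" / "?" )`). Note that this variant makes the fragment
optional and anchors the end of the string after it:

```javascript
// Fragment per RFC 3986: *( pchar / "/" / "?" ), where pchar is
// unreserved / pct-encoded / sub-delims / ":" / "@".
const fragment = "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*";

// HYPOTHETICAL, simplified per-scheme patterns (fragment excluded).
// The real table in uri-validate.js is not reproduced here.
const known = {
  http: "http://[^\\s/?#]+[^\\s?#]*(?:\\?[^\\s#]*)?",
  mailto: "mailto:[^\\s@?#]+@[^\\s@?#]+",
};

// Assumed helper: pick the scheme off the front of the string, then test
// the scheme-specific pattern followed by an optional "#" fragment.
function validate(string) {
  const match = /^([A-Za-z][A-Za-z0-9+.\-]*):/.exec(string);
  if (!match) return false;
  const scheme = match[1].toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(known, scheme)) return false;
  // "$" sits after the optional fragment, so trailing junk is rejected.
  const pattern = "^(?:" + known[scheme] + ")(?:#" + fragment + ")?$";
  return new RegExp(pattern, "i").test(string);
}

console.log(validate("http://example.com/a?b=c#frag"));
console.log(validate("http://example.com/#a b"));
```

With these illustrative patterns, the form "($|#" + fragment + ")" would
accept "http://example.com/#a b", because nothing anchors the end of the
string once the "#" branch is taken. Whether that matters for the real
script depends on how its `fragment` pattern is written.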

The full script is available here:

https://url.spec.whatwg.org/reference-implementation/uri-validate.js

The function in question is at the bottom of this script.

> ....Roy

- Sam Ruby