Re: [apps-discuss] Fun with URLs and regex

Matthew Kerwin <> Thu, 29 January 2015 05:14 UTC

Date: Thu, 29 Jan 2015 15:14:11 +1000
From: Matthew Kerwin <>
To: "Roy T. Fielding" <>
Cc: IETF Apps Discuss <>

On 29 January 2015 at 11:24, Roy T. Fielding <> wrote:

> On Jan 28, 2015, at 3:59 PM, Matthew Kerwin wrote:
> >
> > Whether or not I mention it comes back to the definition and intended
> > use-case of RFC 3986; if it defines an 'abstract' syntax - in the OOP
> > sense - then there's no such thing as a universal parser (i.e. it's
> > impossible to parse a URI with an unknown scheme). If it defines a
> > low-level structure, then any URI can be parsed, but the individual
> > components can't be validated without deferring to scheme-specific
> > machinery.
> >
> > If the former and I don't include the fragment in 'file', it isn't
> > allowed. If the latter, I just leave a hole in the spec.
> It isn't that black and white.  The grammar for the scheme is what
> excludes a fragment.  That doesn't prevent the scheme docs from
> talking about fragments (in reference to RFC3986) and using them
> within examples.
>
> That part of the URI spec was written specifically to address
> issues created by folks who thought they could redefine the meaning
> of fragments within individual schemes, or forbid them entirely,
> when in fact the meaning and use of fragments are independent of
> scheme.

Poking around again, I just saw the line "Fragment identifier semantics are
independent of the URI scheme and thus cannot be redefined by scheme
specifications." So yes, I think I now understand where everyone is coming
from in the discussion, and I also think I understand RFC 3986 less than I
did before. Time for more reading...

RFC 3986 "defines a grammar that is a superset of all valid URIs, allowing
an implementation to parse the common components of a URI reference without
knowing the scheme-specific requirements of every possible identifier",
that's clear enough - it's not an abstract grammar. The twist that I'd
missed was that, of the resulting components, not all of them are inputs to
the individual schemes; the 'fragment' component feeds into the content
handler directly.
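
As a rough illustration of that scheme-independent split, here is a minimal
sketch built on the regular expression given in RFC 3986, Appendix B. The
helper name `splitUri` is hypothetical and chosen only for this example. The
expression separates components; it validates nothing.

```javascript
// Regular expression from RFC 3986, Appendix B. It splits any URI
// reference into generic components without knowing the scheme.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name, not part of any library.
function splitUri(ref) {
  // Every group is optional, so exec() always returns a match here.
  const m = URI_PARTS.exec(ref);
  return {
    scheme: m[2],    // undefined when absent
    authority: m[4], // undefined when there is no "//"
    path: m[5],      // always a string, possibly empty
    query: m[7],     // undefined when there is no "?"
    fragment: m[9],  // undefined when there is no "#"
  };
}

// The scheme "foo" is unknown, yet the components still come apart.
console.log(splitUri("foo://example.com/a/b?x=1#frag"));
```

Only scheme, authority, path and query would then be handed to scheme-specific
validation; the fragment is set aside for whatever handles the content.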

I'm still suffering a misalignment: RFC 3986 defines the whole generic URI
syntax as:

    URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

and my draft that references it essentially defines (or will soon define)
the whole file-URI syntax as:

    file-URI = subset-of-scheme ":" subset-of-hier-part

By leaving off the query part, I've either said that a URI with a query
part cannot be a 'file' URI, or that a URI that starts with "file:" and has
a query part is invalid. Potayto potahto.

I don't know what I've said by leaving off the fragment part. According to
RFC 3986 I'm not allowed to touch it, but now I have a collected ABNF
for 'file' URIs that doesn't have fragments. I think my way forward is to
include the fragment part in the grammar, and then deflect it the way 3986,
7230, etc. do.  "The fragment's format and resolution is ... dependent on
the media type of a potentially retrieved representation, even though such
a retrieval is only performed if the URI is dereferenced. ... The [client]
MAY either assume a media type of "application/octet-stream" or examine the
data to determine its type." Or something to similar effect. Does that seem
right to you? How do other representation-retrieving schemes deal with it?
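
One way to picture "deflecting" the fragment is a lookup keyed on media type
rather than on scheme. The sketch below is only an illustration under that
assumption: the registry `fragmentHandlers`, the function `interpretFragment`
and the shape of the returned objects are all hypothetical.

```javascript
// Hypothetical registry mapping a media type to a fragment interpreter.
const fragmentHandlers = new Map([
  // In HTML a fragment commonly names an element by its id.
  ["text/html", (fragment) => ({ kind: "element-id", value: fragment })],
]);

// The scheme is deliberately not a parameter: only the media type of the
// retrieved representation decides how the fragment is interpreted.
function interpretFragment(fragment, mediaType) {
  // Fall back to a generic type when the client cannot tell what it has.
  const type = mediaType || "application/octet-stream";
  const handler = fragmentHandlers.get(type);
  return handler ? handler(fragment) : { kind: "opaque", value: fragment };
}

console.log(interpretFragment("intro", "text/html"));
console.log(interpretFragment("intro"));
```

With no registered handler the fragment is simply carried along uninterpreted,
which is roughly what a 'file' client with no type information could do.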

> What makes you think that dereferenced files don't have a well-defined
> content type?  The client might not know what it is, but that doesn't
> mean the content type doesn't exist, and any decision to process the
> file is basically an assumption of some content type (and its rules
> for processing fragments).
Perhaps I should have said "well known" instead of "well-defined," or that
the *means of determining* the content type is not well defined.

Back to Sam's library:

> Based on this discussion, I am gathering that the correct way to validate
> a URI with a known scheme is as follows:
>   return new RegExp("^" + known[scheme] + "($|#" + fragment +
> Anybody care to confirm or deny?

What with all the reading and thinking I've just done, I suppose that's
right. As long as none of the known scheme patterns already include their own
#fragment part (http_URI does).

Aesthetically it's a strange way to denote an optional fragment, but it gets
there in the end.
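
For comparison, a minimal sketch of the same idea with the optional fragment
written as an ordinary optional group. The `known` table here is a
hypothetical stand-in with deliberately simplified patterns, not the
library's real ones, and `validate` is an illustrative name. The fragment
pattern follows the generic RFC 3986 rule, fragment = *( pchar / "/" / "?" ).

```javascript
// Generic fragment syntax from RFC 3986: *( pchar / "/" / "?" ).
const FRAGMENT = "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*";

// Hypothetical, simplified per-scheme patterns. Neither includes a
// fragment part of its own, which is the caveat raised above.
const known = {
  file: "file:(?://[^/?#]*)?/[^?#]*",
  mailto: "mailto:[^?#]+",
};

function validate(scheme, uri) {
  // Unknown schemes cannot be validated by this table at all.
  if (!Object.prototype.hasOwnProperty.call(known, scheme)) return false;
  // Scheme-specific part, then an optional generic fragment, then the end.
  const re = new RegExp("^" + known[scheme] + "(?:#" + FRAGMENT + ")?$");
  return re.test(uri);
}

console.log(validate("file", "file:///etc/hosts#top"));
console.log(validate("file", "file:///etc/hosts?x=1"));
```

A table entry that already carries its own fragment part would need that part
removed first, or the generic suffix skipped for that entry.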

  Matthew Kerwin