Re: [apps-discuss] Fun with URLs and regex

t.petch <> Thu, 29 January 2015 10:49 UTC

----- Original Message -----
From: "Matthew Kerwin" <>
To: "Roy T. Fielding" <>
Cc: "IETF Apps Discuss" <>
Sent: Thursday, January 29, 2015 5:14 AM

On 29 January 2015 at 11:24, Roy T. Fielding <> wrote:

> It isn't that black and white.  The grammar for the scheme is what
> excludes a fragment.  That doesn't prevent the scheme docs from
> talking about fragments (in reference to RFC3986) and using them
> within examples.
> That part of the URI spec was written specifically to address
> issues created by folks who thought they could redefine the meaning
> of fragments within individual schemes, or forbid them entirely,
> when in fact the meaning and use of fragments are independent of
> scheme.

Poking around again, I just saw the line "Fragment identifier semantics
are independent of the URI scheme and thus cannot be redefined by scheme
specifications." So yes, I think I now understand where everyone is
coming from in the discussion, and I also think I understand RFC 3986
less than I did before. Time for more reading...

RFC 3986 "defines a grammar that is a superset of all valid URIs,
allowing an implementation to parse the common components of a URI
reference without knowing the scheme-specific requirements of every
possible identifier", that's clear enough - it's not an abstract grammar.
The twist that I'd missed was that, of the resulting components, not all
of them are inputs to the individual schemes; the 'fragment' component
feeds into the content handler directly.
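
To illustrate that generic split, here is a minimal sketch using the regular expression given in RFC 3986, Appendix B. The helper name `splitUri` and the sample URI are made up for illustration; the point is only that the fragment falls out of the generic parse before any scheme-specific grammar is consulted.

```javascript
// Regular expression from RFC 3986, Appendix B. It separates the
// generic components without any scheme-specific knowledge.
const URI_PARTS =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name, not part of any library.
function splitUri(ref) {
  // Every group is optional, so exec() always returns a match.
  const m = URI_PARTS.exec(ref);
  return {
    scheme: m[2],    // undefined when absent
    authority: m[4], // undefined when there is no "//"
    path: m[5],      // always a string, possibly empty
    query: m[7],     // undefined when there is no "?"
    fragment: m[9],  // undefined when there is no "#"
  };
}

// Sample input chosen for illustration only.
const parts = splitUri("file:///etc/hosts?x=1#line-3");
console.log(parts);
```

Only scheme, authority, path and query would then be checked against a scheme's own grammar; the fragment is left for whatever handles the retrieved content.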

I'm still suffering a misalignment: RFC 3986 defines the whole generic
syntax as:

    URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

and my draft that references it essentially defines (or will soon) the
whole file-URI syntax as:

    file-URI = subset-of-scheme ":" subset-of-hier-part

By leaving off the query part, I've either said that a URI with a query
part cannot be a 'file' URI, or that a URI that starts with "file:" and
has a query part is invalid. Potayto potahto.


When I read your I-D, I assumed that you knew what you were doing :-)  I
can see a use for 'query', but if no one implements it, then better the
I-D left it out.  I did have it as a point to pursue, though, once the
bigger issue (to me) is dealt with: splitting what is valid according to
RFC3986 from what is not, and seeing that reflected in the I-D (and of
course, there is the mini-charter to agree :-( ).

Tom Petch


I don't know what I've said by leaving off the fragment part. According
to RFC 3986 I'm not allowed to touch it, but now I have a collected ABNF
for 'file' URIs that doesn't have fragments. I think my way forward is
to include the fragment part in the grammar, and then deflect it the way
RFC 7230, etc. do.  "The fragment's format and resolution is ... dependent
on the media type of a potentially retrieved representation, even though
such
a retrieval is only performed if the URI is dereferenced. ... The
MAY either assume a media type of "application/octet-stream" or examine
data to determine its type." Or something to similar effect. Does that
sound right to you? How do other representation-retrieving schemes deal
with it?

> What makes you think that dereferenced files don't have a well-defined
> content type?  The client might not know what it is, but that doesn't
> mean the content type doesn't exist, and any decision to process the
> file is basically an assumption of some content type (and its rules
> for processing fragments).
Perhaps I should have said "well known" instead of "well-defined," or
that the *means of determining* the content type is not well defined.

Back to Sam's library:

> Based on this discussion, I am gathering that the correct way to
> a URI with a known scheme is as follows:
>   return new RegExp("^" + known[scheme] + "($|#" + fragment +
> Anybody care to confirm or deny?

What with all the reading and thinking I've just done, I suppose that's
right. As long as all the known schemes don't have their own #fragment
(http_URI does).

Aesthetically it's a strange way to denote an optional fragment, but it
gets there in the end.
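
For comparison, a more conventional way to write "scheme-specific part, then an optional fragment" might look like the sketch below. It is not Sam's library code: the `known` table, its simplified 'file' pattern, and the function name `uriPattern` are all assumptions made for illustration. Only the fragment character set is taken from RFC 3986 (fragment = *( pchar / "/" / "?" )).

```javascript
// Fragment syntax per RFC 3986, section 3.5:
//   fragment = *( pchar / "/" / "?" )
const FRAGMENT = "(?:[A-Za-z0-9._~!$&'()*+,;=:@/?-]|%[0-9A-Fa-f]{2})*";

// Hypothetical per-scheme patterns. Each one covers everything up to,
// but not including, the fragment. The 'file' entry is a deliberately
// simplified stand-in with no query part, not the draft's grammar.
const known = {
  file: "file:(?://[^/?#]*)?/[^?#]*",
};

// Build an anchored pattern: scheme-specific part, then an optional
// "#fragment" handled generically for every scheme.
function uriPattern(scheme) {
  return new RegExp("^" + known[scheme] + "(?:#" + FRAGMENT + ")?$");
}

const fileUri = uriPattern("file");
console.log(fileUri.test("file:///etc/hosts"));       // no fragment
console.log(fileUri.test("file:///etc/hosts#top"));   // with fragment
console.log(fileUri.test("file:///etc/hosts?x=1"));   // query not in this sketch's grammar
```

As noted above, this only works if the entries in `known` do not already include their own fragment part.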

  Matthew Kerwin