Re: [apps-discuss] Fun with URLs and regex

t.petch <ietfc@btconnect.com> Thu, 29 January 2015 10:49 UTC

Message-ID: <01f101d03bb1$0eee68e0$4001a8c0@gateway.2wire.net>
From: "t.petch" <ietfc@btconnect.com>
To: Matthew Kerwin <matthew@kerwin.net.au>, "Roy T. Fielding" <fielding@gbiv.com>
References: <C5B10293-E6F6-4348-9782-C9C00A4476CE@mnot.net> <CACweHNBVOrVMesB7HOjPNHe5FtzL1k9XDGAHUXAx5DbOSYv5jA@mail.gmail.com> <A1E5B0EC-FAD5-4178-8C7B-540BEB61DC06@mnot.net> <54AEB660.1020701@intertwingly.net> <F122ADA8-4A96-4F88-BB9F-3C5C6A544067@mnot.net> <54C84872.5040902@intertwingly.net> <EF1E36FA-6A30-4A65-9520-5A31571EE445@mnot.net> <54C95132.2060402@gmx.de> <154ABFBB-AB8C-447A-89A3-D1746EFBF1C6@gbiv.com> <54C95AF7.6030703@gmx.de> <CACweHNBHiEGUwLB3z6YoTexF=b9ApwsUy6-DVCf9vnBSD+L5Rw@mail.gmail.com> <E6AB5A9F-D1DF-45A2-AAEF-FCF2752FD254@gbiv.com> <CACweHNAitEigzDkxOrnR9fkCeMG=ft8g6cVvpmtBrPMMp9xOeA@mail.gmail.com>
Date: Thu, 29 Jan 2015 10:46:09 +0000
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/YsxJZqlfNYml4ocGokW6lnC2_O0>
Cc: IETF Apps Discuss <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Fun with URLs and regex

----- Original Message -----
From: "Matthew Kerwin" <matthew@kerwin.net.au>
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: "IETF Apps Discuss" <apps-discuss@ietf.org>
Sent: Thursday, January 29, 2015 5:14 AM

On 29 January 2015 at 11:24, Roy T. Fielding <fielding@gbiv.com> wrote:

>
> It isn't that black and white.  The grammar for the scheme is what
> excludes a fragment.  That doesn't prevent the scheme docs from
> talking about fragments (in reference to RFC3986) and using them
> within examples.
> That part of the URI spec was written specifically to address
> issues created by folks who thought they could redefine the meaning
> of fragments within individual schemes, or forbid them entirely,
> when in fact the meaning and use of fragments are independent of
> scheme.

Poking around again, I just saw the line "Fragment identifier semantics are
independent of the URI scheme and thus cannot be redefined by scheme
specifications." So yes, I think I now understand where everyone is coming
from in the discussion, and I also think I understand RFC 3986 less than I
did before. Time for more reading...

RFC 3986 "defines a grammar that is a superset of all valid URIs, allowing
an implementation to parse the common components of a URI reference without
knowing the scheme-specific requirements of every possible identifier";
that's clear enough - it's not an abstract grammar. The twist that I'd
missed was that, of the resulting components, not all of them are inputs to
the individual schemes; the 'fragment' component feeds into the content
handler directly.
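
As a rough illustration of that scheme-independent parse, here is a minimal
sketch built on the regular expression given in RFC 3986, Appendix B. The
helper name splitUri and the sample URI are made up for this example and
are not part of any draft or library. The fragment comes out as its own
component before any scheme-specific rule is consulted.

```javascript
// Regular expression from RFC 3986, Appendix B. It splits any URI
// reference into its generic components without knowing the scheme.
const URI_REFERENCE =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name used only for this sketch.
function splitUri(reference) {
  // Every group is optional, so the expression matches any string.
  const m = URI_REFERENCE.exec(reference);
  return {
    scheme: m[2],     // undefined when absent
    authority: m[4],  // "" for "file:///..." (present but empty)
    path: m[5],
    query: m[7],      // undefined when absent
    fragment: m[9],   // undefined when absent
  };
}

// Sample input chosen for illustration only.
const parts = splitUri("file:///etc/hosts?x=1#line-3");
console.log(parts);
```

Only scheme, authority, path and query would then be handed to a
scheme-specific check; the fragment is left for whatever handles the
retrieved content.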

I'm still suffering a misalignment: RFC 3986 defines the whole generic URI
syntax as:

    URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

and my draft that references it essentially defines (or will soon define)
the whole file-URI syntax as:

    file-URI = subset-of-scheme ":" subset-of-hier-part

By leaving off the query part, I've either said that a URI with a query
part cannot be a 'file' URI, or that a URI that starts with "file:" and has
a query part is invalid. Potayto potahto.

<tp>

When I read your I-D, I assumed that you knew what you were doing:-)  I
can see a use for 'query' but if no one implements it, then better that the
I-D leave it out.  But I did have it as a point to pursue, once the
bigger issue, to me, is settled: splitting what is valid according to
RFC 3986 from what is not, and seeing that reflected in the I-D (and of
course, there is the mini-charter to agree :-( ).

Tom Petch

</tp>

I don't know what I've said by leaving off the fragment part. According to
RFC 3986 I'm not allowed to touch it, but now I have a collected ABNF
for 'file' URIs that doesn't have fragments. I think my way forward is to
include the fragment part in the grammar, and then deflect it the way 3986,
7230, etc. do.  "The fragment's format and resolution is ... dependent on
the media type of a potentially retrieved representation, even though such
a retrieval is only performed if the URI is dereferenced. ... The [client]
MAY either assume a media type of "application/octet-stream" or examine the
data to determine its type." Or something to that effect. Does that seem
right to you? How do other representation-retrieving schemes deal with it?


>
> What makes you think that dereferenced files don't have a well-defined
> content type?  The client might not know what it is, but that doesn't
> mean the content type doesn't exist, and any decision to process the
> file is basically an assumption of some content type (and its rules
> for processing fragments).
>
>
Perhaps I should have said "well known" instead of "well-defined," or that
the *means of determining* the content type is not well defined.


Back to Sam's library:

> Based on this discussion, I am gathering that the correct way to validate
> a URI with a known scheme is as follows:
>
>   return new RegExp("^" + known[scheme] + "($|#" + fragment + ")").test(string)
>
> Anybody care to confirm or deny?

What with all the reading and thinking I've just done, I suppose that's
right. As long as none of the known schemes have their own #fragment bits
(http_URI does).

Aesthetically it's a strange way to denote an optional fragment, but it gets
there in the end.
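
For comparison, here is a minimal sketch of the same check with the
fragment written as a conventional optional group. The contents of known
and fragment below are simplified stand-ins invented for this example, not
the patterns from Sam's library, and isValid is a hypothetical name. One
difference is deliberate: the quoted expression has no "$" after the
fragment alternative, so unless the fragment pattern carries its own anchor
it would accept trailing characters after a valid fragment. The sketch
anchors the end of the string in both cases.

```javascript
// Hypothetical stand-in for the per-scheme patterns. This "file" pattern
// is a simplification: "file:" followed by anything except "?" and "#".
const known = {
  file: "file:[^?#]*",
};

// RFC 3986: fragment = *( pchar / "/" / "?" ), written here as a regex
// source string (unreserved, sub-delims, ":", "@", "/", "?", pct-encoded).
const fragment = "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*";

// isValid is a hypothetical name used only for this sketch.
function isValid(scheme, string) {
  // Unknown schemes cannot be validated by this table.
  if (!Object.prototype.hasOwnProperty.call(known, scheme)) return false;
  // Scheme-specific part, then an optional "#fragment", then end of input.
  const pattern = "^" + known[scheme] + "(?:#" + fragment + ")?$";
  return new RegExp(pattern).test(string);
}

console.log(isValid("file", "file:///etc/hosts"));         // no fragment
console.log(isValid("file", "file:///etc/hosts#line-3"));  // with fragment
console.log(isValid("file", "file:///etc/hosts#a#b"));     // second "#"
```

With these stand-in patterns the first two calls pass and the third fails,
because "#" is not allowed inside a fragment and the end anchor now applies
to both branches. A scheme pattern that already includes its own fragment
(as http_URI does) would need the generic suffix left off.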

  Matthew Kerwin
  http://matthew.kerwin.net.au/