Re: [urn] URNs are not URIs (another look at RFC 3986)

worley@ariadne.com (Dale R. Worley) Thu, 17 April 2014 19:49 UTC


On the other hand, if we are considering URNs as they are now
constituted, then I can give some specific comments:

draft-ietf-urnbis-urns-are-not-uris-00 is written in an unusual
format:  it gives a good list of issues, gives a sketch of discussion
and reasoning, and ends by stating a very specific solution.  So I was
rather confused as to exactly how the I-D was to be interpreted -- Is it
the basis from which a discussion will be made?  (If so, why does it
specify the conclusion?)  Is it the final document?  (If so, why does
it describe the issues, but not give the reasoning which leads to the
stated solution?)  After seeing the ensuing discussion on the mailing
list, it's clearer what the purpose of the I-D is, and I'd like to
present my opinions again in a better-organized way.

First, let me enumerate some secondary issues that would clutter the
discussion of the main points:

I conceive that URLs and URNs are two subsets of URIs.  In principle,
they may overlap, and also there are URIs that are neither URLs nor
URNs (e.g., the "tag" scheme).  In practice, I see neither of these
facts as being significant, as there are no practical consequences of
either fact, beyond that specific schemes may be defined to be URLs
and/or URNs.  (This is the "Contemporary View" of
http://www.w3.org/TR/2001/NOTE-uri-clarification-20010921/, but I
arrived at it before reading that document.)

Currently, all URNs are in the scheme "urn".  But it has never been
specified that all URNs must be of the "urn" scheme.  The lack of
clarity on this point may lead to people making the assumption that
because a URI is a URN, it must be of the "urn" scheme.

I can see the philosophical distinction between URNs and URLs, but I
don't see that it requires them to be handled differently at the
protocol level.  This is despite the fact that I'm a mathematician by
training and have a strong sense that the proper structuring of
systems is aided by having clearly organized concepts.  In particular,
both URLs and URNs can be used for the operation of "designating a
resource".  Generic "designating a resource" can be useful in at least
two situations:  (1) when the use of the URI is opaque, and much of
the processing in question can be done without consideration of what
the resource is in particular, and (2) when the processor can resolve
the URN into a concrete representation of the resource.  In the latter
case, the URN allows the processor to access an object in much the
same way as a URL, and the URN acts as a subclass of URL.  As RFC 2141
says, "The URN syntax has been defined so that URNs can be used in
places where URLs are expected."

If we are seriously concerned with persistence of objects, we should
standardize on the most-proven technology available, viz., translating
the object into Sumerian, transcribing it into cuneiform on a clay
tablet, firing the tablet, and then burying it in suitable dry soil.
In particular, I haven't seen any reference to "Ozymandias" being made
persistent in this way.  However, that should be put into a different
I-D, as it is out of scope for this one.

-----

Getting back to the main issues:

Looking at RFC 3986, I see that despite its title "Uniform Resource
Identifier (URI): Generic Syntax", it does contain certain
specifications of the generic semantics of URIs, and these
specifications have some consequences for defining URNs.  (The fact
that these consequences exist is essentially admitting that the
concerns of http://www.w3.org/DesignIssues/ModelConsequences must be
met.  Again, I had come to these conclusions before reading that
document.)

1) Paths:  the use of '/', '.', '..', and how relative URIs are
resolved into absolute URIs

These rules describe how relative URIs are to be interpreted, and in
doing so, specify that a URI is divided into "segments" by '/', and
how a new absolute URI is assembled from the segments of a base URI
and the relative URI.  As a consequence of this process, the use of
'.' and '..' as segments must be avoided in absolute URIs.

I don't see this as significantly restrictive of URNs -- If a URN
scheme wants to take advantage of the relative URI mechanism, it can
do so by conforming to the generic syntax, and if it does not want to
use the relative URI mechanism, it can avoid using '/'.
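
To make the segment handling concrete, here is a minimal sketch of the
dot-segment removal step that RFC 3986 section 5.2.4 applies during
reference resolution.  The function name and the example strings are
mine, chosen for illustration; it covers only this one step, not the
whole of relative resolution.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of the dot-segment removal loop of RFC 3986, section 5.2.4."""
    out = []  # output buffer, one completed segment per entry
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x"
            if out:
                out.pop()            # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any)
            # to the output buffer.
            idx = path.find("/", 1)
            if idx == -1:
                out.append(path)
                path = ""
            else:
                out.append(path[:idx])
                path = path[idx:]
    return "".join(out)


print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
print(remove_dot_segments("example:a:b"))       # example:a:b
```

The second call illustrates the point above: a scheme-specific part
that never uses '/' passes through this step unchanged.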

2) Query:  the use of '?'

A URI containing a query part is related to the URI created by
deleting the query part in some manner, but the manner seems to be
entirely left for definition by the scheme definition:

   The query component contains non-hierarchical data that, along with
   data in the path component (Section 3.3), serves to identify a
   resource within the scope of the URI's scheme and naming authority
   (if any).

That appears to me to be a non-constraint in any practical sense.
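
As a small illustration, a generic parser can split off the query
part without knowing anything about what it means.  This sketch uses
Python's standard urllib.parse.urlsplit; the URN itself is a made-up
example in the "example" namespace, not one taken from any
registration.

```python
from urllib.parse import urlsplit

# Hypothetical URN with a query part; the parser only separates the
# components and attaches no meaning to the query.
parts = urlsplit("urn:example:report?version=2")

print(parts.scheme)  # urn
print(parts.path)    # example:report
print(parts.query)   # version=2
```

What "version=2" does to the identified resource is left entirely to
the scheme, which is the sense in which the generic rule constrains
nothing.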

3) Fragment:  the use of '#'

The use of the fragment part has much more semantic content:

   The fragment identifier component of a URI allows indirect
   identification of a secondary resource by reference to a primary
   resource and additional identifying information.  The identified
   secondary resource may be some portion or subset of the primary
   resource, some view on representations of the primary resource, or
   some other resource defined or described by those representations.

   The semantics of a fragment identifier are defined by the set of
   representations that might result from a retrieval action on the
   primary resource.  The fragment's format and resolution is therefore
   dependent on the media type [RFC2046] of a potentially retrieved
   representation, even though such a retrieval is only performed if the
   URI is dereferenced.  If no such representation exists, then the
   semantics of the fragment are considered unknown and are effectively
   unconstrained.  Fragment identifier semantics are independent of the
   URI scheme and thus cannot be redefined by scheme specifications.

   Individual media types may define their own restrictions on or
   structures within the fragment identifier syntax for specifying
   different types of subsets, views, or external references that are
   identifiable as secondary resources by that media type.  If the
   primary resource has multiple representations, as is often the case
   for resources whose representation is selected based on attributes of
   the retrieval request (a.k.a., content negotiation), then whatever is
   identified by the fragment should be consistent across all of those
   representations.  Each representation should either define the
   fragment so that it corresponds to the same secondary resource,
   regardless of how it is represented, or should leave the fragment
   undefined (i.e., not found).

In short, the full process of dereferencing a URI must be factorable
into three phases:

   - dereference the URI with the fragment part removed to provide a
     set of representations
   - select one of the representations
   - from or based on the selected representation, derive the fragment
     part

and while the first phase is scheme-dependent, the third phase may
only depend on the chosen representation and its media type.
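
The factoring can be sketched as follows.  Everything here is an
assumption made for illustration: the resolver table, the
"text/x-lines" media type, and its "line=N" fragment syntax are
invented, and selection is reduced to "first acceptable media type".

```python
from typing import Callable, Dict, List

Representations = Dict[str, str]  # media type -> representation body


def dereference(
    uri: str,
    resolvers: Dict[str, Callable[[str], Representations]],
    fragment_handlers: Dict[str, Callable[[str, str], str]],
    accept: List[str],
) -> str:
    # Phase 1 (scheme-dependent): resolve the URI without its fragment.
    base, _, fragment = uri.partition("#")
    scheme = base.split(":", 1)[0].lower()
    representations = resolvers[scheme](base)

    # Phase 2: select one representation (a stand-in for negotiation).
    media_type = next(t for t in accept if t in representations)
    body = representations[media_type]
    if not fragment:
        return body

    # Phase 3: the handler is looked up by media type only; the scheme
    # plays no part in interpreting the fragment.
    return fragment_handlers[media_type](body, fragment)


def resolve_urn(base: str) -> Representations:
    # Hypothetical in-memory store standing in for a real resolver.
    store = {"urn:example:doc": {"text/x-lines": "alpha\nbeta\ngamma"}}
    return store[base]


def line_fragment(body: str, fragment: str) -> str:
    # Hypothetical fragment syntax "line=N" (1-based) for text/x-lines.
    n = int(fragment.split("=", 1)[1])
    return body.split("\n")[n - 1]


result = dereference(
    "urn:example:doc#line=2",
    {"urn": resolve_urn},
    {"text/x-lines": line_fragment},
    ["text/x-lines"],
)
print(result)  # beta
```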

The degree of constraint this places on URNs is not clear to me.  If
one wishes a URN-without-fragment to designate a resource whose
representation is a media type whose fragment-access is already
defined, the URN is constrained.  (I know that fragment-access is
defined for HTML documents; is it defined for any other media type?)
If one is free to have one's URN-without-fragment designate a new
media type, the semantics of the fragment part is nearly unlimited, as
long as the base resource representation contains all the information
needed for fragment resolution, as specified by its media type.

4) Syntactic compatibility

A question which seems to me to be getting insufficient attention is
that of syntactic compatibility of all URIs, that is, that all current
and future URIs should conform to the currently set syntax for URIs.
This is required for upward-compatibility with current systems that
validate URIs for syntactic conformity.

In regard to this, it seems to me to be undesirable to decouple the
syntax specification of URNs (or rather, the "urn" scheme) from RFC
3986 -- because we have a de-facto requirement that URNs remain within
3986, formally disconnecting the two will lose formal specification of
this requirement.

However, there is a caveat (vide Phillip Hallam-Baker's remarks):

    [That] may not be backward-compatible with the specification, but it
    is backward-compatible with reality.  -- Francois Audet

There is a very real question of the degree to which any existing
system validates data that are considered to be "generic URIs".  If
systems in practice don't validate URIs beyond "having a scheme", then
we are free to update the URI syntax (and consequently the URN syntax)
very broadly.

Similarly, we have to worry about changes to the syntax of the "urn"
scheme.  RFC 2141 states that '/', '?', and '#' are "reserved for
particular purposes".  But in fact, they don't appear in the BNF, so
the "urn" definition can't be expanded to include them without risking
breaking any software that validates URNs against the BNF of 2141.
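
For a sense of what such software might look like, here is a sketch
of a strict validator.  The pattern is a simplified one written for
this message, following the reading above in which '/', '?', and '#'
are excluded from the namespace-specific string; it is not a
transcription of the RFC's BNF, and the function name is my own.

```python
import re

# Assumed, simplified URN pattern: "urn:", a namespace identifier of
# letters, digits and hyphens, then a namespace-specific string that
# admits neither "/", "?" nor "#".
_URN_PATTERN = re.compile(
    r"urn:[A-Za-z0-9][A-Za-z0-9-]{0,31}:"
    r"(?:[A-Za-z0-9()+,\-.:=@;$_!*']|%[0-9A-Fa-f]{2})+",
    re.IGNORECASE,
)


def is_strict_urn(candidate: str) -> bool:
    """Return True if the whole string matches the assumed pattern."""
    return _URN_PATTERN.fullmatch(candidate) is not None


print(is_strict_urn("urn:isbn:0451450523"))   # True
print(is_strict_urn("urn:example:a/b"))       # False
print(is_strict_urn("urn:example:a?x=1"))     # False
print(is_strict_urn("urn:example:a#part"))    # False
```

A validator of this kind would reject any URN that starts using the
three characters, however carefully the new usage is specified.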

Again, the degree to which this is a practical problem is not clear.

5) Equality testing

One feature of RFC 3986 that may have turned out to be a bad idea is
that testing for equality of two URIs cannot be done without specific
information regarding their scheme.  3986 does define that certain
URIs (that is, certain character sequences that conform to the BNF)
must be "the same" (and thus have equivalent functionality for all
purposes).  But each scheme is permitted to define equality in a
coarser sense, that is, to combine those groups of equal URIs into
larger groups.

This makes it impossible for a processor that lacks specific knowledge
of the URI schemes involved to index something by URI and still handle
all equal URIs in the same way.  However,
it's not clear how much that matters in practice -- most URI schemes
have de-facto canonical forms.
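
A short sketch of the problem, using the "urn" scheme's own rule
(from RFC 2141's lexical equivalence section) that the namespace
identifier is case-insensitive.  The two key functions are
illustrative names of mine, and the generic one applies only a
subset of the scheme-independent normalizations in RFC 3986:
lowercasing the scheme and uppercasing the hex digits of percent
escapes.

```python
import re


def generic_key(uri: str) -> str:
    """Index key using only scheme-independent normalization."""
    scheme, sep, rest = uri.partition(":")
    rest = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), rest)
    return scheme.lower() + sep + rest


def urn_key(uri: str) -> str:
    """Index key that also applies the urn rule: the NID ignores case."""
    scheme, nid, nss = generic_key(uri).split(":", 2)
    return f"{scheme}:{nid.lower()}:{nss}"


a = "URN:foo:a123,456"
b = "urn:FOO:a123,456"

print(generic_key(a) == generic_key(b))  # False
print(urn_key(a) == urn_key(b))          # True
```

A processor that indexes by generic_key alone files these two URNs
under different keys, although the scheme considers them equal.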

It's also not clear that this situation can be avoided if we want to
regularly incorporate pre-existing identifier systems as URI schemes,
as other identifier systems frequently have their own rules for
identifier equality.

Dale