Re: [urn] Thoughts on fragments, queries, and new URN namespaces

John C Klensin <john-ietf@jck.com> Sat, 15 June 2013 13:16 UTC

Date: Sat, 15 Jun 2013 09:15:53 -0400
From: John C Klensin <john-ietf@jck.com>
To: Keith Moore <moore@network-heretics.com>
Cc: urn@ietf.org
Subject: Re: [urn] Thoughts on fragments, queries, and new URN namespaces

Keith,

It is starting to feel as if we are either reading different
documents with the same names and identifiers or that we are
somehow reading the same documents very differently.  Key
examples and some other discussion inline below.

Note to impatient readers: there is a proposal for specific text
(a proposed new section of 2141bis and update to 1737) at the
end of this over-long note.


--On Friday, June 14, 2013 15:51 -0400 Keith Moore
<moore@network-heretics.com> wrote:

>...
> 1.  RFC 2141 defined a type of identifier which it called
> Uniform Resource Names or URNs.   It also defined a syntax for
> URNs which happens to begin with 'urn:', and rules for
> creating and assigning and not reassigning URNs.   Granted,
> there had been discussions for years prior to that which used
> the term URN more loosely and/or which proposed different
> rules and different syntaxes, but 2141 represents a
> rough-consensus result of those long discussions aimed at
> understanding what URNs really should be.

My reading of draft-ietf-urnbis-rfc2141bis-urn-05 is entirely
consistent with that.  It excludes the other uses of the term
"URN" in favor of talking about 2141-type URNs, contains
explicit statements about minimal necessary changes from 2141,
etc.  Now, if I thought 3986 were as bogus as you apparently do,
or even found it mildly distasteful (which I do), and were
holding the pen on 2141bis rather than Peter, I'd probably try
to structure the introductory paragraphs to sound less like the
main reason for the document was to bring 2141 into conformance
with 3986.  But, even if Peter rephrased things that way, they
wouldn't change the spec significantly in most areas, only the
vocabulary used to describe it.

> 2. For various reasons, some people weren't happy with the URN
> syntax or framework that IETF decided on.   One result was
> that the authors of RFC 3969 tried to broaden the definition
> of URNs in that document.

I assume you mean "3986" in the above and a few places below
that you mention 3969.

Whether that characterization is accurate or not --it goes
almost without saying that some people (maybe exclusively the
same "some people", maybe not) would disagree-- 3986 has now
stood as a full Internet Standard for more than eight years
without any significant challenge to its validity as a
specification or claim to be an Internet Standard.  Given that,
I think URNBis and rfc2141bis are obligated to be consistent
with it, at least unless we want to make the claim that 2141
(and 2141bis) URNs are really not URIs.  I don't see any basis
in the URNBis charter for either challenging the "URNs are URIs"
assumption or for simply ignoring an applicable Internet
Standard that is much later than 2141.


> 3.  Even if you accept (as RFC 3969 states) that the name URN
> applies to things other than those defined in 2141, the
> situation we were left with is both confusing and cumbersome
> to discuss. The identifiers defined in RFC 2141 have unique
> properties, by design, which do not necessarily apply to other
> persistent URI-like identifiers.   And it's cumbersome to
> discuss RFC 2141 identifiers specifically (for the purpose of
> updating RFC 2141, or for any other purpose) while still being
> consistent with the language in RFC 3969.   You end up either
> saying "for the purpose of this document, URN refers to the
> identifiers defined in RFC 2141, language in RFC 3969
> notwithstanding" (inviting confusion from those who miss that
> restriction), or you end up saying something like "URNs as
> defined in RFC 2141" every time you need to refer to that kind
> of identifier.

Yes, that is a bit of an editorial challenge.  I don't see it as
a serious substantive problem because I don't think anyone is
claiming that draft-ietf-urnbis-rfc2141bis-urn should be
anything but 2141bis... with adjustments made to conform
2141-style URNs to the requirements of 3986.  I don't see that
3986 requires draft-ietf-urnbis-rfc2141bis-urn to adopt any
fundamental definition for URNs other than that of 2141 and
don't think the draft contains such a definitional change.  So
I'm not sure where we are disagreeing.

> Regardless of what RFC 3969 says, the identifiers most often
> associated with the term URN are undoubtedly those that begin
> with 'urn:' and have a syntax consistent with RFC 2141.
> Trying to say that there are URNs that don't begin with 'urn:'
> is like saying that there are other HTTP URLs that don't begin
> with 'http:'. Yes, it's true in a sense, but it's just silly
> and confusing and there's no good reason to define things that
> way.

But, as far as I can tell, draft-ietf-urnbis-rfc2141bis-urn-05
doesn't do that and in fact carefully avoids it.  So the above
is either an attack on 3986 (and hence out of scope for the WG)
or just isn't relevant.

> Also, regardless of what RFC 3969 says, URNs as defined in RFC
> 2141 weren't designed to be used with fragment identifiers or
> query strings - or at least, we didn't manage to define how
> that would work, and the reason that we didn't define how that
> would work is because there wasn't an obvious interpretation
> that didn't kill the properties we wanted for URNs.   So we
> declined to do that in RFC 2141, and in trying to generalize
> URI syntax, RFC 3969 didn't address those issues either.

And here we get to what I think are two of the core issues.
Let's separate them into two questions:

	(i) If there were no syntax restrictions imposed by
	2141, 3986, or anything else, would fragment identifiers
	and/or queries be appropriate for URNs?
	
	(ii) Is 3986 required to authorize fragments or queries
	in the URN syntax and, if so, is that a serious and
	problematic incompatible change?

Let me address the second here and the first below.

You read the statements in 2141 (or perhaps some oral tradition
to which the rest of us don't have access) as prohibiting
fragment identifiers and queries.  When I read what I hope are
the same statements, I conclude that there are lots of ways to
say "prohibit".  One of them involves the term "excluded", which
is exactly what 2141 says about a number of characters in its
Section 2.4.    But what it says about "?" and "#" is to
identify them with purposes defined in RFC 1630 -- no appeal to
presumed 3986 revisionism is required-- and "has not yet debated
the applicability and precise semantics of those purposes as
applied to URNs".  It then says "these characters are RESERVED
for future developments".  To me, those are statements that
imply that those future developments are anticipated, even
though no timeframe is given for them (and they might not
happen).
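
To make the "reserved" versus "excluded" distinction concrete, here
is a minimal sketch in Python. It is not taken from RFC 2141 or from
2141bis. It accepts the 2141-style "urn:" NID ":" NSS shape and merely
reports any use of the characters that 2141 reserves for future
developments, rather than rejecting them. The function names are
hypothetical, the NID pattern reflects one reading of RFC 2141
Section 2.1, and the NSS is deliberately not validated further.

```python
import re

# One reading of the RFC 2141 NID rule: a letter or digit, followed by
# up to 31 letters, digits, or hyphens (assumption: Section 2.1).
NID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,31}")

# Characters that RFC 2141 describes as RESERVED for future developments.
RESERVED_FOR_FUTURE = frozenset("/?#")


def split_urn(text):
    """Split 'urn:<NID>:<NSS>' into (NID, NSS); hypothetical helper."""
    scheme, first_colon, rest = text.partition(":")
    nid, second_colon, nss = rest.partition(":")
    if scheme.lower() != "urn" or not first_colon or not second_colon:
        raise ValueError("expected urn:<NID>:<NSS>, got %r" % text)
    if not NID_RE.fullmatch(nid) or not nss:
        raise ValueError("bad NID or empty NSS in %r" % text)
    return nid, nss


def reserved_chars_used(nss):
    """Return the reserved characters that appear in the NSS, sorted."""
    return sorted(RESERVED_FOR_FUTURE.intersection(nss))


nid, nss = split_urn("urn:book:example-1#chapter-3")
print(nid, reserved_chars_used(nss))
```

A parser built this way treats "?" and "#" as a question left open by
the spec, not as a syntax error, which is the reading argued for
above.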

When the spec explicitly says those sorts of things about future
definition of semantics and future use, it seems to me that
comments like "weren't designed to be used..." (with the
implication of "designed to _not_ be used") are a real stretch.
"didn't manage to define how that would work, and the reason
that we didn't define how that would work is because there
wasn't an obvious interpretation that didn't kill the properties
we wanted for URNs" might be true, but there is no evidence at
all for it in 2141.  All I can get from 2141 is that there
wasn't consensus on particular semantics but that there was no
particular reason to expect that semantics and consensus would
not emerge in the future.  Again, if the intent had been to say
"we concluded that this was impossible without violating
fundamental URN design decisions" or just "really bad idea" how
do you explain 2141 not just saying that rather than talking
about future use?

Your memory or that of others as to what you thought the intent
was at the time notwithstanding, 2141 is now a 16-year-old spec.
I think what it actually says has to be taken at face value,
especially if it seems to contradict what you believe was the
intent.

>...
> 4.  URNs (and when I say URN, I always mean the 2141
> definition), above all else, are intended to be persistent.
> This is the fundamental property of URNs - not only that they
> are persistent, but that the presence of the 'urn:' tag is an
> indicator that the identifier may be interpreted as
> persistent, and also that the persistence of that identifier
> is an important property that should be maintained.
> Extending the concept of URNs in such a way that the resulting
> identifier is no longer persistent would break the essential
> and fundamental property of URNs.

I don't think anyone is disagreeing about that.  At least I'm
not and I don't see anything in the current version of 2141bis
that does either.

>...
> 6. Fragment identifiers as used in existing content-types were
> not designed to be persistent across changes to the document.
> For identifiers to have persistence there needs to be some
> discipline in assigning meaning to them initially and in not
> reusing them in ways inconsistent with their originally
> assigned meanings.   If there is any such convention for
> fragment identifiers, I'm not aware of it, but it certainly
> isn't widely used.   Given the existence and utility of other
> kinds of identifiers which are not persistent and should not
> be so, I also believe that persistent identifiers need to be
> readily distinguished from non-persistent identifiers.
> Again, I'm not aware of a convention for doing this, though I
> hypothesize that one could exist.

I think you are creating a strawman here and then demolishing
it.  There is no question, in my mind, at least, that using,
e.g., a character offset fragment identifier would be truly
stupid in a context that requires persistence, URN or
otherwise.  But that doesn't imply that, for some URN types
(namespaces) stable fragment identifiers cannot be properly
identified and defined.  Second, even after rereading 2141 and
1737, I'm not sure how far one can go in the direction of an
identifier that is "persistent across changes in a document"
because even that statement takes one very far in the direction
of needing a universal theory about what changes are still "the
same document" and what changes make new documents. 

More important, as soon as you say "persistent identifier" and
"changes to object" in the same context, you transport us all
to the edge, not of a rathole, but of a bottomless pit.
"Persistence" of an identifier is clear when there is a single,
unique, object and the only thing that changes is its location.
That is where one of the major threads that led to URNs started
when web objects were considered -- URLs were just not right
when one considered content that might be relocated, unchanged,
from one server (and DNS name) to another.  But, as soon as one
talks about two objects that are alleged to be identical or
changes in one object, the "persistence" and object-binding
validity of the identifier become fairly deep questions that
have plagued archivists, classifiers, and philosophers for
centuries... questions that ultimately have no clear answers
except in the axioms and postulates of object-type-specific
axiomatic systems.  Another key reason why I shut down the
original URI WG in the hope of saving the work was that several
of the efforts that were underway appeared to have comprehensive
solutions to the "can it change and be the same thing" and "are
two objects actually identical" questions in the critical path
of a protocol or identifier type questions -- a WG that could
safely be predicted to go on for years and indeed centuries and
never converge was just not considered acceptable in the IETF of
the time.

Note that the above has nothing to do with fragments or queries.
The problem exists in its full glory with URNs and URN-object
bindings as soon as you say that something _is_ "designed to be
persistent across changes to the document".  Whether fragments
make things any worse depends on what the namespace looks like
and how things are designed.  

Let's take the fairly familiar example of a book.  The
publisher, library, and archival communities have established
conventions about when two copies of a book are "the same".  If
they are "the same" then we would expect either copy to
represent a correct binding (and resolution response) to a given
URN (or instances of that URN that match the 2141 equivalence
rules).  It is important to understand that "the same"
represents conventions and that pushing the limits too hard
leads to confusion and/or high-minded arguments.  For example,
two different editions are almost always considered different
books for identification or classification purposes.  Two
different printings in which a few obvious typographical errors
are corrected in the later one are usually considered the same
book.  If one physical instance of the book is autographed by
the author and another is not, they are the "same book" for many
purposes but are certainly not "the same".   
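
For reference, the 2141 equivalence rules mentioned above are purely
lexical and say nothing about whether two objects are "the same". A
minimal sketch in Python, assuming the usual reading of RFC 2141
Section 5 (the "urn:" prefix and the NID compare case-insensitively,
hex digits in %-escapes compare case-insensitively, everything else
compares exactly), might look like the following. The function names
are hypothetical.

```python
import re


def normalize_urn(urn):
    """Normalize a URN for lexical comparison (sketch, not a validator)."""
    # Raises ValueError if the string does not have two colons to split on.
    scheme, nid, nss = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN: %r" % urn)
    # Lower-case only the hex digits inside %-escapes; the rest of the
    # NSS is left untouched because it is compared case-sensitively.
    nss = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), nss)
    return "urn:%s:%s" % (nid.lower(), nss)


def lexically_equivalent(a, b):
    """True if two URNs match under the lexical rules sketched above."""
    return normalize_urn(a) == normalize_urn(b)


print(lexically_equivalent("URN:foo:a123,456", "urn:FOO:a123,456"))
print(lexically_equivalent("urn:foo:a123,456", "urn:foo:A123,456"))
```

Note that the comparison never decodes %-escapes, so an escaped and
an unescaped comma are different strings under this rule. Whether two
physical copies satisfy the same URN remains a matter of namespace
convention, outside anything such a function can decide.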

The only possible definitive, convention or axiom-free
definition of "the same" requires agreeing that all objects are
unique and that no URN can be satisfied by more than one
volume-object. (Of course, that is pretty close to an axiomatic
statement too, but of a different type than when we talk about
multiple satisfying objects.)  But "all physically distinct
volumes are unique" would, as a rule, make book-URNs useless for
most of the purposes to which one would like to put them.

Now, given that hypothetical book-URN (i.e., "urn:book:..."),
whether a transformation of the content from bound paper form to
electronic (and non-page-image) form is a change that still
allows both forms to satisfy the same URN is another one of
those philosophical questions that can be resolved only by
convention.  If the convention is that they are the same, then
any fragment identifier that utilizes page numbers is obviously
trash (and cannot be "persistent" across the two forms).  But
one that utilizes chapter numbers or even names doesn't make the
URN any less persistent than it would be if fragment identifiers
are not used or not allowed.  

Suppose, instead, that the convention is established that a
translation is still "the same book".  Now that convention would
make me and probably others very anxious but I can find nothing
specific in 2141 or even 1737 that prohibits it. The nervousness
arises from the URN and namespace definition and not from the
presence or absence of fragments.  But it would _constrain_ the
types of fragment identifiers that make any sense at all.  For
example, using the example above, chapter numbers would still be
sensible as fragment identifiers (at least modulo a few i18n
issues) but chapter names almost certainly would not.
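
The two cases above can be sketched as a small Python check. All of
the data is invented for illustration: three hypothetical renditions
of one work, each listing the fragment values it could support. A
fragment kind is treated as usable only if every rendition that the
namespace convention calls "the same book" supports exactly the same
values.

```python
# Hypothetical renditions of a single work; none of this is real data.
RENDITIONS = {
    "print": {
        "page": {"1", "2", "3"},
        "chapter-number": {"1", "2"},
        "chapter-name": {"Origins", "Aftermath"},
    },
    "electronic": {  # reflowable text, so no page numbers at all
        "chapter-number": {"1", "2"},
        "chapter-name": {"Origins", "Aftermath"},
    },
    "translation": {
        "page": {"1", "2", "3", "4"},
        "chapter-number": {"1", "2"},
        "chapter-name": {"Ursprung", "Folgen"},
    },
}


def stable_fragment_kinds(names):
    """Fragment kinds whose values agree across all named renditions."""
    chosen = [RENDITIONS[name] for name in names]
    first = chosen[0]
    stable = []
    for kind in sorted(first):
        # A kind missing from any rendition, or differing in values,
        # cannot be persistent under this convention.
        if all(r.get(kind) == first[kind] for r in chosen):
            stable.append(kind)
    return stable


print(stable_fragment_kinds(["print", "electronic"]))
print(stable_fragment_kinds(["print", "electronic", "translation"]))
```

Widening the convention for "the same book" only shrinks the set of
sensible fragment kinds; it does not make the URN itself any more or
less persistent.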

(Aside for the record: whatever that hypothetical "book" URN
might be and how it might be defined,
draft-ietf-urnbis-rfc3187bis-isbn-urn is not it.  The latter
identifies, among other things, a particular set of conventions
about uniqueness and who gets to determine it together with the
consequent rules about object-bindings.  By doing so, it
"solves" a lot of the more general issues described above but
also covers over some differences that might be important for
other purposes.)

> 7. Thus, the combination of a URN and a fragment identifier
> has no assurance of persistence.   It follows that the
> combination of a URN and a fragment identifier cannot be a URN.

That does not follow at all.  If it does, it leads to the
interesting conclusion that a URN cannot be persistent enough to
be a URN unless it names only a single and unique object with no
possibility of changes to the object itself.    It does follow
that fragment identifiers to be used with (or as part of) URNs
have to be designed with far more care about the nature of the
namespace and what that namespace is used to identify than
fragment identifiers for, e.g., URL-identified web pages have
often been in the past.

> 8. One can argue that the persistence of an identifier
> consisting of a URN and a query string could actually be
> persistent, if the resource named by the URN were defined in
>...

I think that query strings associated with URNs are far more
problematic than fragment identifiers because fragment
identifiers (at least by historical convention) point to
something _within_ an object or some subdivision of it.
Queries, as your example (not quoted here) suggests (at least as
I interpret it), can, in principle, be used to take the rest of
the URI as input and return something completely different.
Defining such a situation in a way that would assure persistence
is hard at best.  

Extending your example a bit and combining it with my
hypothetical "book" URN, one could imagine
   urn:book:....?reverse-citation-index

which would return all of the known books or articles that cite
that book.  Since a new citation could be added at any time, the
practical persistency problems of that query are horrible even
if the query could be well-defined (it can't, at least without a
definition of the sources to be searched, but that is just a
property of my choice of a simple, but sloppy, example).
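
A toy illustration of that persistency problem, as a Python sketch:
the "citation index" here is an invented in-memory dictionary, the
URNs are hypothetical placeholders in the hypothetical "book"
namespace, and the resolve function is not any real resolver API. The
same identifier, query included, yields different answers before and
after a new citation is recorded.

```python
def resolve(urn_with_query, citation_index):
    """Resolve the hypothetical ?reverse-citation-index query."""
    urn, _, query = urn_with_query.partition("?")
    if query != "reverse-citation-index":
        raise ValueError("unsupported query: %r" % query)
    # The answer depends on the state of the index at resolution time.
    return sorted(citation_index.get(urn, ()))


# Invented data: which works cite which.
index = {"urn:book:example-1": {"urn:book:example-2"}}
identifier = "urn:book:example-1?reverse-citation-index"

before = resolve(identifier, index)
index["urn:book:example-1"].add("urn:book:example-3")  # a new citation
after = resolve(identifier, index)

print(before)
print(after)
print(before == after)
```

The identifier string never changed, but what it returned did, which
is the sense in which such a query is hard to reconcile with
persistence.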

> 9. At any rate, existing resources that accept query strings
> do not in general assure persistence of the results of such
> queries.   Thus, in general, a combination of a URN used to
> name an existing resource, and a query string, provides no
> assurance of persistence, and the combination should not be
> considered a URN.

This is where I wish the IETF could ban assertions
that are not backed up by citations or evidence that is visible
to the community.  

Let me state that in a way that more closely aligns with the
reality I've seen and that has been pointed out to me.

	"Some existing resources that accept query strings (or
	fragment identifiers) do so in ways that are
	ill-considered and that do not assure persistence of the
	results.  Other existing resources and uses do and are
	fine.  The question is whether it is appropriate to try
	to ban the latter because there are a certain number
	(even a large number) of bad examples or if we should
	try to define things so that the good cases are allowed
	and we are more clear about why the bad cases are bad.".

> The above, I submit, is reality.   (There are probably some
> other relevant and salient points which are also defensible as
> reality.)

Note that, while our realities may differ, part of mine is that
I am extremely positive that, if the IETF says "don't do that,
we think it is evil", we will mostly be ignored, and fragments
and queries will persist (sic) in things that people will
persist (sic) in calling URNs and using
"urn:namespace:..." syntax to describe.  If we say "the syntax
is valid but one must be really, really, careful about how the
things are used to ensure that persistence is maintained" and,
ideally, explain why that is important, then we will affect the
behavior of at least some of those who are trying to do The
Right Thing.  If we say "don't do it because we said so" and ban
the syntax, it is nearly certain that we will be ignored by
existing uses of fragments and/or queries and by lots of
potential ones.  And saying "even though that thing
conforms to the URI syntax and starts with 'urn:', it isn't a URN
and you are forbidden to call it one" is even more useless.  At
least in my pragmatic, observational, reality.

In the interest of even a weak approximation to brevity, I'll
skip comments on your brainstorming for now.

I think this does suggest that 2141bis needs an additional
section that says something like the following.  I've written
this first cut on the assumption that we will allow fragment
identifiers and queries in the syntax, but I believe, for the
reasons explained above, that most of the material is needed
regardless.  Even if we decide to not allow fragment identifiers
and/or queries, the relevant text below could probably
usefully be adapted into a "why not" explanation (rather than
having to rely on an IETF assertion of authority).  Some of the
other material above may be useful; I'd be happy to see it in
the document if Peter and the WG believe that would help.   I
believe that, if this type of material is added to 2141bis, it
should be explicitly identified as updating RFC 1737 by
clarifying issues associated with the "requirements" of that
document.

	"The notion of 'persistency' of a URN and  its
	relationship to whatever resource it identifies is key
	to the nature of URNs as defined in this document and in
	the original functional specification [RFC1737].  That
	notion and the associated relationships are, however,
	somewhat elusive and are likely to depend, in practice,
	on conventions and the properties of particular
	namespaces.  For example, if one can speak of replicated
	versions of a resource, transformation of a resource
	into a different form without affecting its content or
	nature, or even changes to a resource that don't alter
	what a URN identifies, one must either establish very
	specific conventions or move into the fundamental
	philosophical problem of when two objects can properly
	be considered "the same".  In more practical terms, if
	replicated objects are considered different for some
	purposes but the same for others, the "Global
	uniqueness" criterion of RFC 1737 Section 2 may easily
	be violated, so the conventions about identity and
	uniqueness are important parts of the namespace
	definition even though, in practice, they may be better
	articulated for some namespaces than for others.
	
	"It is important to note that universal conventions are
	almost certainly impossible: there is no reason to
	assume that the conventions that apply for one namespace
	will apply to another.
	
	"These issues are a large part of what make fragment
	identifiers and queries problematic for many URN
	namespaces and create a requirement for very careful and
	namespace-sensitive definitions in the namespaces where
	they are allowed.  A badly-designed fragment identifier
	may be inconsistent with the stability and persistence
	of a putative URN if replication or any changes at all
	to the named object are allowed.  A badly-designed query
	string may require reference to information or
	resolution of objects outside the namespace, thereby
	undermining multiple key URN properties as identified in
	this document, RFC 1737, and elsewhere.  In addition, if
	either were to follow trends common in contemporary usage
	of queries and sometimes fragments in URLs, the
	requirements of Section 3 of RFC 1737, especially those
	for Human transcribability and Simple comparison, could
	easily be violated.
	
	"To the extent feasible, definitions of particular URN
	namespaces should be clear about the relationships
	between the URN, the namespace, and the underlying
	objects; about the implications of replication and
	various changes to the named resources; and, if fragment
	identifiers or queries are allowed, how they should be
	constructed and constrained to preserve identifier
	persistency and to meet the other requirements of this
	specification and RFC 1737."

That is obviously just a first cut, but maybe it will help us
understand at least some of where we disagree, what the problems
actually are, what problems can and cannot be solved (especially
in a general, rather than per-namespace, way), and how to move
forward.

best,
    john