Re: [urn] Suggested changes to rfc2141bis-14

John C Klensin <john-ietf@jck.com> Tue, 24 November 2015 19:24 UTC

Date: Tue, 24 Nov 2015 14:24:33 -0500
From: John C Klensin <john-ietf@jck.com>
To: "Hakala, Juha E" <juha.hakala@helsinki.fi>, urn@ietf.org
Message-ID: <1C9DCE198CCDA9BBDB85561E@JcK-HP8200.jck.com>
In-Reply-To: <AMSPR07MB4540C7EF31B548ABA068AD1FA060@AMSPR07MB454.eurprd07.prod.outlook.com>
References: <AMSPR07MB45438F00E2C6E96184F112BFA1A0@AMSPR07MB454.eurprd07.prod.outlook.com> <137731ACF712EA751660E871@JcK-HP8200.jck.com> <AMSPR07MB4540C7EF31B548ABA068AD1FA060@AMSPR07MB454.eurprd07.prod.outlook.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/urn/PB7RPCOokn6a70dWdwkwHonqpWA>
Cc: Tobias Weigel <weigel@dkrz.de>


--On Tuesday, November 24, 2015 07:54 +0000 "Hakala, Juha E"
<juha.hakala@helsinki.fi> wrote:

> Hello John,
> 
> A few comments and answers below. 

>...
>> > 1. rfc2141bis-14 states in chapter 3.3.1 that "q-component
>> > SHALL NOT be taken into account when resolving a URN to a
>> > URL".
>> > 
>> > There are two problems with this:
>> > 
>> > First, a URN can be resolved to 1-n URLs.
>> 
>> Or, more accurately, 0-n URLs.  I had already picked this up,
>> although I like your reasoning and suggestion better.
> 
> 0-n is fine with me. I was only thinking about URNs that are
> actionable. 

And I was thinking about both those you would describe as
non-actionable and those actionable ones that return things
other than URLs.   In any event, text has been adjusted.

>> > Second, depending on what service is being requested, it
>> > may be necessary to resolve a URN to different URLs. For
>> > instance, if a user wants the identified resource, the URL
>> > will be the one belonging to a digital asset management
>> > system holding the document. But if just the descriptive
>> > metadata about the resource is requested, the most
>> > appropriate URL may be the one of the national bibliography.
>> > 
>> > I suggest the following modification of 2141bis:
>> > 
>> > "As described under Section 4, the q-component SHALL NOT be
>> > taken into account when determining URN equivalence.
>> > 
>> > Similarly, the q-component MUST NOT be used to communicate
>> > service related requests and parameters to a resolution
>> > service.
>> 
>> I'm a little reluctant to make that a MUST NOT, which is a
>> very strong requirement.  It might actually be appropriate
>> when the URN can (and will) be resolved to exactly one URL
>> (and not multiple URLs, selected URLs, or anything else),
>> but, if that was the intended meaning, someone should speak
>> up so I can try to make it a lot more clear.   

> I'm a little reluctant to make decisions based on an
> assumption that some URN can and will be resolved to one and
> only one URL. Even if at first there is just one URL to
> resolve to, there may eventually be several if the document is
> preserved for long term. Documents are copied from
> institutional repositories to digital archives, with
> accompanying metadata. 

I think we may be agreeing but confusing each other with
different wordings.  Let me try this a different way; if you
agree, I'll try to tune the text (I am not proposing what
follows as text for 2141bis even if it sounds like it might be
used that way... unless people believe that including it would
provide significant and useful clarification):

The "value" of a URN, also known as its result or the result of
"resolving" it, may be:

 (1) Nothing at all, if the whole namespace or some NSS
	value within the namespace is intended as a pure
	identifier (also known as a URN that is not actionable),
	i.e., to be compared for equality only, not "resolved".
	 
 (2) A single URL, identifying or pointing to a single
	instance of a function or retrievable object.  An
	r-component might be used as a selector for that
	instance, e.g., to specify the difference between
	URLs identifying a data object and metadata.
	
 (3) One of a family of possible URLs.  Each member of
	the family identifies or points to an instance of a
	function or retrievable object.  Each of those functions
	or objects is considered equivalent to all the others,
	using namespace-specific rules.  Note that equivalence
	among objects or functions has nothing to do with
	whether two URNs are equivalent.  The selection model
	for which URL from a family is chosen is far outside the
	scope of 2141bis but might include network distance,
	possible speed of retrieval, cost, or criteria specified
	as part of an r-component.
	
 (4) More than one URL, with each one being treated as a
	member of the same family and hence equivalent to each
	of the others, as above.  The difference between this
	case and (3) is that, in (3), the URN resolution or
	mapping mechanism determines which URL will be most
	appropriate for the user/client while, for this case,
	the determination is at least partially made by the
	client after receiving multiple URLs.  As with (3),
	r-components and other information might be used to
	narrow the range of URLs that are returned.  In some
	cases, (3) is just a special case of this one in which
	exactly one URL is returned rather than one or more.
	
 (5) More than one URL, with the collection identifying
	the elements of a multi-part object, function, or a
	combination of objects and functions.   "Multi-part" is
	a function of the namespace and objects in it and might
	include a file and separated metadata.  Any of the URLs
	might be either unique to the URN (as in (2)) or a member
	of a family (as in (3)).  An r-component might be used
	to select the elements for which URLs were desired as
	well as specifying URL criteria for a given element as
	in (3) or (4).
	
 (6) Either objects or identifier-like locators that are
	not URLs.  Nothing I can find in 2141 or WG conclusions
	about 2141bis requires that URLs are the only possible
	actionable results of URNs.  These objects or locators
	are otherwise similar to URL values in the categories
	above.   r-components might be used to specify whether a
	URL or an object is returned as well as providing
	selection criteria for particular objects or functions
	within an equivalence set or determine which sets of
	objects are to be returned.

Now, if we can agree that all of those uses of URNs are
appropriate, then we have a basis for both fine-tuning 2141bis
and evaluating standardized resolver specifications.  If we can
reach consensus that some of them are not appropriate uses of
URNs, that moves us forward as well, but I continue to think
that each of us is assuming some cases but not others, and
that is causing confusion about, e.g., "1 or more URLs", which
could refer to any of cases (2) - (5).
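
The case breakdown above can be written down as a small data model.
What follows is only a sketch in Python: the names ResultKind,
ResolutionResult and could_mean_one_or_more_urls are hypothetical and
appear in neither 2141bis nor any resolver specification. It shows
nothing more than why "1 or more URLs" is ambiguous across cases
(2) - (5).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ResultKind(Enum):
    """Hypothetical labels for the six cases in the list above."""
    NOT_ACTIONABLE = 1         # (1) pure identifier, compared for equality only
    SINGLE_URL = 2             # (2) exactly one URL
    RESOLVER_SELECTED_URL = 3  # (3) resolver picks one URL from a family
    EQUIVALENT_URLS = 4        # (4) several equivalent URLs, client picks
    MULTIPART_URLS = 5         # (5) URLs for the elements of a multi-part object
    NON_URL = 6                # (6) objects or locators that are not URLs


@dataclass
class ResolutionResult:
    kind: ResultKind
    urls: List[str] = field(default_factory=list)


# The cases that the phrase "1 or more URLs" could refer to.
ONE_OR_MORE_URL_CASES = {
    ResultKind.SINGLE_URL,
    ResultKind.RESOLVER_SELECTED_URL,
    ResultKind.EQUIVALENT_URLS,
    ResultKind.MULTIPART_URLS,
}


def could_mean_one_or_more_urls(result: ResolutionResult) -> bool:
    """True when the result is one of cases (2)-(5) and carries a URL."""
    return result.kind in ONE_OR_MORE_URL_CASES and len(result.urls) >= 1
```

Four distinct kinds satisfy the same predicate, which is the source of
the confusion described above: the phrase alone does not say which of
them is meant.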

I _think_, but am not at all sure, that the case list above also
provides a complete list of cases in which r-components are
useable and appropriate.  Anything else is presumably applied
after the URN is mapped into something else according to one of
those categories and is hence a q-component or f-component.

Note also that Case (5) presents a challenge for f-components
because it isn't clear from 3986 whether the f-component belongs
to the first URL that is returned, each of the URLs in the set,
or something else.  For Cases (2) and (3), the answer is
presumably obvious and, for Case (4), it would presumably be
applied to all of them.

> I used MUST NOT to underline the difference between r- and
> q-components. If clients are allowed to send service related
> requests intended for the resolvers in q-components, then
> resolvers must check q-components and determine if they should
> do something with them (beyond just deciding the correct
> target system to which the q-component can be sent
> un-altered). This would complicate resolver implementation a
> lot.    

Indeed.  But, because of the list of the cases above, I don't
think the text you had adequately makes the distinction for all
of them, or probably anything other than cases (2) and (3).

>...
>> > A resolution service MAY parse the q-component in order to
>> > determine an appropriate target system to supply the
>> > requested service. "
>> 
>> Here I get confused.  I have assumed that determining an
>> appropriate [target] system to supply a requested service
>> falls well within the scope of r-components.   Can you explain
>> more about what you had in mind?
> 
> Let us assume that a user wants descriptive metadata about the
> identified resource. This can be done either with r-component
> or with q-component, and both of them may be constructed on
> the fly by the client the customer is using. 
> 
> If r-component is used, it is obligatory to specify the
> service and service related parameters (if any);

I think that is true unless the service is intrinsic to the
namespace definition.   Do you agree or do you think it always
needs to be explicit in the r-component?

> in this case
> the service is (in RFC 2483 syntax) I2C (URI to URC) and the
> parameter is the appropriate metadata format. The client may
> also specify a target system (server) to which the request
> should be sent. In plain text a typical URN resolution request
> could be something like "retrieve from the Library of Congress
> online catalogue a MARC record describing the resource with
> ISBN XXX".

Just to be sure we are talking about the same thing, in terms of
my case breakdown above, you are proposing to use the
r-component to identify both the criteria for selection of a URL
from a family (Case (3) for "Library of Congress online
catalogue") and the particular URL (Case (2) for "the URL for
the particular data or metadata to be returned", in this case
the MARC record).   Wfm, but I sense a slippery slope in which
the Case (2) r-component element might not be compatible with
different values for the retrieval location or system associated
with Case (3).

> It is up to the resolver to modify the r-component
> containing this request into something (such as SRU search
> URI) that a server in the LoC application can support.

Yes, and this is an important point about the URN -> URL mapping
not being a purely mechanical operation that can be carried out
in the same way by every resolver and/or client.

> Eventually some systems may be able to deal with "native"
> r-components, but at first this will not be the case.   
> 
> Note that "hard-coding" the target system in the r-component
> may be risky, because the preferred target system may no
> longer exist or it may not be accessible because of a paywall
> or technical issues. The resolver may be aware of 1-n other
> systems that can supply the requested service, so it must be
> able to override the client's preference when necessary.

I'd hope that (even the idea of overrides) could be
per-namespace because it seems to me that the alternative takes
us back to a universal resolver-finder.  Otherwise, if nothing
else, we'd be creating a huge security hole, equivalent to the
"right" of DNS servers to rewrite queries or responses to their
convenience -- a right DNSSEC is intended to eliminate.

> If a client relies on q-component it may construct an SRU
> query, and pass it to the resolver which then sends it
> unaltered to an appropriate target system. The target server
> may be hardcoded in the URN, as in this example:
> 
> urn:isbn:9789522222725?http://lx2.loc.gov:210/lcdb?version=1.1
> &operation=searchRetrieve&query="urn:isbn:9789522222725"
> &startRecord=1&maximumRecords=1&recordSchema=dc
> 
> In this case, URN resolution does not provide much added
> value. The q-component of the URN is just passed on to the SRU
> server in the Library of Congress online catalogue in order to
> retrieve a Dublin Core record of the identified book. Even in
> this case the resolver needs to pass the q-component on. But
> at least in principle the client could omit the server
> information and leave it to the resolver to use its own
> preference for SRU protocol. Or, if the server address is
> known to be invalid, the resolver can replace it with
> something else.

Yes.  But it seems to me that your two examples could be used to
prove that the distinction between the appropriate uses for
q-components and r-components is not nearly as clear as we have
been assuming or pretending.
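
For concreteness, the pass-through behaviour in the quoted example can
be sketched in Python as below. This is an illustration under stated
assumptions, not a description of any existing resolver: it assumes,
as the example URN does, that the first "?" introduces the q-component
and that "#" introduces the f-component, and the function names
split_urn and pass_through_request are hypothetical.

```python
from typing import Optional, Tuple


def split_urn(urn: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a URN string into (assigned name, q-component, f-component).

    Assumption: '#' starts the f-component and the first '?' before it
    starts the q-component, matching the example in the quoted text.
    """
    before_fragment, hash_sep, f_component = urn.partition("#")
    assigned_name, q_sep, q_component = before_fragment.partition("?")
    return (
        assigned_name,
        q_component if q_sep else None,
        f_component if hash_sep else None,
    )


def pass_through_request(urn: str) -> Optional[str]:
    """Return the q-component unaltered, as a resolver would forward it."""
    _, q_component, _ = split_urn(urn)
    return q_component


# The example URN from the quoted message, joined into one string.
EXAMPLE = (
    "urn:isbn:9789522222725?http://lx2.loc.gov:210/lcdb?version=1.1"
    '&operation=searchRetrieve&query="urn:isbn:9789522222725"'
    "&startRecord=1&maximumRecords=1&recordSchema=dc"
)
```

Here pass_through_request(EXAMPLE) yields the whole SRU request string
and the resolver adds nothing to it. A resolver that substituted its
own server address would have to parse and rewrite that string, which
is r-component-like work and part of why the boundary looks blurry.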

> Current URN resolvers are not smart like this; none of them
> can deal with q- or r-components yet. This "intelligence"
> needs to be built into them, and we must be careful not to
> impose any artificial limitations on this, like hardcoded
> server addresses which cannot be overridden.  

I think a distinction is important between what the standard
requires and what some namespace might choose to allow.  For the
standard to require "hardcoded server addresses which cannot be
overridden" would clearly be a disaster.  However, I can imagine
namespaces that are sufficiently tied to one set of servers
that, while the URNs for that namespace are persistent,
inability to access that server makes that namespace useless for
anything but identifier purposes.  I wouldn't want to encourage
such cases, but I think it would be dangerous to forbid them.

>...
>> > 3. In chapter 3.3.3, the draft currently says:
>> > 
>> > "When a URN containing an f-component resolves to a URL, the
>> > f-component from the URN is copied verbatim into the
>> > fragment of that URL."
>> > 
>> > and a bit later:
>> > 
>> > "Similarly, the f-component MUST NOT be passed to
>> > resolution servers when querying them for resource
>> > locations or metadata."
>> > 
>> > The problem with the former sentence is that the URN does
>> > not always resolve to the URL of the identified resource.
>> > If the result is an information page describing the
>> > resource or a metadata record, adding a fragment does not
>> > make sense. And IMO the latter sentence is meaningless
>> > since resolution servers should a priori ignore
>> > f-components no matter what service is being requested. Of
>> > course, web browsers usually do not send fragments at all
>> > to HTTP servers, so at least in the case of HTTP it would
>> > be a bad idea to use f-component in the resolution process.
>> > 
>> > We are better off saying for instance:
>> > 
>> > "If a URN containing an f-component resolves to a URL of
>> > the named resource, the f-component from the URN can be
>> > applied (usually by the client) verbatim as the fragment of
>> > that URL.
>> 
>> Ok.  But see the comment below the next paragraph and the
>> separate "metacomment" note.

And the comment about fragments and Case (5) above.

>> > Clients SHOULD NOT pass f-components to resolution servers.
>> > If a URN containing an f-component is received by a
>> > resolution server, the server SHOULD ignore the f-component
>> > when processing the URN.
>> 
>> I know that just about everyone who thinks about this has a
>> particular processing model and even a style of API in mind,
>> but I think we need to be very careful to not build those
>> models into the text unless (i) that is a requirement we want
>> to make and (ii) we have clear consensus about it.  We have a
>> convention (not always followed) that one should not say
>> "SHOULD" (or "SHOULD NOT") without at least a general
>> description of the exception case(s).    The above could be
>> improved considerably by saying something like
>> 
>> 	"Clients SHOULD NOT pass f-components to resolution
>> 	servers unless those servers also perform object
>> 	retrieval and interpretation functions."
>> 
>> and then either hope that "when processing the URN" is
>> interpreted very narrowly or put in some further qualifying
>> language.  In that context, what do you (and others) think?
> 
> There may be cases in which fragments have a more important or
> different role from what RFC 3986 assigned for them. For
> complex objects such as research data sets fragments may be
> essential for retrieving what is needed. But it is target
> systems which are aware of this. I don't know if there is any
> need to extend this kind of knowledge about fragment usage to
> URN resolvers as well. But your version of the text is OK,
> since it allows yet another possibility to make resolvers
> smarter. 

Good.   FWIW, we have been beaten up very thoroughly about roles
for fragments that are different from the RFC 3986 discussions.
But, if we can move on without opening that issue up again, I
think we should do so.
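
The client-side handling discussed above (withhold the f-component
from the resolution server, then apply it verbatim to the resulting
URL) might look roughly like the following Python sketch. It assumes
the simple case in which the resolver returns a single URL of the
named resource; the function name resolve_with_fragment and the
resolver callable are hypothetical stand-ins, not part of any
specified API.

```python
from typing import Callable, Optional


def resolve_with_fragment(
    urn: str, resolver: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Resolve a URN without sending its f-component to the resolver.

    Assumption: `resolver` maps a URN string to a single URL of the
    named resource, or None if the URN cannot be resolved. The
    f-component is applied only on the client side.
    """
    bare_urn, hash_sep, f_component = urn.partition("#")
    url = resolver(bare_urn)  # the resolver never sees the f-component
    if url is None:
        return None
    if hash_sep and f_component:
        return url + "#" + f_component  # copied verbatim as the fragment
    return url
```

If the resolver instead returned a metadata record or an information
page, appending the fragment would not make sense, so a real client
would need to know which kind of result it received before doing this.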

>...
>> but one could equally well have (although I hope we don't)
>> 
>>    urn:isbn-metadata:xxx-yxx-...
> 
> In practice, this namespace will not get registered, since
> having ISBN embedded in the NID would create confusion.
> Metadata record identifiers have nothing to do with ISBNs;
> they tend to be database internal and if not, NBN may be used.
> So URNs for metadata records may be something like
> urn:nbn:xxx.  

You hope it won't be registered.   I hope it won't be registered
(and agree with your reasoning).  But the new registration
procedure does nothing to prevent that if someone is determined
to do so, no matter how misguided we think they are.

>> Now, it seems to me that, at least if the metadata are
>> returned in a form that has a media type, it would be equally
>> rational to put
>>     # author
>> 
>> at the end of either of those URN-strings.  But I think your
>> language prohibits that for one case and not the other.  Or
>> maybe it prohibits the fragment for both.
> 
> Resources and resource metadata tend to have different
> fragments, or the same fragment would give different results. It
> is OK to do this:  
> 
> urn:isbn:xxx#chapter2 
> 
> if the e-book media type supports this. And it is OK to do
> this (if the metadata format allows it): 
> 
> urn:nbn:xxx#author
> 
> to find out who the author of the book is. But usually you
> cannot use metadata related fragments for the resource itself,
> or resource related fragments for metadata, since that would
> either not make sense at all, or would yield unpredictable
> results. There are no chapter 2's in metadata records. 

Understood.  I think that there are two other examples (both odd
cases) related to your examples above.    First, if "xxx" is a
compendium of some sort, then there might easily be separate
metadata for each chapter.  In that case the metadata for such a
compendium might well be organized by chapter and either of the
examples above (and, shudder, 

   urn:nbn:xxx#Chapter2#author )

might all make perfectly good sense.  I think the implication of
this is that some fragments are going to be nonsense when
applied to particular objects, even when those fragments are
attached to / part of ordinary HTTP URLs.   URNs might make that
situation a little bit worse, but, IMO, not much.


> Meaningful use of fragments is difficult if the user has no
> idea of what fragments are available in the identified
> resource. So it is possible that fragments will usually be
> applied when people cite publications. This is safer than
> using URL + fragment as long as the identifier is applied to
> a single manifestation of the resource only.

Absolutely.   But, again, not something I think we can restrict
in 2141bis.

best regards,
    john
