Re: [urn] Benjamin Kaduk's Discuss on draft-hakala-urn-nbn-rfc3188bis-01: (with DISCUSS and COMMENT)

Hello,

comments inline. 

-----Original Message-----
From: urn <urn-bounces@ietf.org> On Behalf Of John C Klensin
Sent: Saturday, June 9, 2018 23:21
To: Benjamin Kaduk <kaduk@mit.edu>; Peter Saint-Andre <stpeter@mozilla.com>
Cc: urn@ietf.org; The IESG <iesg@ietf.org>; draft-hakala-urn-nbn-rfc3188bis@ietf.org
Subject: Re: [urn] Benjamin Kaduk's Discuss on draft-hakala-urn-nbn-rfc3188bis-01: (with DISCUSS and COMMENT)

--On Friday, June 8, 2018 15:32 -0500 Benjamin Kaduk <kaduk@mit.edu> wrote:

>> The semantics of r-components are yet to be defined. I would venture 
>> that the IETF is probably not the right place to do that work, given 
>> how little energy remained in the URN WG at the end (and we probably 
>> didn't have the right people in the room in the first place).

Juha: in order to avoid chaos, the URN user community needs a centrally maintained registry of resolution services and the parameters related to them. As far as I am concerned, r-component syntax and semantics should not be user- or namespace-specific; they have to apply to all namespaces.  So if somebody establishes r-component syntax for requesting a Dublin Core metadata record about the identified resource, the components used should be registered. And before creating an r-component, URN users should check the registry to see whether the required components (service and parameters) already exist.

While writing the I-D, my assumption was that r-component usage will only start once there is an agreement on r-component syntax, and there is a central registry for services specified. Developing syntax should not be rocket science; what we need is a way to specify services and parameters related to them in a machine readable way. 

Assuming that each r-component is allowed to contain one and only one service and 0-n parameters related to it, syntax might look like this: 

s=<service>&<parameter1>=<value>&<parameter2>=<value>&...&<parametern>=<value>

for instance: 

s=URC&format=DC

to request metadata about the identified resource in Dublin Core format.  I am sure that people who are more technically oriented than myself will come up with something better than my example, but the components (services and their parameters) should be the same. 
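
A minimal sketch of how such a syntax could be parsed and checked against a registry, assuming the "s=<service>&<parameter>=<value>" form suggested above. Nothing here is standardized: the registry contents, the function names and the rule that "s" must appear exactly once are assumptions made for illustration only.

```python
from urllib.parse import parse_qsl

# Hypothetical registry contents: service name -> allowed parameter names.
# No such registry exists yet; "URC" and "format" come from the example above.
REGISTRY = {"URC": {"format"}}


def parse_r_component(text):
    """Split 's=<service>&<param>=<value>...' into (service, parameters)."""
    # strict_parsing rejects fields that lack an '=' sign.
    pairs = parse_qsl(text, keep_blank_values=True, strict_parsing=True)
    services = [value for key, value in pairs if key == "s"]
    if len(services) != 1:
        raise ValueError("an r-component must name exactly one service")
    parameters = {key: value for key, value in pairs if key != "s"}
    return services[0], parameters


def is_registered(service, parameters, registry=REGISTRY):
    """Check that the service and all parameter names appear in the registry."""
    allowed = registry.get(service)
    return allowed is not None and set(parameters) <= allowed


service, parameters = parse_r_component("s=URC&format=DC")
print(service, parameters, is_registered(service, parameters))
```

A resolver built this way could refuse or ignore r-components whose service or parameters are not in the shared registry, which is the interoperability point made above.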

Generic r-component syntax can be specified without any knowledge about the resolution services to be supported, but service specific details may only be provided by experts who know how current applications operate. I do hope that IETF experts can assist with syntax definition; once that (and the registry) is in place, URN user communities can start providing service level specifications. 

Users of persistent identifiers (DOI, Handle, ARK, URN) are all currently under pressure to enrich the functionality of resolvers. Unless a central (and shared) registry of resolution services is established, there is a clear danger that each identifier system will develop its own solutions, which will seriously limit interoperability between persistent identifier systems. 

> I won't argue with that.  Does it make sense to say something like 
> "There are not currently any broadly accepted semantics for 
> r-components at the time of this writing which may be grounds to be 
> cautious with their use" in this document?

Juha: such text can be added as a clarification. Since my assumption has been that the r-component will not be used at all before we have generally approved syntax and semantics for it, I might use an even stronger formulation than just "cautious". But as noted above, I did not say anything about this because I thought that non-usage of the r-component applies automatically to all URN namespaces as long as r-component syntax is still work in progress. It seems a bit redundant to repeat the same thing in all namespace registrations.

Having said that, I do hope that the syntax will be formally specified soon, so that the URN user communities can start adding service level specifications into the central registry (which should also be there from the beginning).  

>> >    If an NBN identifies a work, descriptive metadata about
>> >    the work SHOULD be supplied.  The metadata record MAY
>> >    contain links to Internet-accessible digital
>> >    manifestations of the work.
>> > 
>> > This left me confused.  Is it only intended to apply in the case 
>> > described in the previous paragraph, where the resource identified 
>> > by the NBN is not available in the Internet?  Or does it always 
>> > apply, forcing the metadata to take precedence over delivering the 
>> > actual work?  (Or maybe I'm just confused, and there's an easy way 
>> > to deliver both metadata and the actual work alongside each other 
>> > with no ambiguity.)
>> 
>> Juha can clarify this.

Juha: I can understand why this confuses people. Sorry, this bit is library specific and hard to understand unless the reader knows our practices. 

Work itself is immaterial. There may be 0-n manifestations of it, some of them hand-held, some digital. 

Work is an umbrella concept with which it is possible to bring together all existing manifestations. In practice this can be done by e.g. adding to the work metadata record links to the metadata records describing these manifestations. A practical example of a similar practice is a splash page describing a research data set ("work level metadata"). Such pages often contain links to all versions of the relevant data set, with manifestation level metadata such as appropriate warnings if some versions of the data set are very large.

A user who requires metadata about a work may not even know if there are digital manifestations of the work. But with the metadata record the user will be able to find this out, and select the version which suits his/her needs best.   
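
The relationship between a work record and its manifestations can be sketched as a small data structure. This is only an illustration of the idea described above; the class names, field names, identifiers and URL are all made up and do not come from any library system or metadata format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Manifestation:
    identifier: str             # identifier of the manifestation record (made up)
    medium: str                 # "digital" or "physical"
    url: Optional[str] = None   # set only for Internet-accessible manifestations
    note: str = ""              # e.g. a warning about a very large file


@dataclass
class WorkRecord:
    nbn: str                    # NBN assigned to the immaterial work
    title: str
    manifestations: List[Manifestation] = field(default_factory=list)

    def digital_manifestations(self):
        """Return the manifestations a user could retrieve over the Internet."""
        return [m for m in self.manifestations if m.medium == "digital" and m.url]


work = WorkRecord(
    nbn="urn:nbn:xx:example-work-1",  # hypothetical identifier
    title="Example data set",
    manifestations=[
        Manifestation("urn:nbn:xx:example-m1", "physical"),
        Manifestation("urn:nbn:xx:example-m2", "digital",
                      url="https://example.org/m2", note="very large file"),
    ],
)
for m in work.digital_manifestations():
    print(m.identifier, m.url, m.note)
```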

>> > Section 4.1
>> > 
>> >    National Bibliography Number (NBN) is a generic term
>> >    referring to a group of identifier systems administered
>> >    by the national libraries and institutions authorized by
>> >    them.
>> > 
>> > "the national libraries" implies a specific set -- which ones?  It 
>> > may be better to hedge with "some national libraries".
>> 
>> Or remove "the" ... "by national libraries".

> That's probably better :)

That would be my preference, but Juha should decide on this.

Juha: by national libraries is better. 

>> > Section 4.2.2
>> > 
>> > Do we need to say anything about a URN-to-URI step before talking 
>> > about URI-to-resource services?

Given what 3986 has to say, a URN-to-URI step would be an oxymoron.  If you meant a URN-to-URL step, that is probably a matter for 8141 and it may be worth pointing out that members of the web community (a euphemism for a particular, mostly known, set of individuals who claim to speak for that community in case you haven't figured that out) have been violently opposed to such text, claiming that, if it is needed, then there is really no need for URNs.  On the other hand, while the URNBIS WG could not reach consensus on any particular proposal and did reach consensus about not trying to proceed with definitions, that is much of what r-components are expected to be about.

>> > I'm also wondering about any relationship between "component 
>> > resource" NBNs and f-components of the containing work.  If there 
>> > are NBNs assigned to both an image within a work and that
>> > containing work, and an NBN with f-resource is used to refer to the 
>> > image within the containing work, is there any relationship between 
>> > the f-resource and the image-specific NBN?

On a per-sub-namespace basis, possibly.  In the general case, maybe.  This is not an NBN issue but an issue about how namespaces are managed, organized, and used, i.e., probably an 8141 issue.

Juha: with NBN, national libraries have a free hand to specify their own naming policies. They may give just one identifier for e.g. an entire EPUB 3.0.1 e-book, or they may assign separate identifiers for all the component parts of the e-book. The best policy depends on many things, including the level of control & access required / possible.

The NBN specification does not set limits on what can / should be done. If the library prefers to use the f-component to identify for instance images in a PDF document, that is fine.

>> > Section 4.3
>> > 
>> >    Expressing NBNs as URNs is usually straightforward, as
>> >    only ASCII characters are allowed in NBN strings.  If
>> >    necessary, NBNs MUST be translated into canonical form
>> >    as specified in RFC 8141.
>> > 
>> > When is it necessary?
>> 
>> It seems that in theory an NBN itself could contain non-ASCII 
>> characters, whereas an NBN URN and its nbn_string construct can 
>> contain only ASCII characters. At least that is my understanding.

That is correct.  But, more or less per 3986, _any_ URI can contain non-ASCII characters in the tail by %-encoding them.  There were some moves in the URNBIS WG to restrict that for URNs, but they met resistance from the usual suspects.  The bottom line here, and I don't know how loudly to say it, is that using non-ASCII characters in nbn_strings would probably be nothing short of stupid, especially given that both IETF and W3C have suggested that they be avoided in identifiers non-specialist end users are not expected to see.  However, due to a problem that goes back well before the early decision that ISO 8859-1 was going to be an adequate encoding for HTML content (but of which that decision is symptomatic), it would be unsurprising if one or more national libraries whose local language uses Latin script with a few lightly-decorated characters had not taken that advice or had decided to incorporate existing (perhaps for decades) identifier strings with a few Latin characters outside the ASCII subset into their national NBNs.  One could imagine rewording the text mentioned above for more clarity (a job I will happily leave to the experts who make up the RFC Editor function) but the bottom line is that all we do is to say "don't do that, but if you decide to do it anyway, this is what you must do to prevent even worse problems".

Juha: all NBNs I have seen so far have contained just (printable) ASCII characters. But outside Europe there may be national libraries which have been more liberal. If so, non-ASCII characters in their NBNs must be %-encoded when these NBNs become URN:NBNs. 

Nobody knows if there are NBNs with non-ASCII characters, and if so, how common they are. Therefore I decided to drop the recommendation that such characters should be avoided in NBN strings. 

In order to clarify text, beginning of 4.3 could be edited into: 

Expressing NBNs as URNs is straightforward if NBN strings contain only ASCII characters. Non-ASCII characters, if any, MUST be translated into canonical form as specified in RFC 8141. 
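
As an illustration of the %-encoding step, here is a minimal sketch. It assumes the NBN string is encoded as UTF-8 before %-encoding and leaves every ASCII character untouched; the function name and the sample string are made up, and the sketch does not check the result against the nbn_string ABNF.

```python
def percent_encode_non_ascii(nbn_string):
    """%-encode each non-ASCII character as its UTF-8 octets, keep ASCII as is."""
    encoded = []
    for ch in nbn_string:
        if ch.isascii():
            encoded.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet, using uppercase hex digits.
            encoded.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(encoded)


# Hypothetical NBN string containing one non-ASCII character.
print(percent_encode_non_ascii("diva-28437ä"))
```

An NBN string that is already pure ASCII passes through unchanged, which matches the "straightforward" case in the proposed text.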

>> >    Being part of the prefix, sub-namespace identifier
>> >    strings are case- insensitive.  They MUST NOT contain
>> >    any hyphens.
>> > 
>> > This MUST seems to just duplicate a syntactic requirement from the 
>> > ABNF; is RFC 2119 language really necessary?

>> /me shrugs

Probably not, but, while Juha should confirm, I assume that part of the origin of this text is that several other International Standard identifiers, e.g., ISBNs, allow hyphens and treat them as optional.  It might be wise to reinforce the message that the URN:NBN solution to the problems that causes is to clearly say "no" and say that clearly enough that even those whose eyes glaze over at ABNF will get the message.  Whether it is better done by something like the sentences above or by saying "Hyphens are prohibited by the ABNF, see Section XXXX" is, IMO, a matter of editorial style and preference.

Juha: I believe this may be an error inherited from RFC 3188. The forbidden character should be colon. 

URN:NBNs with sub-namespaces look like this: 

urn:nbn:se:uu:diva-284370

This is a Swedish URN:NBN assigned by Uppsala University.  Organizations which have a sub-namespace may divide their sub-namespace further if necessary, using colons (e.g. Uppsala could create a namespace ID urn:nbn:se:uu:thesis:). Given the special role the colon has, sub-namespace identifiers must not contain it, since allowing colons could theoretically cause duplicate assignments. So if there is an nbn:fi sub-namespace nbn:fi:aa, every NID of the form nbn:fi:aa:<string> must be a sub-namespace of nbn:fi:aa.

Changing the specification might cause problems with backwards compatibility had some libraries assigned sub-namespaces with colons in them. I don't think that this is the case. So the next version of the I-D could say "MUST NOT contain any hyphens or colons". 
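
The colon hierarchy and the proposed "no hyphens or colons" rule can be sketched as two small checks. This is an illustration under assumptions, not the ABNF of the I-D: the function names are made up, and comparison is done in lowercase because the prefix is case-insensitive.

```python
def is_valid_subnamespace_label(label):
    """A single sub-namespace identifier: non-empty, no hyphens, no colons."""
    return bool(label) and "-" not in label and ":" not in label


def is_sub_namespace_of(candidate, parent):
    """True if 'candidate' lies below 'parent' in the colon-separated hierarchy."""
    parent_parts = parent.lower().rstrip(":").split(":")
    candidate_parts = candidate.lower().rstrip(":").split(":")
    # Compare whole labels, so "nbn:fi:aab" is not treated as below "nbn:fi:aa".
    return (len(candidate_parts) > len(parent_parts)
            and candidate_parts[:len(parent_parts)] == parent_parts)


print(is_sub_namespace_of("nbn:fi:aa:bb", "nbn:fi:aa"))
print(is_sub_namespace_of("nbn:fi:aab", "nbn:fi:aa"))
print(is_valid_subnamespace_label("uu"), is_valid_subnamespace_label("a:b"))
```

Comparing label by label rather than by string prefix is what makes the colon rule matter: a label containing a colon would be indistinguishable from a deeper level of the hierarchy.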

>> > Section 8
>> > 
>> >    John Klensin provided significant editorial and advisory
>> >    support for late versions of the draft.
>> > 
>> > Presumably that's "later versions"?
>> 
>> Yes.

I really don't care.  If one thinks this is an editorial problem, leave it to the RFC Editor.  If one thinks it is substantive, remember that, while this is a -00 draft, the I-D itself has been through many iterations under other names, so it depends on how you count because I had nothing to do with early versions of the I-D and at most only a reviewer/participant role in RFC 3188.  If this were a different sort of document and I cared, I could make a strong case that I've been involved enough and have written enough text to be listed as a Contributor, but I think the nature of this document is that it is better if Juha is sole author without contributors other than Alfred.

FWIW, I can't see why attribution at this level should be an IESG problem unless you have reason to believe that IPR rules are being violated.

Juha: I think "later versions" is fine. 

Finally, to avoid writing a separate note even though it will make this a paragraph longer, I think several of the comments you (Benjamin) and Adam have made make a strong case for a clarifying update to RFC 8141.  In principle, I agree with that.  It is little surprise to me that new URN namespace proposals are exposing issues that, if we had more ability to predict them, would have been reflected in 8141 itself.  The difficulty with such an update is that, at the time 8141 and 8254 were completed, the URNBIS WG had run out of energy and was developing a level of acrimony that made further progress unlikely.  If we were to try to open 8141 to do a clarification, I can just about guarantee that some of those who were the sources of frustration that led to that acrimony would insist that no document move forward until their pet issues and easy solutions were addressed.  That, in turn, would result in a situation like the i18n one, only with less downside if the issues are not addressed and more issues that would require solving fundamental philosophical disagreements in the IETF community. I can't recommend going there, but it doesn't seem to me that trying to clarify 8141 by text in a single namespace definition is the solution either.

Juha: the library community has been using URN:NBNs successfully since RFC 3188 was published. Tens of millions of identifiers have been assigned. From the libraries' point of view, the important thing is that the revised RFC validates some new URN:NBN assignment practices which we did not foresee when RFC 3188 was written. I do hope that the revision will not get stuck on technicalities or philosophical disagreements which have only minor impact on practical work.

In the long term I do hope that allowing the use of r-, q- and f-components will help libraries and other URN users such as the film industry to build smarter URN resolvers. There is definitely a need for that.

All the best, 

Juha

PS. I volunteer to produce the next version of the I-D, but should I use the txt or XML version? And if the latter, where do I get it (last time I edited the txt version). 
