Re: Draft: Universal Document Identifiers

Peter Deutsch <peterd@expresso.cc.mcgill.ca> Thu, 05 March 1992 17:11 UTC

Received: from nri.nri.reston.va.us by ietf.NRI.Reston.VA.US id aa01042; 5 Mar 92 12:11 EST
Received: from nri.reston.va.us by NRI.Reston.VA.US id aa15719; 5 Mar 92 12:11 EST
Received: from kona.CC.McGill.CA by NRI.Reston.VA.US id aa15711; 5 Mar 92 12:11 EST
Received: by kona.cc.mcgill.ca (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA15624 on Wed, 4 Mar 92 17:07:50 -0500
Received: from expresso.CC.McGill.CA by kona.cc.mcgill.ca with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA15607 (mail destined for /usr/lib/sendmail -odq -oi -fiafa-request iafa-out) on Wed, 4 Mar 92 17:06:19 -0500
Received: by expresso.cc.mcgill.ca (NeXT-1.0 (From Sendmail 5.52)/NeXT-1.0) id AA12411; Wed, 4 Mar 92 17:06:04 GMT-0500
Message-Id: <9203042206.AA12411@expresso.cc.mcgill.ca>
In-Reply-To: Tim Berners-Lee's message as of Feb 27, 17:22
From: Peter Deutsch <peterd@expresso.cc.mcgill.ca>
Date: Wed, 04 Mar 1992 22:06:03 -0000
X-Mailer: Mail User's Shell (6.5.6 6/30/89)
To: Tim Berners-Lee <iafa-request@kona.cc.mcgill.ca>, chi-arch@uccvms.bitnet, www-talk@nxoc01.cern.ch, wais-talk@think.com, iafa@cc.mcgill.ca
Subject: Re: Draft: Universal Document Identifiers
Cc: Rare WG3 <rare-wg3@surfnet.nl>, nisi@merit.edu

Tim,

Here's a response to your "Universal Document Identifiers
on the Network" piece. I was late, I was tired, and I
originally sent it to the irtf-rd list. Sigh. Apologies to
anyone who is thus seeing it again.

For anyone who will be at the IETF meeting in San Diego in
two weeks, I've tentatively put the issue of "Universal
Document Identifiers" vs what I call "Universal Document
Serial Numbers" on the agenda for the Living Documents BOF
on Tuesday night.


				- peterd


----------------------------------------------------------------------

g'day all,

This is a note to bring together some thoughts I've had on the
Unique Document Identifier question (UDI question for this
posting). In particular, I have just finished going over
Tim Berners-Lee's recent discussion piece and have some
thoughts and another view of things.

My basic proposal is only half-baked at this point, but
I'd like to circulate it before the IRTF meeting and
Living Docs BOF at the upcoming IETF to see if it rings a
bell with anyone.



The Problem with Unique Document Identifiers
----------------------------------------------

I've had some trouble with the various postings and papers
I've seen discussing the possibility of creating a Unique
Document Identifiers scheme (I remember one by Brewster Kahle,
and the more recent posting by Tim Berners-Lee in particular,
although I'm not citing anyone as causing my troubles - it may
just be me!  :-)

Without trying to repost everything I've seen, and probably
therefore in the process dangerously oversimplifying what
I've read, I'd summarize schemes such as the W3 naming
scheme covered by Tim in his recent paper ("Universal
Document Identifiers on the Network") as revolving around
encoding:

	a) a naming scheme identifier, followed by

	b) an address whose format is dependent upon the
	   naming scheme.


The other UDI schemes I've seen all seem to take a similar
approach. Basically they all appear to be defining a method
for encoding multiple types of "physical addresses" of some
sort, where physical address refers to a way of specifying
enough information in the particular naming scheme to allow the
reader to access the document. These are in effect
"multi-disciplinary document pointers", allowing us to
specify documents in multiple formats.
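To make that two-part structure concrete, here is a minimal sketch in Python. The colon separator, the example scheme name and the `parse_udi` helper are all illustrative assumptions on my part, not part of any proposal in this note:

```python
def parse_udi(udi):
    """Split a UDI-style string into (scheme, address) at the
    first colon; the address format is scheme-dependent."""
    scheme, sep, address = udi.partition(":")
    if not sep or not scheme or not address:
        raise ValueError("expected '<scheme>:<address>'")
    return scheme, address

# Hypothetical host and path, for illustration only.
scheme, address = parse_udi("ftp:archive.example.org/rfc/rfc1296.txt")
```

The point is simply that the identifier tells you *how* to fetch, not *what* you will get.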

The problem I see is that two files on the Internet, stored at
different locations but containing identical information,
would still have two different Unique Document Identifiers.

Now, you can get around this problem by only storing pointers
to original documents wherever possible, but it seems to me
that sooner or later (whether it be lazy site administrators,
the wish to implement mirroring strategies to provide
robustness, or whatever) we end up with multiple copies of the
same document with differing UDIs.

Contrast this with the librarians' ISSN or ISBN numbers.  Each
and every copy of "The Complete Fawlty Towers" by John Cleese
and Connie Booth carries the same ISBN number (ISBN
0-413-18390-4 for those who don't have it yet :-) because each
and every one is merely an instantiation of the same
collection of information (Actually, this is not _quite_ true,
since I believe that the softcover edition would have a
different number, but assuming no softcover edition this is
essentially correct).

I think the distinction between what we have been doing, and
what the librarians are doing is an important one and drives
to the heart of my objections. I have a problem with speaking
about the need for Unique Document Identifiers, and then going
into detail about schemes that seem primarily adapted to
encoding access information without ensuring that multiple
documents are in fact unique.

Although the schemes proposed do solve some people's problems,
they don't seem to be solving the problem that I would want
addressed by UDIs. They do not even follow the model I
would choose to use to address the problem I want solved.

For example, when speaking of the structure of a physical
address, Tim speaks of needing a physical address to allow:

	- a user's program to contact the server
	- The server to search and index or retrieve the document
	- The user's program to locate an individual position 
	  or element within a document.

But what about identifying the actual document as the one I
really want? What about identifying multiple instances of the
same document under different access schemes or encoding
methods? What about differentiating between different
documents with similar "names" on different hosts?

Hopefully we can allow users to do all of this without
actually forcing them to examine the contents of each
candidate document in turn.

So this brings me to what _I_ need UDIs for. As the architect
of an information indexing and resource discovery system I'd
like to make it easier to build and operate such tools, and
I'd like to make it easier for my users to find things in my
system. I'm particularly interested in the early phases of
resource discovery, when users are performing class discovery,
and then instance identification. After all, in many cases
actual access of a document comes later, if at all.

In particular, when someone comes looking for a particular
document that exists on a multiplicity of sites, in a
variety of encoding formats, accessible in a variety of
ways, I'd like to be able to indicate to them when
multiple hits on a search are actually referring to the
same information _contents_, regardless of encoding
method. I'd also like to be able to let them know when
similar looking documents actually contain different
information.

I accept that people in real life want to make copies of
documents and I want to help them in this. As far as I'm
concerned, at least initially, it is the _content_ of a piece
of information that I want identified uniquely, not the
location of an instantiation of this contents.

Of course, I'd also like eventually to be able to tell my
users _where_ such documents live and how to access them,
but in the initial phases of discovery users (or their
tools) are having to weed out false hits, and I don't see
a lot in the general schemes that people are working on
that helps in this.

+   In a nutshell, I'd like to see the emphasis move to     +
+   uniquely identifying _content_ rather than encoding     +
+   of location and access method for UDIs.                 +


As a concrete example, the document "RFC-1296" (a recent
RFC describing the explosive growth of the Internet over
the past 10 years) exists on the Internet. Perhaps it
exists in both ASCII and Postscript formats at a number of
archive sites (In fact, I've only seen this RFC in ASCII,
but play the game with me for a moment).

I'd like _my_ UDI scheme to somehow indicate to me that the
various instantiations of this document ("rfc1296",
"RFC-1296.txt", "RFC/1296.ps" etc) are actually all the same
document, merely stored in different formats and locations.
Of course, I'd also like to know when two files having the
same name actually have different contents.

Now, certainly the work on UDIs that has already appeared
is addressing a real problem so perhaps I need to come up
with a new name for what I want, to avoid confusion.

+   Thus, I propose to call what I am discussing here       +
+   "Unique Document Serial Numbers" (UDSNs) to             +
+   distinguish them from the existing UDIs.                +


UDSNs would have the property that they could uniquely
identify information objects across encoding schemes, naming
schemes and file system representations. Thus, accessing RFC
1296 interactively through Gopher, programmatically through
Prospero or through email to archie should always return me
some indication that the content is the same when I find
multiple instantiations, provided the document has not been
altered other than in representation or encoding.

It seems to me at this point that creating a UDSN scheme would
not be difficult if we can set up the right model.

I would begin by modeling information as a series of objects
having a specific set of attributes. In particular, we require
the concept of "owner" or perhaps "author", a logical entity
corresponding to the creator of that particular information
object.

The UDSN is now an attribute that takes as a value an
ordered number unique to the collection of UDSNs for that
owner/author.

Each document can have only one owner, and a single owner can
have zero or more information objects. Each time the owner
creates a new information object (either from scratch or by
modifying an existing information object) it is assigned the
next UDSN for that owner in sequential order, starting at
UDSN 0 and increasing for each document.
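The per-owner numbering just described could be sketched like this (the class and method names are hypothetical, as is the owner string):

```python
class UDSNGenerator:
    """Assign each owner an independent sequence of serial
    numbers, starting at 0, as described above."""

    def __init__(self):
        self._next = {}                 # owner -> next serial number

    def assign(self, owner):
        """Return the next (owner, serial) pair in this owner's
        sequence and advance the counter."""
        n = self._next.get(owner, 0)
        self._next[owner] = n + 1
        return (owner, n)

gen = UDSNGenerator()
first = gen.assign("peterd@expresso.cc.mcgill.ca")   # serial 0
second = gen.assign("peterd@expresso.cc.mcgill.ca")  # serial 1
```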

Whenever someone actually copies (as opposed to referencing
with a hypertext-like pointer) such an information object,
the UDSN is copied with it. It must become an inseparable
attribute, permanently associated with that information
object.

Note that this scheme would require some special
housekeeping if we tried to allow the single owner to
correspond to some meta-user on the Internet, allowed to
roam from site to site.

This could be avoided in a basic scheme (at least in the
current Internet environment) by specifying an owner as a
username/hostid pair, with hostid corresponding to a fully
qualified domain name and username corresponding to a unique
userid on that hostid (presumably, single user machines would
only issue one such userid).

Although this would work as an initial implementation, I think
it has obvious inherent limitations and would like to see it
avoided if possible. In particular, I don't want to have to
start a new series of UDSNs just because I moved to a machine
in Australia while on holiday. I want "meta-peterd" to keep
generating UDSNs in the same series wherever I'm currently
working.

Drawing on Cliff Neuman's Virtual System Model I'd postulate
the existence of a "Virtual System User", one whose home would
follow him around the net, as would his UDSN generator. In
this case we could also speak of Unique User Identifiers
(UUI?), as well. Details on this meta-user are left for
another time. 

A UDSN would thus consist of a single UUI (guaranteed to be
unique in the particular information space by whatever means
we come up with to implement this idea) and a sequential UDSN,
guaranteed to be unique for that particular UUI.  Together
they would be associated with any information object available
from your information delivery system. They could in fact be
incorporated into the UDI ideas already proposed.

Now, people searching archie for "RFC1296" could get a
variety of matches, with varying UDI, but from the UDSN
portion users would quickly identify duplicates (in fact,
their user agents could do that for them, sorting and even
collapsing duplicates if desired). People searching through
Prospero finding multiple files with similar names would be
able to identify copies of the same document without having to
access each of them in turn. Searching such a system for
matching UDSNs would quickly give us some measure of the
scale of duplication on the Internet...
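The duplicate-collapsing a user agent might do could look roughly like this. The hit format, a (udi, udsn) pair per match, is an assumption for illustration:

```python
def collapse_hits(hits):
    """Group (udi, udsn) search hits so each distinct UDSN
    appears once, carrying all the locations it was found at."""
    by_udsn = {}
    for udi, udsn in hits:
        by_udsn.setdefault(udsn, []).append(udi)
    return by_udsn

# Three hits, but only two distinct documents: the first two
# share a UDSN and so are copies of the same contents.
hits = [
    ("ftp:siteA/rfc1296",      ("peterd", 42)),
    ("ftp:siteB/RFC-1296.txt", ("peterd", 42)),
    ("ftp:siteC/rfc1296",      ("postel", 7)),
]
grouped = collapse_hits(hits)
```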

Of course, the moment someone edited or modified a document
object, the UDSN would have to change, and we would have to
expect some assistance from our work environment to handle
such things as UDSN assignment, object/attribute editing, etc.
I've already described my idea of an "info daemon" to handle
Internet electronic publishing at the last IRTF get-together
and this seems like an ideal job for such a tool. When
registering a document, it would assign the UDSN, binding it
to the object before making it available to the network. The
act of assigning what are in effect per-user ISSN/ISBN numbers
would be an important part of the act of "electronic
publishing".


The biggest objection to all of the above that I can see is that
given the UDIs already postulated by Tim and others such a
means for distinguishing duplicate and dissimilar documents is
unnecessary. If everyone stored only pointers to the actual
documents, UDSNs appear unnecessary.

Such an objection assumes that given a suitable UDI document
pointer mechanism we can eliminate duplication through
"socialization" of our users.  This may be true eventually,
but it certainly won't happen in the short term. I suspect
that in practice it won't ever be completely possible.

Certainly if we had UDSNs _now_ tools such as archie would be
that much more useful. I just can't see the need for them
going away any time soon. I'd like to pursue this idea, and
I am wondering if this appears to be of use to anyone else?



Okay, that's it for now. Kudos to those who made it this far.
Comments anyone? Have I missed the point? Am I barking up the
wrong tree, or have I misunderstood the current discussion on
UDIs?

Enquiring minds want to know.




				- peterd


----------------------------------------------------------------------

This posting elicited several responses. Here is a
reply to one by Mike Schwartz.


> From schwartz@latour.cs.colorado.edu Sat Feb 29 07:25:11 1992
.  .  .
> I think the problem is a good deal harder if you want to actually do this
> in a conceptual fashion (so that, for example, rfc1234.ps is considered the
> same as rfc1234.txt.  Even harder: so versions of these documents or
> variants are somehow represented as being related). .  .

It becomes hard if we try to make the UDI/UDSN also handle
version control and representation information, but actually
assigning unique document serial numbers should not in and of
itself be all that difficult. A unique string needs to be
generated (your CRC signature idea, perhaps coupled with
timestamp) and a signature for the author needs to be
generated and attached to the document when it comes into
existence, then preserved as the document is copied around and
we'll have a start.

The thing I want is that we see such a serial number as a
signature of the _contents_, not just a "universal encoding"
of the filename. Even if that's all we had, people doing the
first steps of information discovery could get a lot out of
this.

If we use just CRCs, as you suggest, then they can actually
just be generated as needed, and not stored, which might be
useful in some cases. Of course, tools such as interfaces to
ftp should also be expanded to let us query these CRCs without
bothering to copy the file over to perform the calculation
(Perhaps we need yet another argument to ls? (!) I can see it
now: " ls -# <filename> lists the file's CRC signature" :-)
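A rough sketch of the CRC-as-content-signature idea: two files with identical bytes get the same signature regardless of name or location. CRC-32 from Python's zlib stands in here for whatever signature scheme is eventually agreed on:

```python
import zlib

def content_signature(data):
    """Return a CRC-32 signature over the document's bytes,
    independent of its name or where it is stored."""
    return zlib.crc32(data) & 0xFFFFFFFF

# Identical contents give identical signatures; CRC-32 is
# guaranteed to change when a single byte changes.
same1 = content_signature(b"Network growth, 1981-1992.")
same2 = content_signature(b"Network growth, 1981-1992.")
other = content_signature(b"Network growth, 1981-1993.")
```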


I would still like to see us include an author ID, which
_would_ have to be attached and carried around (for why, see
below). But if all we want from document serial numbers is to
allow us to compare document signatures, all we really need is
uniqueness of IDs, and a CRC signature may be enough for this
in practice.

Version control and representation information are still
definitely useful, so how do we get them if not by
incorporating them into our serial numbers?  Now that I've
thought about it a bit (i.e. since last night... :-) I don't
think we need to require that such things be included at the
lowest level with the doc serial numbers themselves.

After all, on the UNIX file system we can distinguish between
copies of a file and multiple links to the same file by
comparing inode numbers, but that doesn't mean version control
and representation information must also be stored in the
inode. We can figure out such things using information
available to us "higher up". If we just want to know if two
references were to the same file, we can compare their inode
numbers without demanding that this number also tell us about
previous versions that existed or whether this is a Postscript
copy of some ASCII file in another partition.
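The inode comparison above can be demonstrated directly (a POSIX-style file system is assumed; only os.stat and os.link are used, and the file names are arbitrary):

```python
import os
import tempfile

def same_file(path_a, path_b):
    """True exactly when both paths name the same underlying
    file, i.e. their (device, inode) pairs match."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)

d = tempfile.mkdtemp()
orig = os.path.join(d, "doc.txt")
with open(orig, "w") as f:
    f.write("contents")

link = os.path.join(d, "link.txt")
os.link(orig, link)            # hard link: shares the inode

copy = os.path.join(d, "copy.txt")
with open(orig) as src, open(copy, "w") as dst:
    dst.write(src.read())      # byte-for-byte copy: new inode
```

As with UDSNs, the number answers only the "same object?" question; nothing about versions or representations lives in the inode.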

I can imagine an SCCS-like file system that manages changes and
maintains information about previous versions of files, for
those who need or want it. One approach was outlined at the
last Usenix when someone presented details of a file system
that automatically tracked changes over time (it was something
like the "3-D file system" but I don't have my proceedings
here at home to confirm details). In any event, the system
lets the user access any file by name _and date_. Such an idea
could be used to track changes to files over time. 

And what about representational changes that don't change
information contents?

Can we recognize actions that change representation, but not
content? Maybe not algorithmically, since I can create a
Postscript version of an ASCII doc by loading it into
something like Framemaker and then saving it under a different
format and name. So unless all such programs are changed, it
doesn't appear feasible to distinguish between real editing
and simple representational changes in such programs.

Still, we _can_ postulate the existence of some form of
"information broker", who has the job of managing UDIs and
UDSNs. A simple version change or file copy would be
registered with the broker. For version changes, the file gets
a new UDI and a new serial number, but the internal revision
history is updated. For file copies the broker would assign
the new file the same serial number, but different UDI. Any
action that brings a new document into existence would
also generate a new UDI and a new serial number.


If we accept the idea of doing representation and version
control at this higher level then we can manage
representational changes by "checking out" a copy of a
document when we want to perform a conversion. In such a
scheme, the parent document is handed over, and a converted
version is handed back, with a promise that the content is
unchanged except for its rendering. The broker now preserves
the serial number and grants a new UDI.
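The broker behaviour just described could be sketched as follows (all names are hypothetical): registering a copy preserves the serial number but grants a new UDI, while registering a revision grants both a new UDI and a new serial number, with the parentage recorded:

```python
import itertools

class Broker:
    """A sketch of the per-author 'information broker': it
    assigns UDIs and per-owner serial numbers (UDSNs) and keeps
    each object's revision history."""

    def __init__(self, owner):
        self.owner = owner
        self._serials = itertools.count()
        self._udis = itertools.count()
        self.registry = {}              # udi -> (udsn, history)

    def publish(self, parent=None, copy=False):
        """Register a document object and return its new UDI.
        A copy keeps its parent's UDSN; anything else (a fresh
        document or a revision) gets the next UDSN."""
        udi = next(self._udis)
        if copy and parent is not None:
            udsn, history = self.registry[parent]
        else:
            udsn = (self.owner, next(self._serials))
            history = ([] if parent is None
                       else self.registry[parent][1] + [parent])
        self.registry[udi] = (udsn, history)
        return udi

b = Broker("peterd")
original = b.publish()
mirror = b.publish(parent=original, copy=True)   # same UDSN, new UDI
revised = b.publish(parent=original)             # new UDSN, history kept
```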

Conceptually, I like to think of documents as "belonging"
to their authors and so I imagine this broker as an
"author's publisher" that is responsible for managing
document objects on my behalf.  

I have already postulated such a broker to receive information
requests from the Internet for my host and am happy to have
it do double duty as my UDSN manager, but there is nothing
that requires an active process if we want to incorporate the
functionality into our favorite file system.

The advantage of having an active process (if we have
incorporated the author's ID into the UDSN) is that we can
always contact the author's broker for the version and
representation information, if we want it.

Thus we get back to what I wanted originally (contents
identification, version control, representation isomorphism).
All it takes is an author ID of some form and agreement on how
to translate that into a broker address (X.500, anyone?).

Separating out the version control and representation
management functions like this does appear to have several
advantages, not least of which is that it could be retrofitted
on top of most systems without changes to the O/S-level
functions. It would also allow us to deploy the CRC-like
signatures _now_ since that's easy.

Incorporating the whole thing at a lower level is possible, I
just don't see it as necessary to get started.


This is certainly miles away from initial concerns for such
things as UDIs, but I offer the idea as one possible solution
that can coexist with my serial numbers without requiring any
retrofitting of existing systems.

Of course, if we can solve the problems I'd be happy to
have a serial number that allows me to trace change
history, parentage and representational changes as well as
distinguish objects by their contents. I'm just not that
ambitious. :-)


> .  .  .If you are willing to
> settle for unique identification of the bytes in a document (overcoming
> naming differences on different machines but not representational or
> version differences), you could use some sort of hashing or cryptographic
> checksum.  This is essentially what I did for file signatures in my FTP
> traffic measurement study, to decide when identical files were being
> transferred multiple times.  The win here is that you can get the IDs in an
> automated fashion. As so many of our prototype systems have shown, it's
> much easier to bootstrap a system into existence if you can run a program
> against existing data to generate the indices you need, rather than
> requiring global administrative agreement and manual human registration of
> the index information. 

This would allow us to go back and "sign" whole archives,
which would be a real win. There's nothing stopping us from
doing this today, and incorporating the signatures as a part of
the information offered. Something for us to look at with IAFA?

> .  .  . Obviously, what you want would be preferable in the
> long run, but then maybe you'll need to attend standards meetings for 2
> years before it's ready for mass consumption (and, you should probably talk
> with the library science folks)

I'd be happy with a simple agreement to use CRC signatures for
now. I can't see myself surviving 2 years of standards
meetings!  Actually, if I build and deploy my broker, then
when the standards people are finally done I can upgrade to
whatever was agreed upon. Meanwhile, I'm in business today.

The idea of talking to library scientists is a good one, too.
I'm going to the Net92/CNI meeting in Washington in a couple of
weeks and hope to broach the subject there. I don't know enough
about ISSNs/ISBNs but they do seem to have a lot of the traits
I want. Maybe what we really need to do is adopt them with the
librarians' blessing?




				- peterd

--