Re: Draft: Universal Document Identifiers
Peter Deutsch <peterd@expresso.cc.mcgill.ca> Thu, 05 March 1992 17:11 UTC
Received: from nri.nri.reston.va.us by ietf.NRI.Reston.VA.US id aa01042; 5 Mar 92 12:11 EST
Received: from nri.reston.va.us by NRI.Reston.VA.US id aa15719; 5 Mar 92 12:11 EST
Received: from kona.CC.McGill.CA by NRI.Reston.VA.US id aa15711; 5 Mar 92 12:11 EST
Received: by kona.cc.mcgill.ca (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA15624 on Wed, 4 Mar 92 17:07:50 -0500
Received: from expresso.CC.McGill.CA by kona.cc.mcgill.ca with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA15607 (mail destined for /usr/lib/sendmail -odq -oi -fiafa-request iafa-out) on Wed, 4 Mar 92 17:06:19 -0500
Received: by expresso.cc.mcgill.ca (NeXT-1.0 (From Sendmail 5.52)/NeXT-1.0) id AA12411; Wed, 4 Mar 92 17:06:04 GMT-0500
Message-Id: <9203042206.AA12411@expresso.cc.mcgill.ca>
In-Reply-To: Tim Berners-Lee's message as of Feb 27, 17:22
From: Peter Deutsch <peterd@expresso.cc.mcgill.ca>
Date: Wed, 04 Mar 1992 22:06:03 -0000
X-Mailer: Mail User's Shell (6.5.6 6/30/89)
To: Tim Berners-Lee <iafa-request@kona.cc.mcgill.ca>, chi-arch@uccvms.bitnet, www-talk@nxoc01.cern.ch, wais-talk@think.com, iafa@cc.mcgill.ca
Subject: Re: Draft: Universal Document Identifiers
Cc: Rare WG3 <rare-wg3@surfnet.nl>, nisi@merit.edu
Tim,

Here's a response to your "Universal Document Identifiers on the Network" piece. I was late, I was tired, and I originally sent it to the irtf-rd list. Sigh. Apologies to anyone who is thus seeing it again.

For anyone who will be at the IETF meeting in San Diego in two weeks, I've tentatively put the issue of "Universal Document Identifiers" vs. what I call "Universal Document Serial Numbers" on the agenda for the Living Documents BOF on Tuesday night.

- peterd

----------------------------------------------------------------------

g'day all,

This is a note to bring together some thoughts I've had on the Unique Document Identifier question (the "UDI question" for this posting). In particular, I have just finished going over Tim Berners-Lee's recent discussion piece and have some thoughts and another view of things. My basic proposal is only half-baked at this point, but I'd like to circulate it before the IRTF meeting and Living Docs BOF at the upcoming IETF to see if it rings a bell with anyone.

The Problem with Unique Document Identifiers
--------------------------------------------

I've had some trouble with the various postings and papers I've seen discussing the possibility of creating a Unique Document Identifier scheme (I remember one by Brewster Kahle, and the more recent posting by Tim Berners-Lee in particular, although I'm not citing anyone as causing my troubles - it may just be me! :-)

Without trying to repost everything I've seen, and probably therefore dangerously oversimplifying what I've read, I'd summarize schemes such as the W3 naming scheme covered by Tim in his recent paper ("Universal Document Identifiers on the Network") as revolving around encoding:

  a) a naming scheme identifier, followed by

  b) an address whose format is dependent upon the naming scheme.

The other UDI schemes I've seen all seem to take a similar approach.
Basically, they all appear to be defining a method for encoding multiple types of "physical addresses" of some sort, where "physical address" refers to a way of specifying enough information in the particular naming scheme to allow the reader to access the document. These are in effect "multi-disciplinary document pointers", allowing us to specify documents in multiple formats.

The problem I see is that two files on the Internet, stored at different locations but containing identical information, would still have two different Unique Document Identifiers. Now, you can get around this problem by only storing pointers to original documents wherever possible, but it seems to me that sooner or later (whether it be lazy site administrators, the wish to implement mirroring strategies to provide robustness, or whatever) we end up with multiple copies of the same document with differing UDIs.

Contrast this with the librarians' ISSN or ISBN numbers. Each and every copy of "The Complete Fawlty Towers" by John Cleese and Connie Booth carries the same ISBN (ISBN 0-413-18390-4 for those who don't have it yet :-) because each and every one is merely an instantiation of the same collection of information. (Actually, this is not _quite_ true, since I believe the softcover edition would have a different number, but assuming no softcover edition this is essentially correct.)

I think the distinction between what we have been doing and what the librarians are doing is an important one, and it drives to the heart of my objections. I have a problem with speaking about the need for Unique Document Identifiers, and then going into detail about schemes that seem primarily adapted to encoding access information without ensuring that multiple documents are in fact unique. Although the schemes proposed do solve some people's problems, they don't seem to be solving the problem that I would want addressed by UDIs.
They do not even follow the model I would choose to address the problem I want solved. For example, when speaking of the structure of a physical address, Tim speaks of needing a physical address to allow:

  - a user's program to contact the server

  - the server to search and index or retrieve the document

  - the user's program to locate an individual position or element within a document.

But what about identifying the actual document as the one I really want? What about identifying multiple instances of the same document under different access schemes or encoding methods? What about differentiating between different documents with similar "names" on different hosts? Hopefully we can allow users to do all of this without actually forcing them to examine the contents of each candidate document in turn.

So this brings me to what _I_ need UDIs for. As the architect of an information indexing and resource discovery system, I'd like to make it easier to build and operate such tools, and I'd like to make it easier for my users to find things in my system. I'm particularly interested in the early phases of resource discovery, when users are performing class discovery and then instance identification. After all, in many cases actual access of a document comes later, if at all.

In particular, when someone comes looking for a particular document that exists on a multiplicity of sites, in a variety of encoding formats, accessible in a variety of ways, I'd like to be able to indicate to them when multiple hits on a search are actually referring to the same information _contents_, regardless of encoding method. I'd also like to be able to let them know when similar-looking documents actually contain different information. I accept that people in real life want to make copies of documents, and I want to help them in this.
As far as I'm concerned, at least initially, it is the _content_ of a piece of information that I want identified uniquely, not the location of an instantiation of this content. Of course, I'd also like eventually to be able to tell my users _where_ such documents live and how to access them, but in the initial phases of discovery users (or their tools) are having to weed out false hits, and I don't see a lot in the general schemes that people are working on that helps with this.

  + In a nutshell, I'd like to see the emphasis move to +
  + uniquely identifying _content_ rather than encoding +
  + of location and access method for UDIs.             +

As a concrete example, the document "RFC-1296" (a recent RFC describing the explosive growth of the Internet over the past 10 years) exists on the Internet. Perhaps it exists in both ASCII and PostScript formats at a number of archive sites (in fact, I've only seen this RFC in ASCII, but play the game with me for a moment). I'd like _my_ UDI scheme to somehow indicate to me that the various instantiations of this document ("rfc1296", "RFC-1296.txt", "RFC/1296.ps", etc.) are actually all the same document, merely stored in different formats and locations. Of course, I'd also like to know when two files having the same name actually have different contents.

Now, certainly the work on UDIs that has already appeared is addressing a real problem, so perhaps I need to come up with a new name for what I want, to avoid confusion.

  + Thus, I propose to call what I am discussing here +
  + "Unique Document Serial Numbers" (UDSNs) to       +
  + distinguish them from the existing UDIs.          +

UDSNs would have the property that they could uniquely identify information objects across encoding schemes, naming schemes and file system representations.
Thus, accessing RFC 1296 interactively through Gopher, programmatically through Prospero, or through email to archie should always return me some indication that the content is the same when I find multiple instantiations, provided the document has not been altered other than in representation or encoding.

It seems to me at this point that creating a UDSN scheme would not be difficult if we can set up the right model. I would begin by modeling information as a series of objects having a specific set of attributes. In particular, we require the concept of "owner" or perhaps "author", a logical entity corresponding to the creator of that particular information object. The UDSN is now an attribute that takes as a value an ordered number unique to the collection of UDSNs for that owner/author. Each document can have only one owner, and a single owner can have zero or more information objects.

Each time the owner creates a new information object (either from scratch or by modifying an existing information object) it is assigned the next UDSN for that owner in sequential order, starting at UDSN 0 and increasing for each document. Whenever someone actually copies (as opposed to referencing with a hypertext-like pointer) such an information object, the UDSN is copied with it. It must become an inseparable attribute, permanently associated with that information object.

Note that this scheme would require some special housekeeping if we tried to allow the single owner to correspond to some meta-user on the Internet, allowed to roam from site to site. This could be avoided in a basic scheme (at least in the current Internet environment) by specifying an owner as a username/hostid pair, with hostid corresponding to a fully qualified domain name and username corresponding to a unique userid on that hostid (presumably, single-user machines would only issue one such userid).
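The owner-plus-sequence scheme above can be sketched in a few lines. This is a hypothetical illustration only; the names `Owner` and `InfoObject` are mine, not part of any proposal:

```python
# Hypothetical sketch of per-owner UDSN assignment as described above.
from dataclasses import dataclass

@dataclass
class InfoObject:
    owner: tuple   # (username, hostid) pair identifying the creator
    udsn: int      # inseparable attribute, permanently bound to the object
    content: bytes

    def copy(self) -> "InfoObject":
        # A true copy (not a pointer) carries the UDSN with it.
        return InfoObject(self.owner, self.udsn, self.content)

@dataclass
class Owner:
    username: str       # unique userid on the host
    hostid: str         # fully qualified domain name
    next_udsn: int = 0  # sequential series, starting at UDSN 0

    def create(self, content: bytes) -> InfoObject:
        # Each new information object, whether written from scratch or
        # by editing an old one, gets the next UDSN in this owner's series.
        obj = InfoObject((self.username, self.hostid), self.next_udsn, content)
        self.next_udsn += 1
        return obj
```

Note that editing is modeled as creating a new object (a new UDSN), while copying preserves the UDSN, exactly the distinction drawn above.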
Although this would work as an initial implementation, I think it has obvious inherent limitations and I would like to see it avoided if possible. In particular, I don't want to have to start a new series of UDSNs just because I moved to a machine in Australia while on holiday. I want "meta-peterd" to keep generating UDSNs in the same series wherever I'm currently working.

Drawing on Cliff Neuman's Virtual System Model, I'd postulate the existence of a "Virtual System User", one whose home would follow him around the net, as would his UDSN generator. In this case we could also speak of Unique User Identifiers (UUIs?) as well. Details on this meta-user are left for another time.

A UDSN would thus consist of a single UUI (guaranteed to be unique in the particular information space by whatever means we come up with to implement this idea) and a sequential UDSN, guaranteed to be unique for that particular UUI. Together they would be associated with any information object available from your information delivery system. They could in fact be incorporated into the UDI ideas already proposed.

Now, people searching archie for "RFC1296" could get a variety of matches, with varying UDIs, but from the UDSN portion users would quickly identify duplicates (in fact, their user agents could do that for them, sorting and even collapsing duplicates if desired). People searching through Prospero and finding multiple files with similar names would be able to identify copies of the same document without having to access each of them in turn. Searching such a system for matching UDSNs would quickly give us some measure of the scale of duplication on the Internet...

Of course, the moment someone edited or modified a document object, the UDSN would have to change, and we would have to expect some assistance from our work environment to handle such things as UDSN assignment, object/attribute editing, etc.
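The duplicate-collapsing a user agent might do could look something like this sketch (the tuple layout for a search hit is an assumption of mine, not part of any proposal):

```python
# Sketch: a user agent grouping search hits whose (UUI, UDSN) pairs
# match, so duplicates of one information object collapse to one entry.
from collections import defaultdict

def collapse_duplicates(hits):
    """hits: iterable of (location, uui, udsn) tuples.
    Returns {(uui, udsn): [locations...]} so each distinct
    information object appears once, with all its locations."""
    groups = defaultdict(list)
    for location, uui, udsn in hits:
        groups[(uui, udsn)].append(location)
    return dict(groups)
```

An archie-style result list for "RFC1296" would thus shrink to one entry per distinct content, each listing every site and filename it was found under.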
I've already described my idea of an "info daemon" to handle Internet electronic publishing at the last IRTF get-together, and this seems like an ideal job for such a tool. When registering a document, it would assign the UDSN, binding it to the object before making it available to the network. The act of assigning what are in effect per-user ISSN/ISBN numbers would be an important part of the act of "electronic publishing".

The big objection to all of the above that I might see is that, given the UDIs already postulated by Tim and others, such a means for distinguishing duplicate and dissimilar documents is unnecessary. If everyone stored only pointers to the actual documents, UDSNs appear unnecessary. Such an objection assumes that given a suitable UDI document pointer mechanism we can eliminate duplication through "socialization" of our users. This may be true eventually, but it certainly won't happen in the short term. I suspect that in practice it won't ever be completely possible. Certainly if we had UDSNs _now_, tools such as archie would be that much more useful. I just can't see the need for them going away any time soon.

I'd like to pursue this idea, and I am wondering if it appears to be of use to anyone else?

Okay, that's it for now. Kudos to those who made it this far. Comments anyone? Have I missed the point? Am I barking up the wrong tree, or have I misunderstood the current discussion on UDIs? Enquiring minds want to know.

- peterd

----------------------------------------------------------------------

This posting elicited several responses. Here is a reply to one by Mike Schwartz.

> From schwartz@latour.cs.colorado.edu Sat Feb 29 07:25:11 1992
. . .
> I think the problem is a good deal harder if you want to actually do this
> in a conceptual fashion (so that, for example, rfc1234.ps is considered the
> same as rfc1234.txt. Even harder: so versions of these documents or
> variants are somehow represented as being related).
. . .
It becomes hard if we try to make the UDI/UDSN also handle version control and representation information, but actually assigning unique document serial numbers should not, in and of itself, be all that difficult. A unique string needs to be generated (your CRC signature idea, perhaps coupled with a timestamp), and a signature for the author needs to be generated and attached to the document when it comes into existence, then preserved as the document is copied around, and we'll have a start. The thing I want is that we see such a serial number as a signature of the _contents_, not just a "universal encoding" of the filename. Even if that's all we had, people doing the first steps of information discovery could get a lot out of this.

If we use just CRCs, as you suggest, then they can actually be generated as needed, and not stored, which might be useful in some cases. Of course, tools such as interfaces to ftp should also be expanded to let us query these CRCs without bothering to copy the file over to perform the calculation. (Perhaps we need yet another argument to ls? (!) I can see it now: "ls -# <filename> lists the file's CRC signature" :-)

I still would like to see us include an author ID, which _would_ have to be attached and carried around (for why, see below), but if all we want from document serial numbers is to allow us to compare document signatures, all we really need is uniqueness of IDs, and a CRC signature may be enough for this in practice.

Version control and representation information are still definitely useful, so how do we get them if not by incorporating them into our serial numbers? Now that I've thought about it a bit (i.e. since last night... :-) I don't think we need or require that such things be included at the lowest level with the doc serial numbers themselves.
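A content signature generated on demand, in the spirit of the hypothetical "ls -#" above, could be as simple as this sketch (zlib's CRC-32 stands in for whatever checksum the community might agree on):

```python
# Sketch: a content signature computed on demand rather than stored.
# zlib.crc32 is used here as a stand-in checksum.
import zlib

def crc_signature(path):
    """Return the CRC-32 of a file's contents, read in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Two files with identical bytes yield identical signatures regardless of name or location, which is exactly the "signature of the contents, not the filename" property wanted here.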
After all, on the UNIX file system we can distinguish between copies of a file and multiple links to the same file by comparing inode numbers, but that doesn't mean version control and representation information must also be stored in the inode. We can figure out such things using information available to us "higher up". If we just want to know whether two references are to the same file, we can compare their inode numbers without demanding that this number also tell us about previous versions that existed, or whether this is a PostScript copy of some ASCII file in another partition.

I can imagine an SCCS-like file system that manages changes and maintains information about previous versions of files, for those who need or want it. One approach was outlined at the last Usenix, when someone presented details of a file system that automatically tracked changes over time (it was something like the "3-D file system", but I don't have my proceedings here at home to confirm the details). In any event, the system lets the user access any file by name _and date_. Such an idea could be used to track changes to files over time.

And what about representational changes that don't change information contents? Can we recognize actions that change representation, but not content? Maybe not algorithmically, since I can create a PostScript version of an ASCII doc by loading it into something like FrameMaker and then saving it under a different format and name, so unless all such programs are changed it doesn't appear feasible to distinguish between real editing and simple representational changes in such programs.

Still, we _can_ postulate the existence of some form of "information broker", which has the job of managing UDIs and UDSNs. A simple version change or file copy would be registered with the broker. For version changes, the file gets a new UDI and a new serial number, but the internal revision history is updated.
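The inode analogy is easy to make concrete: on a POSIX system two names refer to the same file exactly when their (device, inode) pairs match, and that test carries no version or representation information at all. A minimal sketch:

```python
# Sketch: the UNIX inode comparison used as an analogy above.
# Two links to one file share an inode; a copy gets a new one.
import os

def same_file(path_a, path_b):
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

A hard link compares equal; a byte-identical copy does not, which is precisely why the content-level UDSN has to live "higher up" than the inode.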
For file copies, the broker would assign the new file the same serial number, but a different UDI. Any action that brings a new document into existence would also generate a new UDI and a new serial number.

If we accept the idea of doing representation and version control at this higher level, then we can manage representational changes by "checking out" a copy of a document when we want to perform a conversion. In such a scheme, the parent document is handed over, and a converted version is handed back, with a promise that the content is unchanged except for its rendering. The broker then preserves the serial number and grants a new UDI.

Conceptually, I like to think of documents as "belonging" to their authors, and so I imagine this broker as an "author's publisher" that is responsible for managing document objects on my behalf. I have already postulated such a broker to receive information requests from the Internet for my host, and I am happy to have it do double duty as my UDSN manager, but there is nothing that requires an active process if we want to incorporate the functionality into our favorite file system. The advantage of having an active process (if we have incorporated the author's ID into the UDSN) is that we can always contact the author's broker for the version and representation information, if we want it.

Thus we get back to what I wanted originally (contents identification, version control, representation isomorphism). All it takes is an author ID of some form and agreement on how to translate that into a broker address (X.500, anyone?). Separating out the version control and representation management functions like this does appear to have several advantages, not least of which is that it could be retrofitted on top of most systems without changes to the O/S-level functions. It would also allow us to deploy the CRC-like signatures _now_, since that's easy.
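The broker's bookkeeping rules (a copy keeps the serial number but gets a new UDI; a checked-out conversion likewise; a real edit gets both anew) can be sketched as follows; the class and method names are purely illustrative:

```python
# Sketch of the hypothetical information broker's bookkeeping.
import itertools

class Broker:
    def __init__(self):
        self._udis = itertools.count()   # one UDI per instantiation
        self._udsns = itertools.count()  # one UDSN per distinct content

    def register(self, content):
        # A brand-new document: new UDI and new serial number.
        return {"udi": next(self._udis), "udsn": next(self._udsns),
                "content": content}

    def copy(self, doc):
        # A file copy: same serial number, different UDI.
        return dict(doc, udi=next(self._udis))

    def convert(self, doc, rendered):
        # Checked-out representation change: content re-rendered,
        # serial number preserved, new UDI granted.
        return dict(doc, udi=next(self._udis), content=rendered)

    def revise(self, content):
        # A genuine edit brings a new document into existence.
        return self.register(content)
```

The point of the sketch is only the invariant: the UDSN follows the _content_ through copies and conversions, while every instantiation gets its own UDI.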
Incorporating the whole thing at a lower level is possible; I just don't see it as necessary to get started. This is certainly miles away from the initial concerns for such things as UDIs, but I offer the idea as one possible solution that can coexist with my serial numbers without requiring any retrofitting of existing systems. Of course, if we can solve the problems I'd be happy to have a serial number that allows me to trace change history, parentage and representational changes, as well as distinguish objects by their contents. I'm just not that ambitious. :-)

> . . . If you are willing to
> settle for unique identification of the bytes in a document (overcoming
> naming differences on different machines but not representational or
> version differences), you could use some sort of hashing or cryptographic
> checksum. This is essentially what I did for file signatures in my FTP
> traffic measurement study, to decide when identical files were being
> transferred multiple times. The win here is that you can get the IDs in an
> automated fashion. As so many of our prototype systems have shown, it's
> much easier to bootstrap a system into existence if you can run a program
> against existing data to generate the indices you need, rather than
> requiring global administrative agreement and manual human registration of
> the index information.

This would allow us to go back and "sign" whole archives, which would be a real win. There's nothing stopping us from doing this today, and incorporating the signatures as a part of the information offered. Something for us to look at with IAFA?

> . . . Obviously, what you want would be preferable in the
> long run, but then maybe you'll need to attend standards meetings for 2
> years before it's ready for mass consumption (and, you should probably talk
> with the library science folks)

I'd be happy with a simple agreement to use CRC signatures for now. I can't see myself surviving 2 years of standards meetings!
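Retroactively "signing" a whole archive, as suggested above, would amount to nothing more than walking the tree and recording a signature per file; a sketch, again assuming CRC-32 as the stand-in checksum:

```python
# Sketch: signing an existing archive by recording a CRC-32 per file.
import os
import zlib

def sign_archive(root):
    """Return {relative path: CRC-32 of contents} for files under root."""
    signatures = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                signatures[os.path.relpath(path, root)] = (
                    zlib.crc32(f.read()) & 0xFFFFFFFF)
    return signatures
```

Running something like this over an FTP archive and publishing the table alongside the listings is the "bootstrap from existing data" approach Mike describes: no registration step, no administrative agreement needed.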
Actually, if I build and deploy my broker, then when the standards people are finally done I can upgrade to whatever was agreed upon. Meanwhile, I'm in business today.

The idea of talking to library scientists is a good one, too. I'm going to the Net92/CNI meeting in Washington in a couple of weeks and hope to broach the subject there. I don't know enough about ISSNs/ISBNs, but they do seem to have a lot of the traits I want. Maybe what we really need to do is adopt them with the librarians' blessing?

- peterd
--