Part IETF trip report: networked information retrieval

Jill.Foster@newcastle.ac.uk Fri, 10 April 1992 16:31 UTC

Date: Fri, 10 Apr 1992 14:33:09 +0100
From: Jill.Foster@newcastle.ac.uk
Subject: Part IETF trip report: networked information retrieval
To: osi-ds@cs.ucl.ac.uk
Message-Id: <emu-ov07.1992.0410.143309.cl54@uk.ac.ncl.mts>

 The following is an extract from a trip report on the IETF (Internet
 Engineering Task Force).  These notes cover ONLY the discussions on
 networked information retrieval.
 
 The full report may be obtained from mailbase (See below).
 
 Note from report:
 
 The following informal report is in note form and deals mainly with
 the areas of User Support and Networked Information Retrieval.  Whilst
 it is as accurate as I can make it, it is naturally a personal account
 and may be inaccurate due to lack of background information or
 misinterpretation of what I heard.  Corrections of fact are welcome,
 but any discussion of items contained here would be best directed to
 the appropriate mailing lists (in particular the nir mailing list
 mentioned below, which is now operational).
 
 This report will be stored on the UK Mailbase Server.  To retrieve a
 copy, email to Mailbase@mailbase.ac.uk with the following command in
 the body of the message:
 
      send rare-wg3-usis ietf.03.92
 
 
 The sections on networked information retrieval follow.
 
 Jill Foster - Newcastle University, UK
 Chairman: RARE WG3 USIS Subgroup
 
 
 Networked Information Retrieval
 ===============================
 
 This was discussed in the following groups:
 IAFA (Internet Anonymous FTP Archives), Living Documents BOF and
 X.500/WAIS BOF.
 
 Each and every network user has the possibility of publishing
 information widely on the network.  As the Internet grows rapidly, the
 problems of resource discovery and networked information search and
 retrieval increase daily.  Several groups have (initially)
 independently tried to tackle some of the problems.  One of the major
 attractions of this IETF (from my point of view) was that many of the
 major players in the NIR arena would be in attendance and that two
 BOFs (Living Documents and X.500/WAIS) were being held to discuss
 various aspects of NIR.  The groups concerned included:
 
      Archie people:     Peter Deutsch and Alan Emtage
      World Wide Web:    Tim Berners-Lee
      Prospero:          Cliff Neuman
      Gopher people
      X.500 group:       Steve Kille, Paul Barker, Wengyik Yeong
      as well as representatives from
      CNI architectures group: Clifford Lynch
 
 Leading up to the BOFs there were several informal sessions over lunch
 and dinner and in the terminal room.
 
 
 Living Documents BOF
 ====================
 
 The Living Documents BOF was originally intended to address the
 problem of managing documents that are continually updated (such as
 the NOC-tools RFC, the user-bibliography, user-glossary etc.).  However,
 it developed (as expected) into a wide-ranging discussion and
 brainstorming session on the problems of resource discovery and
 information retrieval.
 
 There had been long discussions on a number of mailing lists leading
 up to the IETF.  Peter Deutsch had proposed a UDSN (Universal Document
 Serial Number) which should be the equivalent of the ISBN for books.
 This would be a contents ID or fingerprint and would enable several
 instances of the same information to be recognised as being
 equivalent.
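
 As a rough illustration of the "contents ID or fingerprint" idea, the
 sketch below derives an identifier from the bytes of a document alone.
 The choice of SHA-256 and the function names are assumptions made for
 the example; the notes do not say how the proposed number would
 actually be computed.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return a hex digest that depends only on the bytes of the document."""
    # SHA-256 is an assumed stand-in; the proposal names no algorithm.
    return hashlib.sha256(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """Treat two copies as equivalent when their fingerprints match."""
    return content_fingerprint(a) == content_fingerprint(b)


# Byte-identical copies on two mirrors give the same fingerprint,
# whatever the files are called or wherever they are stored.
copy_on_mirror_a = b"Network Working Group\nExample document text\n"
copy_on_mirror_b = b"Network Working Group\nExample document text\n"
revised_copy = b"Network Working Group\nExample document text, revised\n"

print(same_content(copy_on_mirror_a, copy_on_mirror_b))  # True
print(same_content(copy_on_mirror_a, revised_copy))      # False
```

 A byte-level digest of this kind only recognises exact copies.  It
 cannot by itself decide whether two different renderings of the same
 text count as the same document.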
 
 There was discussion on what constituted an equivalent document as
 opposed to a derived work.  Were PostScript and ASCII versions of the
 same file equivalent?  (Most thought yes.)  But what if the PostScript
 version contained diagrams or graphics not in the ASCII version?  What
 if it was translated into another language?
 
 For each "document" there was a need for:
 
 o    Catalogue information (Title, author, creation date etc.)
 o    Location and access information
 
      Also required:
 
 o    UDSN (Universal Document Serial Number)
 o    UDI  (Universal Document Identifier; see later)
 o    Authentication and access control
 o    Version control
 o    Editorial control
 o    Discovery mechanisms
 o    Ability for information providers to publish/announce
      items/documents
 
 One possible UDSN would be a MARC record; however, there are several
 MARC standards (several in the US, UK MARC, etc.).  Clifford Lynch (CNI
 architectures group) felt that use of MARC was not really appropriate
 here in any case.  Amongst other problems discussed was the need to
 refer to parts of documents.  However, this discussion was shelved, as
 the problem of dealing with complete documents should be addressed
 first.  There is a real need for librarians to bring their expertise
 to these issues.  The Coalition for Networked Information (CNI) is
 working on doing just that.
 
 There is a short-term need to be able to determine whether two
 documents are the same (UDSN) and a need for a top-level, globally
 unique name to refer to one instance of a document (UDI).
 
 It was agreed to set up an nir discussion list
 
      nir@cc.mcgill.ca
 
 to discuss these issues further.
 
 
 IAFA: Internet Anonymous FTP Archives Working Group
 ===================================================
 
 A document had appeared shortly before the IETF.  Briefly it detailed
 how information about the files in a public file archive could be made
 available.  The current problem is that tools such as Archie are not
 able to discover automatically detailed information about a file
 (apart from its name).  The proposal is to have information about the
 file archive and a "file catalogue card" containing various attributes
 of the file (including keywords and a description or abstract)
 available as a separate file either in the same directory as the file
 or in a shadow directory.  The various attributes to be included on
 this catalogue card were discussed and the paper will be updated in
 the light of this.
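
 To make the "file catalogue card" idea concrete, here is a minimal
 sketch of reading such a card from a separate text file.  The
 "Attribute: value" layout, the continuation-line rule and every
 attribute name shown are assumptions for illustration only; the
 actual attributes were still under discussion.

```python
def parse_catalogue_card(text: str) -> dict[str, str]:
    """Parse 'Attribute: value' lines into a dictionary.

    Lines that start with whitespace continue the previous attribute.
    """
    card: dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace() and last_key is not None:
            card[last_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are not attribute lines
        last_key = key.strip().lower()
        card[last_key] = value.strip()
    return card


# Hypothetical card; the attribute names are invented for this example.
EXAMPLE_CARD = """\
File-Name: tools-guide.txt
Title: A guide to network tools
Keywords: network, tools, guide
Description: An example abstract that runs
    over two lines.
"""

card = parse_catalogue_card(EXAMPLE_CARD)
print(card["title"])
print(card["keywords"].split(", "))
```

 An indexing tool could collect cards like this alongside the file
 names it already gathers, which would let it answer keyword queries.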
 
 I mentioned the Draft RFC from the OSI-DS WG on "Representing Public
 Archives in the Directory", and recommended that the attributes
 required for registration in the directory should be included in the
 IAFA Archive description file.  I suggested the idea of a Quality of
 Service attribute: some services have high availability and are run
 by professionals, while other archives are run on a best-endeavour
 basis by volunteers.  A further suggestion was the need to be able to
 register logical archives, that is, separate archives that happen to
 reside on the same machine.
 
 
 X.500/WAIS BOF
 ==============
 
 This was really a companion BOF to the "Living Documents" BOF which
 had seen a wide-ranging discussion on networked information retrieval.
 In contrast, this BOF was more structured and started with
 presentations on the various applications.  There is a need to have
 some sort of Universal Document Identifier that could be used by the
 various applications.
 
      WAIS:
      ====
 
      John Curran provided a short description of WAIS.  (Unfortunately
      no one from Thinking Machines was able to attend the IETF.)
      However, John had a reasonably good knowledge and experience of
      the application (NNSC have a WAIS interface to the RFCs).
 
       _______________             _______________     _______________
      |               |           |               |   |               |
      |               |           |               |   | Files of      |
      |               |           |   WAIS        |   | Information   |
      |   Client      | --->----- |   Server      |   | (e.g. RFCs)   |
      |               |           |               |   |               |
      |               |           |               |   |               |
      |_______________|           |_______________|   |_______________|
 
 
      The WAIS Server holds a pre-built inverted index of all the words
      in each document.  (This does not make sense for non-text files,
      of course.)  It also holds other information about the document
      (size etc.).  A client formulates a query on behalf of the user
      and sends it to the WAIS Server, which searches the index, then
      retrieves and returns the document using the same protocol
      (Z39.50).  Use of a pre-built index makes this very fast.
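
      The reason a pre-built index is fast can be seen in a small
      sketch of an inverted index.  This is a generic illustration,
      not the WAIS implementation: the tokenising rule, the function
      names and the sample documents are all assumptions, and the
      Z39.50 exchange and relevance ranking are left out.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every lower-cased word to the set of document names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the names of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Invented sample documents.
documents = {
    "doc-a.txt": "Directory services for the Internet",
    "doc-b.txt": "Retrieval of documents over the Internet",
    "doc-c.txt": "Directory of anonymous archives",
}
index = build_index(documents)  # built once, ahead of any query
print(search(index, "internet directory"))  # ['doc-a.txt']
```

      A query only touches the entries for its own words, so the cost
      does not depend on scanning the text of every document.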
 
      One WAIS Server may have multiple sources (and multiple indexes).
      There are various WAIS Servers in existence, but there is
      currently no way of querying which Server is responsible for
      which source.  The possibility of putting WAIS descriptor files
      on a Server or in an X.500 directory was discussed.
 
      Differences between Z39.50 and WAIS:
      -  WAIS specifies how a query should be formulated (Z39.50 does
         not).
      -  WAIS uses Z39.50 (slightly modified) as the transport protocol.
      -  WAIS also provides relevance feedback.
 
 
      OSI-DS 22:
      =========
 
      Wengyik Yeong presented his draft RFC on representing a public
      archive in the directory.  He also described a project using
      this.  A file can be found using the directory and then
      automatically retrieved using the specified access method.
 
      World Wide Web:
      ==============
 
      Tim Berners-Lee gave a talk on the World Wide Web.  This project
      has been funded to provide a service to the world wide community
      of high energy physicists.  It is a hypertext system.  The
      philosophy behind it is that a user should be able to point and
      click on an item name or a word within a document and the
      associated document would be retrieved from wherever in the world
      and presented to the user in an appropriate format - without the
      user having to be aware of where the document is located or what
      the access method is.  These details are hidden in the hypertext
      links.  There were server programs for many information servers,
      gateways to WAIS, Archie and gopher and client programs for
      various user machines.
 
      The overlap between WWW, WAIS, Archie, Prospero was indicated and
      the need for a UDI by all of these was discussed.  Each
      application (apart from WAIS) uses a "handle" for a file which
      can be prefixed by something appropriate.  WAIS currently can
      only have "WAIS" as the prefix.  There is a need for it to be
      more flexible.
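
      One way to picture a prefixed handle is as a string whose prefix
      selects the access method and whose remainder is passed to that
      method.  The "prefix:rest" syntax, the method table and the
      example host below are assumptions for illustration; none of the
      applications described is claimed to work exactly like this.

```python
def split_handle(handle: str) -> tuple[str, str]:
    """Split 'prefix:rest' into a lower-cased prefix and the remainder."""
    prefix, sep, rest = handle.partition(":")
    if not sep or not prefix or not rest:
        raise ValueError(f"not a prefixed handle: {handle!r}")
    return prefix.lower(), rest


# Hypothetical access methods; each only describes what it would do.
ACCESS_METHODS = {
    "ftp": lambda rest: f"fetch {rest} by anonymous FTP",
    "wais": lambda rest: f"query WAIS source {rest}",
    "gopher": lambda rest: f"ask Gopher server for {rest}",
}


def describe_retrieval(handle: str) -> str:
    """Pick the access method named by the prefix and apply it."""
    prefix, rest = split_handle(handle)
    try:
        method = ACCESS_METHODS[prefix]
    except KeyError:
        raise ValueError(f"no access method registered for {prefix!r}") from None
    return method(rest)


print(describe_retrieval("ftp://archive.example.org/pub/docs/readme.txt"))
print(describe_retrieval("wais:rfc-index"))
```

      Supporting a new kind of handle then only means adding an entry
      to the table, which is the flexibility asked for above.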
 
      Mailing lists: WWW-interest@nxoc01.cern.ch
                     WWW-talk@nxoc01.cern.ch
 
 
      OSI-DS 25:
      =========
 
      Steve Kille discussed the paper "Representing the Real World in
      an X.500 Directory".
 
      A Listing Service may be used to group like information items
      together, for example to provide a Yellow Pages Service.  It
      could, for instance:
      -  represent the members of a special interest group;
      -  group documents on a particular subject.
      Services such as Archie could be considered to be Listing
      Services.  One imagines an information universe in which
      Information Brokers provide different (say, subject-based) views
      via their listing services.  One would then need to locate the
      various listing services (using a mechanism such as a directory?).
 
      OSI-DS mailing list: osi-ds@cs.ucl.ac.uk
      Subscriptions:       osi-ds-request@cs.ucl.ac.uk
 
 
      UK British Library Project:
      ==========================
 
      Paul Barker described a project, sponsored by the BL, to
      represent grey literature (unpublished research papers) in the
      Directory.  The project is thought to be unlikely to succeed -
      but one of the aims is to demonstrate whether or not it is
      possible.  They will take the (UK) MARC records and model these
      within X.500.  They might also consider trying to provide a
      listing service so that the documents might be retrieved more
      readily by subject area.
 
 
      Prospero:
      ========
 
      Cliff Neuman described Prospero.  It follows a file system
      (rather than hypertext) model and is built on UDP.  It has the
      notion of a Directory which contains links to other objects
      (other directories or files).  It returns the link to the
      information object and then automatically retrieves the file
      using the appropriate access method (Archie, WAIS, NNTP, WWW -
      soon!, NFS, FTP etc.).  It has linked very successfully with
      Archie.  Cliff stated that he expected to be able to use X.500 to
      translate between the document ID and how to get the document.
      With Prospero the user has his own view of the global information
      base (or has a view built for him).  Cliff thought there should
      be multiple name spaces, but the difficulty would be that these
      would need representing near the top of the directory tree, which
      would be difficult to manage with multiple user-chosen views.
      Also, two users might refer to an object by different handles,
      each relative to their individual name space, which makes it
      difficult to pass references (say in a mail message) from one
      person to the other.
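
      The reference-passing difficulty can be shown with a toy model
      of per-user views.  This is not Prospero's actual mechanism: the
      user names, handles, global object names and the idea of
      qualifying a handle with its owner's view are all assumptions
      made for the example.

```python
# Each user's view maps the handles that user has chosen to a global
# object name.  All names here are invented placeholders.
VIEWS = {
    "alice": {"/papers/nir-notes": "global-object-17"},
    "bob": {"/reading/ietf/notes": "global-object-17"},
}


def resolve(user: str, handle: str) -> str:
    """Resolve a handle relative to one user's view of the name space."""
    try:
        return VIEWS[user][handle]
    except KeyError:
        raise LookupError(f"{handle!r} means nothing in {user}'s view") from None


def portable_reference(user: str, handle: str) -> tuple[str, str]:
    """Qualify a handle with the view it belongs to so it can be mailed."""
    resolve(user, handle)  # fail early if the handle is not valid
    return (user, handle)


# Alice and Bob name the same object differently.
print(resolve("alice", "/papers/nir-notes") == resolve("bob", "/reading/ietf/notes"))

# Alice's bare handle is useless to Bob ...
try:
    resolve("bob", "/papers/nir-notes")
except LookupError as err:
    print(err)

# ... but a reference that also names her view can be resolved by anyone.
view, handle = portable_reference("alice", "/papers/nir-notes")
print(resolve(view, handle))
```

      In this model a handle is only meaningful together with the view
      it came from, which is why a bare handle in a mail message is not
      enough.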
 
      Mailing list: info-prospero@isi.edu
 
 
      System 33
      =========
 
      Larry Masinter talked about a project at Xerox PARC.
      There was the concept of a
      -  HANDLE
              32-byte number (a content ID)
      -  FILE Location  (6 part)
              Protocol; Host; Path; Piece; Format; Timeout
      -  Description
              (normal "Catalogue" information:
               Name, Author, etc.)
      -  Document
 
      There is format negotiation when a document is retrieved.
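
      A hedged sketch of how these pieces might fit together is given
      below.  The field types, the use of SHA-256 to obtain a 32-byte
      handle, the "first acceptable format wins" rule and the example
      host are all assumptions; the talk as noted here gave no such
      detail.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileLocation:
    """The six parts listed above; the types are guesses."""
    protocol: str
    host: str
    path: str
    piece: str
    format: str
    timeout: int  # seconds; the unit is an assumption


def make_handle(content: bytes) -> bytes:
    """A 32-byte content ID; SHA-256 is used only because it yields 32 bytes."""
    return hashlib.sha256(content).digest()


def negotiate_format(
    accepted: list[str], locations: list[FileLocation]
) -> Optional[FileLocation]:
    """Return the location whose format the client prefers most, if any."""
    for wanted in accepted:  # client's formats, most preferred first
        for location in locations:
            if location.format == wanted:
                return location
    return None


# Invented example: one document offered in two formats.
locations = [
    FileLocation("ftp", "archive.example.org", "/docs/report.ps", "whole", "postscript", 30),
    FileLocation("ftp", "archive.example.org", "/docs/report.txt", "whole", "ascii", 30),
]
handle = make_handle(b"example document body")
choice = negotiate_format(["ascii", "postscript"], locations)
print(len(handle), choice.path)  # 32 /docs/report.txt
```

      Here the client lists the formats it can handle in order of
      preference and the first one that is on offer is chosen.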
 
      Access control was also considered: the ACL is part of the
      description.  The Server exploits multiple protocols for search
      and retrieval.
 
      There is a problem with dealing with different types of document:
 
         -  applications for jobs
         -  product specs.
         -  memos
         -  contracts
         -  faxes
         -  etc.
 
      It is difficult to normalise the attributes of a general
      document.
 
 
      Summing up
      ==========
 
      Tim Berners-Lee summed up by saying that all the applications
      described had a need for a unique document ID (UDI) and for a
      name service for this.  The UDI needed to be resolvable.  (This
      is not the same as the UDSN, the content ID described earlier.)
      There should be a WG on the details of the UDI (though this
      needs a better name) and a separate one for the UDSN, and there
      is a need for a single resolver for both.  Chris Weider agreed to
      co-author a document on the issues.
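
      The distinction between the two kinds of identifier, and the
      idea of one resolver serving both, can be sketched as follows.
      The class, its method names and the identifier strings are
      invented for illustration; no such service was specified at the
      meeting.

```python
from collections import defaultdict


class Resolver:
    """Toy name service: UDI -> location, content ID -> all matching UDIs."""

    def __init__(self) -> None:
        self._location_by_udi: dict[str, str] = {}
        self._udis_by_content_id: dict[str, set[str]] = defaultdict(set)

    def register(self, udi: str, content_id: str, location: str) -> None:
        """Record one instance of a document."""
        self._location_by_udi[udi] = location
        self._udis_by_content_id[content_id].add(udi)

    def resolve(self, udi: str) -> str:
        """A UDI names one instance, so it resolves to exactly one location."""
        return self._location_by_udi[udi]

    def instances(self, content_id: str) -> list[str]:
        """A content ID identifies content, so it may match several instances."""
        return sorted(self._udis_by_content_id.get(content_id, set()))


# Invented identifiers: two instances of the same content on two hosts.
resolver = Resolver()
resolver.register("udi-0001", "content-abc", "ftp://host-a.example.org/docs/report.txt")
resolver.register("udi-0002", "content-abc", "ftp://host-b.example.org/mirror/report.txt")

print(resolver.resolve("udi-0002"))
print(resolver.instances("content-abc"))  # ['udi-0001', 'udi-0002']
```

      Looking up a UDI answers "where is this instance?", while
      looking up a content ID answers "which instances hold the same
      document?".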
 
      I suggested that it might be useful to try just doing this, that
      is, to have a pilot project to try putting UDIs in the directory
      for a set of files and to have the Gopher, Prospero and Archie
      people try to utilise these.
 
 
 
 Concluding Remarks
 ==================
 
 So all in all a very worthwhile meeting.  The problems of NIR have
 been aired.  The various players in the field have been made aware (if
 they were not) of the work of the others.  Some plans for practical
 collaboration have already been formed.
 
 These issues will be discussed further at the Joint European
 Networking Conference in May, RARE WG3 USIS meetings, future IETF
 meetings and of course on the various mailing lists.
 
 Further links have also been made between the IETF User Services Area
 people and RARE WG3 USIS members, which will enhance collaboration.
 
 Finally, a reminder that these notes are my view of the IETF.  They
 may not be an accurate view, and certainly do not cover the wide range
 of topics discussed at the workshop.
 
 Jill Foster
 09.04.92