New URM paper with additions!

Michael Mealling <ccoprmm@oit.gatech.edu> Wed, 20 October 1993 00:31 UTC
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Michael Mealling <ccoprmm@oit.gatech.edu>
Message-Id: <199310192100.AA10993@oit.oit.gatech.edu>
Subject: New URM paper with additions!
To: uri@bunyip.com
Date: Tue, 19 Oct 1993 17:00:29 -0400
This is a new version of the paper I submitted in Amsterdam. I will be
in Houston this time to defend it. It is also available via
<http://www.gatech.edu/urm.paper>. If you have any questions please
copy the list so everyone can benefit.

----
Michael Mealling
Michael.Mealling@OIT.gatech.edu
Georgia Tech
July, 1993

Uniform Resource Identifiers: The Grand Menagerie

(NOTE: This paper makes the assumption that the intended audience has
working knowledge of URIs and the past work of the URI-WG of the IETF.  It also
uses names for things that as yet have not been agreed upon within the working
group.  If you don't agree with what a specific entity is called then please 
insert your favorite TLA where needed.) 

1: Introduction

Currently, there are two issues facing the URI working groups: encoding of
meta-information and Uniform Resource Name (URN) to Uniform Resource
Locator (URL) resolution.  The first is causing considerable trouble within the
working groups because meta-information is among the most important
information to the user.  The second, while not as volatile as meta-information,
will soon be very important as many people start using the new URN
specifications in real applications.  (NOTE: For the rest of this paper the act
of resolution will be depicted with the "->" notation, i.e.: URN->URL means URN
to URL resolution.) 

Presented here is a set of items that should offer an acceptable solution to
both problems.  For meta-information the author proposes the creation of an
additional URI entity called Uniform Resource Meta-information (URM).  This
entity will be used to encode meta-information such as filesize, type, title, 
author and version.  The URM does not completely solve this problem though.  
How do you associate a URM with a given URL or URN?  This is where a Uniform 
Resource Template (URT) comes into play.  A URT is simply a template with only
three valid attributes: URN, URL and URM.  The URT solves two problems:
URN/URL/URM encapsulation and URN/URL/URM resolution and transport.
Resolution and transport exploit the fact that a URT is a template that can
easily be used by a whois++ [1-Fullton] server to search for a given URN.

2: The Uniform Resource Meta-information (URM)

2.1 Functionality

The URM is designed to provide for a non-persistent meta-information encoding
scheme.  It is meant to be used in conjunction with items called transponders 
[1-Weider 93] and other network resources that maintain and use information
that describes the resource itself.  URMs are meant to be used specifically in
conjunction with URLs as a locally cached entity used to cut down on the number
of times a client requests information from the network.  They are meant to be
human readable as well as machine readable.  This means that certain fields can
have specific internal syntax, but that this internal syntax is not to be 
defined here.  This allows for machine readable data to co-exist beside human 
readable data.  One caveat to this is the possibility of having encoded data 
within the URM.  While this probably will happen due to transmission encoding 
problems it should not be encouraged.  

2.2 URM Sections Explanation

A URM (like URLs and URNs) has distinct sections to it: the wrapper, the 
encoding format scheme, and the list of encoded items.  The syntax is:

URM:Format_Scheme::"Data_Item"::"Data_Item"...::"Data_Item"::: 

2.2.1 The wrapper

The wrapper consists of the 4 character header "URM:" and a 3 character trailing
":::" with items in between these two.  Note that, unlike a URN in which the 
trailing colons are required, a URM doesn't really need to have the trailing 
colons.  The end of a URM can just as easily be a double quote followed by a 
carriage return.  The 3 colons are simply meant as a standardization.  If the 
working group decides that the ending wrapper is not needed, then dropping it 
from this specification alters nothing.  

2.2.2 The Format_Scheme

The Format_Scheme is made up of three fields: the Format, language, and
character set specifiers.  Their format is:

Format:language.character_set

Format is a single identifier naming one of the allowed meta-information
encoding schemes.  Other encoding schemes exist, but too many of them would
render a URM useless, so it is suggested that only a very limited number of
encoding schemes be allowed and that those allowed be registered with the
IANA.  This is for discussion among the IIIR working groups [2-Weider 93].
This paper puts forth one as a good solution to most encoding problems.

The IAFA working group of the IETF has developed a very large list of field 
names and allowed data elements that are used to describe the various 
attributes of an item on an FTP site.  This list is comprehensive enough to be 
used as a URM encoding scheme.  This paper suggests that the identifier string 
'IAFA' be used as a Format_Scheme.  This is dependent on the work of the 
"Data Elements Working Group" that may or may not exist at this writing.  
Realizing that the data elements may or may not be an attainable goal some have
put forth the use of SGML as a method of encoding information without breaking 
everything.  This would simply mean that, instead of "IAFA", the identifier 
would be "SGML".  This paper only makes a few suggestions as to URM encoding 
schemes.  URMs are a method to allow us to "black box" meta-information so the 
Working Groups can get something out that is useful.  

Language specifies which language the resulting encoded information is in.  This
is specified in one of two formats.  The first uses ISO 639 language and ISO 3166
country codes.  The second uses the value "MIME" as a specifier which denotes
that the value within the double quotes is a MIME encoded set of
meta-information.  This allows for other character formats to be encoded 7-bit
clean to allow for easy transmission.  The actual format of the data within the
MIME package should only be one of the allowed Formats from the initial portion
of the Format_Scheme.  Examples of both follow.  The non-MIME format is:

languagecode_countrycode

For example, British English would be represented as (the ISO 3166 code for
the United Kingdom is GB): 

en_GB

Character_set should be the ISO name for each allowed character set.  An
example would be (NOTE: This makes the URM non-7bit-clean!):

IAFA:en_US.iso88591

The following example forces the encoded MIME data between the double quotes
to be a MIME encoded IAFA template.  This assumes some sort of cooperation
with the MIME working groups to specify what a MIME encoded IAFA template
looks like.  

IAFA:MIME
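
(NOTE: As an illustration of the Format_Scheme layout described above, the
following Python sketch splits a Format_Scheme string into its fields.  The
names FormatScheme and parse_format_scheme are hypothetical and are not part
of this proposal.  The sketch assumes the non-MIME form is always
Format:language_country.charset and that the MIME form carries no character
set, as in the two examples above.)

```python
from typing import NamedTuple, Optional


class FormatScheme(NamedTuple):
    """Fields of a Format_Scheme (hypothetical helper type)."""
    format: str               # encoding scheme, e.g. "IAFA" or "SGML"
    language: Optional[str]   # ISO 639 language code, None for MIME
    country: Optional[str]    # ISO 3166 country code, None for MIME
    charset: Optional[str]    # character set name, None for MIME
    mime: bool                # True when the language field is "MIME"


def parse_format_scheme(text: str) -> FormatScheme:
    """Split 'Format:language_country.charset' or 'Format:MIME'."""
    fmt, colon, rest = text.partition(":")
    if not (fmt and colon and rest):
        raise ValueError(f"not a Format_Scheme: {text!r}")
    if rest == "MIME":
        # Assumption: the MIME form has no language or character set fields.
        return FormatScheme(fmt, None, None, None, True)
    locale, dot, charset = rest.partition(".")
    language, underscore, country = locale.partition("_")
    if not (dot and underscore and language and country and charset):
        raise ValueError(f"expected language_country.charset, got {rest!r}")
    return FormatScheme(fmt, language, country, charset, False)
```

A client could call parse_format_scheme("IAFA:en_US.iso88591") before decoding
the data items, and reject any Format it does not recognize.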

2.2.3 The list of encoded items

The list consists of one or more data items surrounded by quotation marks and
separated by double colons.  This is the section where the actual data is 
encoded.  White space of any type is allowed here.  If quotation marks are 
needed within these items, then they should be quoted with a '\' in the C 
style of special character quoting.  It should be noted that some transport 
protocols put restrictions on white space and non-printable characters.  These 
should be taken into account when transporting URMs around the net.

An example follows:

URM:IAFA:en_US.iso88591::"Author: John Doe"::"
Title: \"My Book\"
"::"
Format: PostScript
":::

(Note the Carriage Return at the beginning and end of some fields.  This is 
simply an illustration of the inclusion of non-printable characters.)
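
(NOTE: The quoting rules above can be sketched in Python as follows.  This is
only an illustration under stated assumptions, not a reference implementation.
The names parse_urm and format_urm are hypothetical.  The sketch treats the
trailing ":::" as optional, per section 2.2.1, and additionally assumes that a
literal backslash inside an item is written as two backslashes, which this
paper does not specify.)

```python
from typing import List, Tuple


def parse_urm(text: str) -> Tuple[str, List[str]]:
    """Return (format_scheme, items) for a single URM string."""
    if not text.startswith("URM:"):
        raise ValueError("missing URM: wrapper")
    head, sep, _ = text[4:].partition('::"')
    if not sep:
        raise ValueError("no quoted data items found")
    pos = 4 + len(head)              # index of the first '::' separator
    items: List[str] = []
    while text.startswith('::"', pos):
        pos += 3                     # skip '::' and the opening quote
        chars: List[str] = []
        while pos < len(text) and text[pos] != '"':
            if text[pos] == "\\" and pos + 1 < len(text):
                pos += 1             # C-style escape: take next char literally
            chars.append(text[pos])
            pos += 1
        if pos >= len(text):
            raise ValueError("unterminated data item")
        items.append("".join(chars))
        pos += 1                     # skip the closing quote
    # The trailing ':::' is treated as optional.
    if text[pos:].strip() not in ("", ":::"):
        raise ValueError(f"unexpected trailing text: {text[pos:]!r}")
    return head, items


def format_urm(format_scheme: str, items: List[str]) -> str:
    """Build a URM string, escaping backslashes and quotation marks."""
    if not items:
        raise ValueError("a URM needs at least one data item")
    quoted = [
        '"' + item.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for item in items
    ]
    return "URM:" + format_scheme + "::" + "::".join(quoted) + ":::"
```

Scanning character by character, rather than splitting on "::", matters
because a data item may itself contain colons, as "Author: John Doe" does.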

2.3 Syntax Specifics

Below is a BNF-like syntax for a URM.  Where spaces are allowed they are listed 
in addition to other characters.  Square brackets '[' and ']' are used to 
indicate optional parts.  Single letters and digits stand for themselves.  All 
words of more than one letter are either expanded further in the syntax or 
represent themselves.

urm             URM:Format_Scheme::Item[::Items]:::
Format_Scheme   Format:Language.Character_Set
Format          ascii
Language        isoLanguageCode_isoCountryCode
isoLanguageCode ascii
isoCountryCode  ascii
Character_Set   ascii
Items           Item[::Items]
Item            "xalphas"
alphas          alpha[alphas]
xalphas         xalpha[xalphas]
xalpha          alpha[:]
alpha           any character defined in any ISO recognized character set
                except for a ':'
ascii           any printable ASCII character except ':'

3: The Uniform Resource Template (URT)

3.1: Functionality

A URT is a method for showing relationships between URNs, URLs, and URMs and
encapsulating them all within an entity that can be passed around as one token.
It utilizes simple parsing rules based on URNs having precedence over URLs and
URLs having precedence over URMs.  This allows a URT to contain most of the
information needed about a given network resource in one cache-able chunk of
data.

The format of a URT exploits the format of each individual component.  URNs and
URMs both start with an identifier ending with a colon.  By allowing for a 
"URL:" wrapper for URLs we end up with a list of components that naturally fall
into template format.  

We can use the template format to our advantage by using the whois++ server
protocol as a way to resolve URNs to URLs and to cache meta-information with
URLs to make network access more efficient.  Also, with the use of centroids, 
the URTs can be searched globally (this depends on whether the IIIR group
decides if centroids scale or not).[1-Fullton]

NOTE: Many on the mailing list have expressed concerns about requiring the
URL: wrapper for URLs.  Tim Berners-Lee has pointed out that URN: and URM:
are nothing more than other transport schemes similar to http: and gopher: in
URLs.  This is an acceptable change since it would only require every client to
know which items were URNs and which were URLs.  The only problem the
author sees with this is that new URIs would require software updates in
order to know what the new URI was.

3.2: URT Contents

A URT can contain any number of URNs, URLs and URMs delimited by the
associated wrapper for each URI.  The order of each of these in the file is
important since that is how a client would determine which URL went with which
URN.
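
(NOTE: As a rough illustration of how a client might break a URT into its
components using the wrappers, consider the Python sketch below.  The name
split_urt is hypothetical.  The sketch assumes that every URI starts at the
beginning of a line and that no line inside a quoted URM data item begins
with "URN:", "URL:" or "URM:"; a more careful client would track quoting.)

```python
import re
from typing import List, Tuple

# A wrapper is recognized only at the start of a line (an assumption).
_WRAPPER = re.compile(r"^(URN|URL|URM):", re.MULTILINE)


def split_urt(text: str) -> List[Tuple[str, str]]:
    """Return (kind, body) pairs in the order they appear in the URT."""
    matches = list(_WRAPPER.finditer(text))
    parts: List[Tuple[str, str]] = []
    for i, match in enumerate(matches):
        # A component runs until the next wrapper or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].rstrip("\n")
        parts.append((match.group(1), body))
    return parts
```

The order of the returned pairs is the order in the file, which is what a
client needs in order to decide which URL goes with which URN.  A null URT
yields an empty list.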

What is not apparent is why multiple URNs should be allowed in the same file.
This is useful for caching information about related resources.  For example, 
the URT for The Declaration of Independence could also include a URN (and
associated URLs and URMs) for the Federalist Papers.  This saves the user from
going back to the network to retrieve meta-information that is closely related 
to what they have already received.

How multiple URNs in a URT are handled is deliberately left flexible.  Some
clients may wish to ignore any other occurrences of URNs while others may wish
to parse a very large URT with large numbers of related URNs.  This is left up
to client implementations.  The only requirement is that they must at least be
able to handle the file.  There is no requirement that they keep or use the
additional information.

3.3: Ordering Rules

3.3.1 URI Rules of Precedence

In order for the numerous UR* in a URT to make sense, there must be order to
the sequence of items.  The order that makes the most sense is based on
expected time to live.  A URN is meant to be unique over all time and eternity;
therefore, the first occurrence of a URN must have precedence over all other UR*
in that URT.  A URL is meant to be unique to the location of the document.  The
document itself may change, which would cause its meta-information to change,
but not its URL.  Thus, a URL has precedence over a URM.  Finally, another
occurrence of a URN denotes a new resource that has precedence over subsequent
UR* in the URT.

Also, a URT does not need to contain every kind of URI to be a URT.  A URT
can be made up of just URLs and URMs without a corresponding URN.
Conversely, a URT can have only URNs and URLs, or just URNs, or just URLs, or
even just a collection of URMs.  You can even have a null URT which contains
nothing.

3.3.2 URI Combination Rules

With a URT having some internal structure, certain scenarios become apparent
when certain combinations of UR*s occur.  Listed below are several different
combinations of URNs, URLs and URMs that denote different resource
relationships:

URN, URL and URMs denote one specific instantiation of a network resource at a
specific location on the net.  This is useful for pointing a client at the 
closest source for a resource.

URN and URMs denote meta-information about a URN that is global to all
occurrences of that URN.  If a URL comes after that URM then any URMs after that
URL modify the global URM only for that URL.

URN and URLs denote multiple resource locations with no meta-information.

URL and URMs specify a location with its associated meta-information but with
no URN.  This is useful for resources that are too transient to deserve a URN.

URN, URL, URM, URN and UR* denotes a URN that has "related" URNs to show
relationship between wholly different resources.  This is used to cache closely
related objects to reduce calls to the network for related meta-information.  
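
(NOTE: The precedence and combination rules above can be sketched as a single
pass over the components of a URT.  The Python below is an illustration only.
The names Location, Resource and group_urt are hypothetical, and the input is
assumed to be a list of (kind, body) pairs, with kind being "URN", "URL" or
"URM", in the order they appear in the URT.)

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Location:
    url: str
    urms: List[str] = field(default_factory=list)   # URMs for this URL only


@dataclass
class Resource:
    urn: Optional[str] = None                        # None when no URN given
    urms: List[str] = field(default_factory=list)   # URMs global to the URN
    locations: List[Location] = field(default_factory=list)


def group_urt(parts: Iterable[Tuple[str, str]]) -> List[Resource]:
    """Group URT components into resources by order of appearance."""
    resources: List[Resource] = []
    current: Optional[Resource] = None
    for kind, body in parts:
        if kind not in ("URN", "URL", "URM"):
            raise ValueError(f"unknown URI kind: {kind!r}")
        if kind == "URN":
            # Every URN starts a new resource.
            current = Resource(urn=body)
            resources.append(current)
            continue
        if current is None:
            # URLs or URMs with no preceding URN form an unnamed resource.
            current = Resource()
            resources.append(current)
        if kind == "URL":
            current.locations.append(Location(url=body))
        elif current.locations:
            # A URM after a URL applies to that URL only.
            current.locations[-1].urms.append(body)
        else:
            # A URM before any URL is global to the URN.
            current.urms.append(body)
    return resources
```

Each resulting Resource corresponds to one of the combinations listed above:
for example, a Resource with a URN, no global URMs, and several Locations
without URMs is the "URN and URLs" case.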

3.4 Example URTs

The following example URTs are not exhaustive.  They are only used to give a
hard example of a URT in order to show the structure:

URN:IANA:626::Dir:6345:::
URL:gopher://gopher.gatech.edu:2048/11/Computing.Resources
URM:IAFA:en_US.iso88591::"
Author: Michael Mealling
"::"
Subject: OIT Computing Resources
":::
URL:http://www.gatech.edu/Computing.Resources
URM:IAFA:en_US.iso88591::"
Author: Michael Mealling
"::"
Subject: OIT Computing Resources (OIT Home Page)
"::"
Size: 16k
":::

4: whois++ servers as URN->URT servers

Since a URT is simply a template and whois++ was specifically built to handle
anything in template form, it seems logical to use whois++ as a resolution
scheme.  It allows the resolver to handle update records within the protocol
instead of as a separate function.  It also has the added function of allowing
for centroids that make global searches of meta-information easier and faster
(NOTE: This assumes that centroids will scale!).  For more information this
paper defers to the whois++ specification [1-Fullton] and the WNILS-WG.

5: References

[1-Weider 93] 
   Weider, Chris.  Resource Transponders, March 1993.  Available as 
   ftp://cnri.reston.va.us/internet-drafts/draft-ietf-iiir-transponders-00.txt 
[2-Weider 93] 
   Weider, Chris and Deutsch, Peter.  A Vision of an Integrated Internet
   Information Service, March 1993.  Available as 
   ftp://cnri.reston.va.us/internet-drafts/draft-ietf-iiir-vision-00.txt 
[3-Weider 93] 
   Weider, Chris and Deutsch, Peter.  Uniform Resource Names, October 1993.
   Available as 
   ftp://cnri.reston.va.us/internet-drafts/draft-ietf-uri-resource-names-01.txt 
[Berners-Lee 1993] 
   Berners-Lee, Tim.  Uniform Resource Locators, March 1993.  Available as 
   ftp://cnri.reston.va.us/internet-drafts/draft-ietf-uri-url-01.txt 
[1-Fullton] 
   Fullton, Jim, Weider, Chris and Spero, Simon.  Architecture of the Whois++
   Index Service, March 1993.  Available as 
   ftp://cnri.reston.va.us/internet-drafts/draft-ietf-wnils-whois-01.txt 
-- 
------------------------------------------------------------------------------
Michael Mealling                     ! Hypermedia WWW, WAIS, and gopher will be
Georgia Institute of Technology      ! here soon via MIME. Your view of the 
Michael.Mealling@oit.gatech.edu      ! internet is about to change completely!