A Formalism for Internet Information References
"Daniel W. Connolly" <connolly@hal.com> Fri, 15 April 1994 22:51 UTC
Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa13853; 15 Apr 94 18:51 EDT
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa13848; 15 Apr 94 18:51 EDT
Received: from mocha.bunyip.com by CNRI.Reston.VA.US id aa21585; 15 Apr 94 18:51 EDT
Received: by mocha.bunyip.com (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA19011 on Fri, 15 Apr 94 16:59:49 -0400
Received: from hal.COM by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA19007 (mail destined for /usr/lib/sendmail -odq -oi -furi-request uri-out) on Fri, 15 Apr 94 16:59:38 -0400
Received: from ulua.hal.com by hal.com (4.1/SMI-4.1.1) id AA03932; Fri, 15 Apr 94 13:59:31 PDT
Received: from localhost by ulua.hal.com (4.1/SMI-4.1.2) id AA04760; Fri, 15 Apr 94 15:45:05 CDT
Message-Id: <9404152045.AA04760@ulua.hal.com>
To: Liam Relihan <relihanl@ul.ie>, Phill Hallam-Baker <hallam@alws.cern.ch>
Cc: Multiple recipients of IETF URI WG mailing list <uri@bunyip.com>, Multiple recipients of WWW Talk discussion list <www-talk@www0.cern.ch>
Subject: A Formalism for Internet Information References
Date: Fri, 15 Apr 1994 15:45:02 -0500
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: "Daniel W. Connolly" <connolly@hal.com>
Here's something that's been brewing in my mind for a couple years now. It's still fairly rough, but it hangs together, and I'd like to see it form a basis for disussing things like URNs and such: A Formalism for Internet Information References Daniel W. Connolly $Id: formalism.txt,v 1.4 1994/04/15 20:30:56 connolly Exp $ http://www.hal.com/users/connolly/drafts/formalism.txt ABSTRACT ======== This is a mathematical model for references between typical information objects in the internet community; specifically, the model covers the URL concept from the World Wide Web application and the external body mechanism from MIME. I suggest that this model be used to define the requirements for specifications of new URL schemes and MIME access types, and that it provides basis for other extensible reference mechanisms such as the URN and URC mechanisms as variously proposed in the URI working group. INTRODUCTION ============ We begin with definitions of some commonly used notions: The set of octets, or bytes: Octet = { 0, 1, 2, ..., 255 } Strings, sequences, or streams of bytes: Using the notation SEQ{S}, short for { seq | for some N, seq : {0, 1, 2, ... N} -> S } OctetStream = SEQ{Octet} Then we formalize some of the MIME notions: ContentType = { text/plain, application/octet-stream, ... } /* just a finite set. Properties or the elements, aside from identity, are not defined */ Entity = ContentType x OctetStream And finally, we begin with a function to model the resolution of references between entities in terms of an as-yet undefined set of References: Resolve : Reference -> Entity While in general, internet resources are do not all fit the definition of Entity (telnet servers, ftp directories, etc.), the information units transferred between information clients and servers generally do. Thus we begin by discussing references to entites, and we define references to other sorts of resources in terms of the basic Resolve function. For example, the basic model of computation for a WWW client goes: 1. Given some reference r0, compute e0 := Resolve(r0) by sending r0 to the server and getting e0 back. 2. Present e0 3. Let the user choose a reference r1 from e0 (click on an anchor, select a number, drag-select a range of text, enter some search words, etc.) 4. Compute e1 := Resolve(r1) as above 5. Iterate steps 2 thru 4 THE BASIC WEB "GLOBAL FILESYSTEM" REFERENCE =========================================== The WWW reference mechnism, the URL, is a generalization of a filesystem pathname. Dirs <: SEQ{OctetStream} /* <: meaning subset */ File <: OctetStream A normal filesystem pathname doesn't satisfy the properties of a reference for several reasons: The pathname "x/y/z" may reference different entities depending on the "current directory" of the client process. So the relation: Dirs x Filename x Entity is not a function. This can be addressed by using full pathnames. But the path "/etc/motd" references different entities on different hosts. So the WWW application adds hostname information to the path to make it globally unique. Host <: OctetStream /* internet host names */ But even "local-file://host/dir/file" can reference different entities at different times. But if we include a time argument Time = { real numbers } /* time in seconds since start of C.E. calendar, written here in ISO time format: YYYYMMDDHHMMSSZ. For the purposes of this discussion, a one second clock resolution is assumed to be sufficient */ then we have a well defined function: LOCAL-FILE.GET : (Host x Dirs x File) x Time -> OctetStream Many filesystems don't associate an explicit ContentType with every file. But we assume that the following function is well-defined: Resolve.LOCAL-FILE : Host x Dirs x File x Time -> Entity For example, we can require that the content type be part of the reference, or we can use heuristics guided by the filename. So with a content type ct in hand, we have Resolve.LOCAL-FILE(h, d, f, t) = (ct, LOCAL-FILE.GET((h, d, f), t)) Or with a content type heuristic InferType : File -> ContentType we can have Resolve.LOCAL-FILE(h, d, f, t) = (InferType(f), LOCAL-FILE.GET((h, d, f), t)) The local-file: access type shares semantics so far with the anon-ftp: access type. In general, the ftp RETR operation can be modelled as: User <: OctetStream Account <: OctetStream u { {} } /* account is optional */ FTP.GET : (Host x User x Account x Dirs x File) x Time -> OctetStream Note 1: we don't consider the password, since the retrieved entity does not vary as a function of the password Note 2: we don't deal with type ASCII versus IMAGE transfers here either. I don't think has significant impact on the model, but @@ I need to think about it some more... The anon-ftp: access type is a simplification of FTP that obviates the User and Account parameters: ANON-FTP.GET : (Host x Dirs x File) x Time -> OctetStream where ANON-FTP.GET((h, d, f), t) = FTP.GET((h, "anonymous", {}, d, f), t) And gopher is similar: Port = { 0, 1, 2, ... ,65535 } /* IP port number */ Selector = SEQ{Octet - { 9, 13 }} /* no tabs or newlines allowed */ Search <: OctetStream u { {} } /* @@ what subset? */ GType < ContentType /* Gopher types: subset of MIME types */ GOPHER.GET : (Host x Port x Selector) x Time -> OctetStream Resolve.GOPHER : Host x Port x Selector x Search x GType x Time -> Entity where Resolve.GOPHER(h, p, sel, ser, gt, t) = (gt, GOPHER.GET((h, p, sel, ser), t)) The HTTP 0.9 model is nearly identical to gopher: Path = SEQ{ { 33 .. 126 } } /* no whitespace or 8-bit chars */ HTTP0.GET : (Host x Port x Path x Search) x Time -> OctetStream Resolve.HTTP0 : Host x Port x Path x Search x Time -> Entity where Resolve.HTTP0(h, p, path, ser, t) = (text/html, HTTP0.GET((h, p, path, ser), Time)) The HTTP 1.0 model is significantly more complex. We'll get to that later. But in general, for a set Location* and a set Context*, if there are functions g and t and a set s where: g : Location* x Context* -> OctetStream t : Location* x Context* -> ContentType and For all c in Context* and l in Location*, Resolve((s, l, c)) = (t(l, c), g(l, c)) then we say that g is a GET function for Context* and Location*, t is a TYPE function for them, and s is a SCHEME for them. So far we have that LOCAL-FILE.GET is the GET function for Context* = Time and Location* = Host x Dirs x File, ANON-FTP.GET is a GET function for the same set -- hence the need for the distinguishing scheme. We now have some subsets of the set of References. If s is a SCHEME for a set Context* and a set Location*, with corresponding GET* and TYPE* functions, then For all c in Context*, l in Location* (s, l, c) is an element of Reference. since Resolve((s, l, c)) = (TYPE*(l, c), GET*(l, c)) Here we define several current URL schemes in this manner, and I suggest that in order to introduce a new URL scheme, we required a definitions its Location and Context sets along with definitions of GET and TYPE functions. LOCAL-FILE: /* defined somewhat informally, as this is defined in terms of the local host */ SCHEME = "local-file" (a string is an OctetStream, which is a set) Location = Host x Dirs x File Context = Time x (ContentType u { {} }) GET((h, d, f)) given informally: FILE *fp = fopen(path(d,f)); fread(fp...); TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f) else ct FTP: SCHEME = "ftp" Location = Host x User x Account x Dirs x File Context = Time x (ContentType u { {} }) GET((h, u, a, d, f)) as per FTP RFC: CONNECT h USER u; PASSWD ... Account a /* unless a is {} */ CWD d1 CWD d2 /*@@ how many CWDs? */ ... RETR f /*@@ type i/a command? */ TYPE((h, u, a, d, f), (t, ct)) := if ct = {}, InferType(f) else ct ANON-FTP: SCHEME = "anon-ftp" /* @@ sometimes written ftp */ Location = Host x Dirs x File Context = Time x (ContentType u { {} }) GET((h, d, f)) as per FTP RFC: CONNECT h USER anonymous; PASSWD <client's email address> CWD d1 CWD d2 /*@@ how many CWDs? */ ... RETR f /*@@ type i/a command? */ TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f) else ct GOPHER: SCHEME = "gopher" Location = Host x Port x GType x Selector x Search Context = Time GET((h, p, gt, sel, ser)) as per Gopher Protocol: telnet h:p if ser = {}, send sel else send sel<tab>ser TYPE((h, p, gt, sel, ser), t) = gt /* @@ other type mechansims? */ HTTP: SCHEME = "http" Location = Host x Port x Dirs x File x Search Accept = { f | f : ContentType -> Real } Context = Time x Accept GET((h, p, d, f, s), (t, a)) and TYPE((h, p, d, f, s), (t, a)) in terms of HTTP GET operation: compute path from d, f, s compute accept encoding from a telnet to h:p send: GET path HTTP/1.0 Accept: <accept encoding> As-Of: t /* @@ new HTTP header. Version? */ @@ forms in HTTP in MAILTO; MIME MAIL-SERVER access type @@ wais, prospero, finger? RELIABILITY ISSUES WITH FILE (AKA MUTABLE OCTET-STREAM OBJECT) REFERENCES ========================================================================= The following problem of references to mutable objects is sharted by several schemes: Suppose an author discovers a file at 1pm and constructs an HREF to that file; then, the file is modified at 2pm, and finally a reader resolves the reference at 3pm: the entity the reader sees is different from what the author saw. That is, GET(loc, tread) != GET(loc, tauthor) and hence Resolve(loc, tread) != Resolve(loc, tauthor) But in many cases, the author and the reader _can_ read the file at different times and get the same result. How can we determine if this is the case? One way is to use the time of last modification of the file. We can model this as a function lastmod satisfying: lastmod : Location* x Time -> Time where For all l in Location*, t1, t2 in Time If t1 >= lastmod(l, t2) then GET*(t2)(l) = GET*(t1)(l) Where GET* is a GET function for Location and Time. We'll call such a function a LASTMOD function for the set Location* and the function GET*. Note 1: I don't believe the FTP RFC specifies how to determine the time of last modification, but it can be determined reliably in many cases. @@ I need to look into this...) Note 2: I don't believe Gopher specifies a way to determine the time of last modification. Perhaps Gopher+ does. Then, in the above example, we have the following: If tauthor >= LASTMOD(loc, tread) Then GET(loc, tread) = GET(loc, tauthor) and hence Resolve((s, loc, tread)) = Resolve((s, loc, tauthor)) That is, if the reader can determine that the file hasn't changed since the link was made, s/he can have confidence in the integrity of the link. If the mod time is later, the client software can warn the reader before transferring the possibly erroneous data. So one way to enhance the typical HREF="local-file://host/dir/file" to be a well-defined Reference is to determine the time context of the reference. This can ben done in any of several ways: 1. Enhance local-file URLs to contain time info; e.g. the author writes: local-file://host/dir/file;time=19940414141000Z 2. Enhance the HTML A element to contain time info; e.g.: <A HREF="local-file://host/dir/file" TIME="19940414141000Z"> 3. If the reference is in a file whose modification time is known, we can make the heuristic assumption that all the references in the file were current as of the last modification of the file. Either of the first two is tedious for hand-edited documents. But they are simple to implement in any cut-n-paste or "paste link" feature. They also work nicely to fill in the missing content type. <A HREF="ftp://host/dir/file;user=connolly;account=general" date="19940414141000Z" content-type="application/postscript"> The third option is more convenient, though less robust. Consider: 1pm: file Y written with info on apples 2pm: file X says "see file Y for more on apples" 3pm: file Y modified to have info on oranged rather than apples 4pm: file X modified, but still says "see Y for apples" 5pm: reader of X cannot detect (by machine) that the link to Y is corrupt. Also: references get copied and pasted, quoted, etc. The time context of the reference should travel with the reference. Another technique for detecting faults in resolving references is to include the MD5 signature of the entity in the reference. This mechanism is somewhat expensive: it only detects faults _after_ transmission of the octet stream, and it requires computation proportional to the length of the entity. A much less reliable technique is to include the octet count of the entity in the reference. The FTP RFC includes a mechanism to detect this type of fault before transmission of the entity. (i.e. you can ask an FTP server how big a file is before you transfer it.) STRING REPRESENTATIONS OF REFERENCES ==================================== The string representation of a reference is an important part of the specification of various applications. But it is not clear that we need one string representation for all applications. It is much more critical that we understand the properties of the objects we are trying to represent. Once we have a definition of the set of object we're representing, we can choose various means to represent the set in various contexts, with well-defined mappings between syntaxes, for example, between W3's HREF="xxx" and MIME's external-body; access-type="xxx" syntaxes. A definition of a scheme syntax may include incomplete string representations, as long as it discusses mechanisms and/or heuristics to determine the remaining parts of a well-defined reference. LOCAL-FILE: string rep((h, d, f), (t, ct)) given in perl as &file_uri("local-file", $h, *d, $f, $t, $ct) where d is a list of the directories and t is in ISO time format sub file_rep{ local($scheme, $host, *dirs, $file, $time, $content_type) = @_; local($sep, $ret); if(@dirs){ $sep = "/"; } grep($_ = &uri_escape($_), @dirs); $ret = $scheme . "://" . &uri_escape(host) . "/" . join('/', @dirs) . $sep . $file; if($time){ $ret .= ";as-of=" . $time; # @# better name for as-of? } if($content_type){ $ret .= ";content-type=" . &uri_escape($content_type); } return $ret; } sub uri_escape{ local($_) = @_; s-[/?=;:]-sprintf("%%%02X", ord($&))-ge; # list of illegal chars? return $_; } FTP: string rep((h, u, a, d, f), (t, ct)) given in perl as &file_uri("ftp", "$u@$h", *d, $f, $t, $ct) # @@ account? where d is a list of the directories and t is in ISO time format NOTE: the string representation is often written without the time and content-type information. In those cases, the content type should be inferred from the filename and the time should be inferred from the context of the reference. ANON-FTP: string rep((h, d, f), (t, ct)) given in perl as &file_uri("anon-ftp", $h, *d, $f, $t, $ct) where d is a list of the directories and t is in ISO time format NOTE: the string representation is often written without the time and content-type information. In those cases, the content type should be inferred from the filename and the time should be inferred from the context of the reference. GOPHER: string rep((h, p, ty, sel, ser), t) given in perl as &file_uri("gopher", $p == 70 ? $h : "$h:$p", (), >ype_char($ct) . $sel, $t) . $ser ? "?$ser" : "" where d is a list of the directories NOTE: the string representation is often written without the time information. And unfortunately, there are no mechanisms for determining whether gopher references with different times refer to the same entity. @@md5 mechanism HTTP: string rep((h, p, d, f, s), (t, a)) given in perl as &file_uri("http", $p == 80 ? $h : "$h:$p", *d, $f, $t, max(a)) . $s ? "?$s" : "" where d is a list of the directories t in ISO time format NOTE 1: the string representation is often written without the time. In these cases, the time should be inferred from context. NOTE 2: the string representation can only represent a limited form of the content type metric function, i.e. "type x is best," and ofthen it is ommitted completely. In those cases, the client should encode a content type metric function based on its capabilities and the user preferences. RELATIVE NAMES ============== Essentially (@@ need more discussion), if we have Resolve(r0) = e0 and a name n is found inside e0, we apply the function Relative : Name x Reference -> Reference and compute Resolve(Relative(n, r0)) RELIABLE CACHING AND MIRRORING ============================== Essentially (@@ need more discussion), if a client wants to compute Resolve(r0), it can, for example, give the whole r0 reference to a proxy server s, ala: Resolve(r0) = Resolve((proxy, (s, r0))) and so the question becomes: is Resolve(r0) in the cache? In order to do mirroring, we need techniques to distribute information about circumstances where a client can substitute references, i.e. where Resoleve(r') = Resolve(r) but r' is easier to get to. This is where heirarchical namespaces come in: we'd like to distribute information of the form: For all *, Resolve("ftp://host1/dir1/*") = Resolve("ftp://host2/dir2/*") LOCATION-INDEPENDENT REFERENCES (URN's) ======================================= Rather than: Resolve : Scheme x Location x Context -> Entity we might as well write: Resolve : Scheme x Name x Context -> Entity for example we can define: RFC.GET : Name x ContentType -> OctetStream RFC.FILENAME : Name x ContentType -> File as RFC.FILENAME(n, c) = "n.txt" if c is text/plain, "n.ps" if c is application/postscript For all time t /* since rfc's don't change */ RFC.GET(n, c) = ANON-FTP.GET(("ds.internic.net", "/pub/rfc", RFC.FILENAME(n, c)), t) FUTURE DISCUSSION ================= @@ security, scalability, reliability (fault detection and tolerance, caching, migration, compression, encryption, authentication) Daniel W. Connolly "We believe in the interconnectedness of all things" Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010 <connolly@hal.com> http://www.hal.com/%7Econnolly/index.html
- A Formalism for Internet Information References Daniel W. Connolly
- Re: A Formalism for Internet Information Referenc… Liam Relihan