Re: how to make progress on the URL document
Tim Berners-Lee <timbl@ptpc00.cern.ch> Thu, 24 March 1994 02:31 UTC
Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa24955; 23 Mar 94 21:31 EST
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa24951; 23 Mar 94 21:30 EST
Received: from mocha.bunyip.com by CNRI.Reston.VA.US id aa13127; 23 Mar 94 21:29 EST
Received: by mocha.bunyip.com (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA05871 on Wed, 23 Mar 94 12:58:18 -0500
Received: from dxmint.cern.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA05842 (mail destined for /usr/lib/sendmail -odq -oi -furi-request uri-out) on Wed, 23 Mar 94 12:56:06 -0500
Received: from ptpc00.cern.ch by dxmint.cern.ch (5.65/DEC-Ultrix/4.3) id AA14936; Wed, 23 Mar 1994 18:55:48 +0100
Received: by ptpc00.cern.ch (NX5.67d/NX3.0S) id AA14263; Wed, 23 Mar 94 18:58:05 +0100
Date: Wed, 23 Mar 1994 18:58:05 +0100
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Tim Berners-Lee <timbl@ptpc00.cern.ch>
Message-Id: <9403231758.AA14263@ptpc00.cern.ch>
Received: by NeXT.Mailer (1.95)
Received: by NeXT Mailer (1.95)
To: "Mark P. McCahill" <mpm@boombox.micro.umn.edu>
Subject: Re: how to make progress on the URL document
Cc: mitra@pandora.sf.ca.us, uri@bunyip.com
Reply-To: timbl@www0.cern.ch
Mitra and Mark, you ask for diffs. You're not going to like them
because the formatting messes it up quite a lot but for what it's
worth here it is.

Tim

diff url-spec.txt /pub/www/doc/draft-uri-url-02.txt
2,3c2,3
< draft-ietf-uri-url-03.{ps,txt} URI working Group
< Expires 21 September 1994 21 March 1994
---
> draft-ietf-uri-url-02.{ps,txt} CERN
> Expires 1 July 1994 1 Jan 1994
8,9c8,9
< A Syntax for the Expression of
< Access Information of Objects on the Network
---
> A Unifying Syntax for the Expression of
> Names and Addresses of Objects on the Network
12,23c12
< ABOUT THIS DOCUMENT
<
< This document specifies a Uniform Resource Locator (URL), the
< syntax and semantics of formalized information for location and
< access of resources on the Internet.
<
< This document was written by the URI working group of the Internet
< Engineering Task Force. Comments may be addressed to the editor,
< Tim Berners-Lee <timbl@info.cern.ch>, or to the URI-WG
< <uri@bunyip.com>. Discussions of the group are archived at
<
< <http://www.acl.lanl.gov/URI/archive/uri-archive.index.html>
---
> Status of this memo
25,41d13
< This document is bound by the Requirements Specification in
< preparation.
<
< The work is derived from concepts introduced by the World-Wide Web
< global information initiative, whose use of such objects dates
< from 1990 and is described in "Universal Resource identifeirs for
< the World-Wide Web", RFCXXX.
<
< This document is available in hypertext form, with links to
< background information, as:
<
< <http://info.cern.ch/hypertext/WWW/Addressing/URL/Overview.html>
<
< .
<
< STATUS OF THIS MEMO
<
53c25,29
< Distribution of this document is unlimited.
---
> Distribution of this document is unlimited. Please send comments
> to the author as timbl@info.cern.ch. or to the discussion list
> ietf-url@merit.edu.
>
> Abstract
54a31,53
> Many protocols and systems for document search and retrieval are
> currently in use, and many more protocols or refinements of
> existing protocols are to be expected in a field whose expansion is
> explosive.
>
> These systems are aiming to achieve global search and readership of
> documents across differing computing platforms, and despite a
> plethora of protocols and data formats. As protocols evolve,
> gateways can allow global access to remain possible. As data
> formats evolve, format conversion programs can preserve global
> access. There is one area, however, in which it is impractical to
> make conversions, and that is in the names and addresses used to
> identify objects. This is because names and addresses of objects
> are passed on in so many ways, from the backs of envelopes to
> hypertext objects, and may have a long life.
>
> A common feature of almost all the data models of past and proposed
> systems is something whicch can be mapped onto a concept of "object"
> and some kind of name, address, or identifier for that object. One
> can therefore define a set of name spaces in which these objects
> can be said to exist.
>
> Practical systems need to access and mix objects which are part of
56a56
>
58a59,467
> different existing and proposed systems.
>
> This paper discusses the requirements on a universal syntax which
> can be used to encapsulate a name in any registered name space.
> This will allow names in different spaces to be treated in a common
> way, even though names in different spaces have differing
> characteristics, as do the objects to which they refer
>
> The universal syntax to objects available using existing protocols,
> and may be extended with technology. It makes a recommendation for
> a generic syntax, and for specific forms for "Uniform Resource
> Locators" (URLs)of objects accessible using existing Internet
> protocols.
>
> The syntax has been in widespread use by World-Wide Web software
> since 1990.
>
> Terms
>
> The objects on the network which are to be named and addressed
> include typically objects which can be retrieved, and objects which
> can be searched. There is a great variety of other objects which
> may support other operations. We imply nothing about the contents
> of objects in this document. Whereas human-readable documents are
> currently the center of interest of the field, we envisage all
> aspects discussed in this paper applying to generalized objects
> when systems to handle them become available. The "object" is the
> unit of reference and need not correspond to any unit of storage.
> We refer to objects which can be searched as "indexes". We
> emphasize that this is the abstract view of the client, and these
> objects need not correspond to physical files on computers. We
> refer to the person who does the retrieval or searchiing as the
> user.
>
> Within this document, we use the terms "name" very generally for a
> string of characters describing an object, whatever its
> combination of properties mentioned below. (The term usually has a
> narrower meaning but we needed some term for the universal set.).
> This uniform syntax applied to a generic name is known as a Uniform
> Resource Identifier (URI). The term "address" is reserved for an
> string which specifies a more or less physical location. The term
> "locator" refers to a URL as here defined. URIs which have a
> greater persistence than URLs are referred to as URNs.
>
> Characteristics
>
> This section characteristics of various naming schemes,
> requirements which some ofexisting schemes meet, and requirements
> for the URL scheme itself. URLs, as an introduction of and
> background for the Recommendations section.
>
> USES OF NAMES AND ADDRESSES
>
>
>
>
> Berners-Lee 2
>
> A name allows a user, with the help of a "client" program, to
> retrieve or operate on objects via a "server" program. A name may
> be passed for example:
>
> In communication of any form between two people, to refer to a
> document, or part of a document;
>
> As part of the description of a link associated with a hypertext
> document;
>
> As part of the result of searching an index.
>
> Some typical requirements on a name which are met to a varying
> degree by various schemes are for example that the name is
>
> Persistent A given name will remain valid as long as it
> is needed;
>
> Extensible A given naming syntax will remain valid
> through the introduction of new protocols and
> directory technologies;
>
> Resolvable A name will contain enough information to
> allow the document or index to which it
> refers to be accessed, perhaps via resolution
> into an intermediate, more physical, name.
>
> Unique Each object can only have one such name.
> The fact that two such names are different
> implies that the objects to which they refer
> are different (in some way).
>
> Unambiguous The fact that two names are identical
> implies that the objects named are the same
> (in some way).
>
> The syntax discussed is the syntax of one name, be it a lasting
> name or a physical address. When a directory server or hypertext
> link contains a set of alternative names, then that is beyond the
> scope of this syntax. Similarly, a syntax for describing a
> compound object is outside the scope of this syntax. The specific
> locator name spaces (defined under the umbrella of the general
> syntax) each meet the requirements above to a greater or lesser
> extent.
>
> CURRENT PRACTICE
>
> Current protocols use many different standards for names. For some
> protocols, such as ISO-10163 Search and Retrieve protocol[16], the
> names returned in a search are only valid during the session. For
> others, such as FTP[9], they are lasting names which may be used
> for object retrieval at a later time. Typically, however, they are
> not long-lasting names which are independent of the location of the
>
>
>
> Berners-Lee 3
>
> object. Such names may be provided using directory servers such as
> x.500. They will refer to the registration, however formal or
> informal, of a object with a particular organisation or person.
> Both hypertext and manual references rely on long- lasting names.
> Current names are basically location specifiers (addresses). These
> may be known as Uniform Resource Locators (URLs). They give the
> necessary parts of an address for a reader to access an information
> provider using the given protocol, and ask for the object required.
> Examples of names used by various protocols include
>
> File Transfer Protocol (Postel 1985):
>
> Host name or IP-address
>
> [TCP port]
>
> [user name, password]
>
> Filename
>
> W.A.I.S. (Kahle 1990)
>
> Host name or IP-address
>
> [TCP port]
>
> local document id
>
> Gopher (Alberti 1991)
>
> Host name or IP-address
>
> [TCP port]
>
> database name
>
> selector string
>
> HTTP (Berners-Lee 1991)
>
> Host name or IP-address
>
> [TCP port]
>
> local object id
>
> NNTP (Kantor 1986)
>
> NNTP group
>
> Group name
>
> NNTP article
>
>
>
> Berners-Lee 4
>
> Host name
>
> unique message identifier
>
> Prospero links (Neuman 1992)
>
> Host name or IP address
>
> [UDP port]
>
> Host specific object name
>
> [version]
>
> [identifier]*
>
> x.500 distinguished name
>
> Country
>
> Organisation
>
> Organisational unit
>
> Person
>
> Local object identifier
>
> Other systems with their own naming schemes include BITNET
> "LISTSERV" application, FTAM file retrieval, SQLnetTM remote
> database search, proprietary distributed file systems, etc.
> Conventional syntax for writing these addresses involve various
> forms of punctuation to separate these parts. This sometimes, but
> not always, allows the naming scheme to be deduced from the
> punctuation. For example, a name of the form
> xxx.yyy.zz.edu:/pub.aa.bb.cc often implies anonymous FTP access.
> However, there is no well-defined algorithm for parsing an
> arbitrary name, as there is no common syntax.
>
> EXPANDABILITY
>
> There will necessarily be a phase during which lasting names will
> become more common, as the deployment of directory services
> increases to the point where every user has direct or indirect
> access to one. Even then, however, one can envisage more than one
> competing directory system, and cases in which physical names are
> still required. A directory service takes a lasting name and
> reduces it to a physical address (or set of addresses) which,
> though less useful for lasting reference, is the only way to
> actually retrieve the object. An addressing syntax is required
> which will be able to encompass existing physical address spaces,
> and be extendible to any future protocols. This requires that it
> contain an identifier for the protocol in use. The format of the
>
>
>
> Berners-Lee 5
>
> rest of the address will necessarily depend to a certain extent on
> the protocol.
>
> RELEVANCE
>
> The life of a name is limited by any information contained within
> it which may become prematurely invalid. It is therefore necessary
> to limit the contents of a name to the information required for the
> operations above. Other extraneous information about the object
> (its size, data format, authorisation details, etc.) may in general
> change with time and should not be part of the name. One might
> expect such information to be part of the "header" of a object, and
> for protocols to allow the header information to be retrieved
> independently of the objects themselves. Any physical address may
> be subject to change with time: hence we encourage the move to
> lasting names and directory services.
>
> UNIQUENESS
>
> Clearly one requires unambiguous names in the sense that one name
> should refer to only one logical object. This is the case with all
> the addressing schemes in use, whether they are directory systems
> or physical addresses. (The internet addresses all rely on the
> domain name (Mockapetris 1987) of the host to achieve this).
> However, given that names can be translated, many apparently
> different names may lead to the same object. Any object may
> therefore be referred to by many names. One needs to be able to
> know whether two objects, retrieved through different paths, are
> in fact the same object. It is suggested that each object have a
> unique "official" name. This name could be stored in the object in
> some representations, or stored in a database accessible to the
> server, for example. Any references within that object should be
> parsed in the context of the official name. In the presence of a
> directory service, the official name will normally be the
> registered name of the object. However, a name in any scheme will
> do, so long as it is completely specified. On systems which do not
> allow the name to be stored (such as anonymous FTP archive sites),
> a possible ambiguity will always exist as to whether two similarly
> named objects are in fact the same. Note that Internet newsgroup
> names are unique world-wide, and news articles carry a unique
> message id. In most other cases, however, there is no guarantee
> that dereferencing a URL will work, or that if it does the object
> it refers to will in fact be the object intended. URLs such as FTP
> addresses are transient in that files may be moved and even
> replaced by different files of the same name. This disorganisation
> may be limited by good server management, but a naming scheme which
> is independent also of internet host name is obviously preferable.
>
> READABILITY BY PEOPLE
>
> This requirement has been put forward by several people (Clifford
> Lynch, Douglas Engelbart among others), and disputed by others.
> The author's view is that it will be a while before technology and
>
>
>
> Berners-Lee 6
>
> standardisation have reached the point at which names and addresses
> will be hidden from human beings. As long as they must be written
> on the backs of envelopes and "cut and pasted" between workstation
> windows, there is a strong need for names to be
>
> Short
>
> Composed of printable (preferably non-white) characters
>
> To a certain extent, understadable by a human being.
>
> STRUCTURE OF NAMES AND ADDRESSES
>
> A physical address is required in order for:
>
> The user's program to contact the server;
>
> The server to perform the operation (e.g. search and index,
> retrieve a object, or look up the name) and return a result;
>
> The user's program to locate an individual position or element
> within a returned object.
>
> This suggests that a name be structured, such that the parts
> necessary for these three operations be separate and only used by
> those system elements which need those parts. This corresponds to
> the basic principle of information hiding. In fact, four parts
> are necessary, including the indicator of the naming scheme to be
> used:
>
> The naming scheme: a registered identifier for the protocol.
>
> The name of a suitable server. The format of this part must be
> well defined. It will depend on the lower-layer protocols in
> use. Systems which use widely distributed information, such as
> x.500 and NNTP, do not need this part as each client generally
> contacts his nearest server (or a particular server).
>
> Information to be passed to the server. This may be private to
> the server, as all names may be generated and used by the same
> server. This part of the name should be opaque to the client.
>
> Information to be used by the application once the object has
> been retrieved. This part is private to the application (or,
> more strictly, the data format) and so cannot be defined here.
>
> Both lasting names and physical addresses often share a
> hierarchical structure. This follows often from the organisation of
> the system. From the naming point of view, it has the advantage
> that a reference in one object to another object need not include
> that part of the structure which is common to both names.
>
> CHOICES FOR A UNIVERSAL SYNTAX
>
>
>
> Berners-Lee 7
>
> The requirements above leave little room for choice save for the
> order and punctuation of the elements of an address. It is only
> reasonable for the order of writing of the parts to be consistently
> from left to right (or right to left) with increasing specificity.
> Punctuation schemes fall into two categories (Huitema 1991): tagged
> schemes in which field are given names, and fields which use
> special characters and field order. The latter tend to be more
> compact schemes.
>
>
> protocol: aftp host: xxx.yyy.edu path:
>
> /pub/doc/README
>
> PR=aftp; H=xx.yy.edu; PA=/pub/doc/README;
>
> PR:aftp/xx.yy.edu/pub/doc/README
>
> /aftp/xx.yy.edu/pub/doc/README
>
> Fig 1. Some alternative tagged and untagged representations
>
> The choice of special symbols for punctuation tends to be a matter
> of taste. It is easier to read addresses whose symbols correspond
> to those of one's favourite operating system. A variety of symbols
> is needed so that when a name is abbreviated it is possible to tell
> which parts have been omitted.
>
> The recommendation below uses special characters in order to
> achieve a compact name, and uses where possible punctuation symbols
> established in the internet or unix community.
>
> The choice of escape character for introducing representations of
> non-allowed characters also tends to be a matter of taste. An ANSI
> standard exists in the C language, using the back-slash character
> "\". The use of this character on unix command lines, however, can
> be a problem as it is interpreted by many shell programs, and would
> have itself to be escaped.
>
> There is a conflict between the need to be able to represent many
> characters including spaces within a URL directly, and the need to
> be able to use a URL in environments which have limited character
> sets or in which certain characters are prone to corruption. This
> conflict has been resolved by use of an hexadecimal escaping method
> which may be applied to any characters forbidden in a given
> context. When URLs are moved between contexts, the set of
> characters escaped may be enlarged or reduced unambiguously.
>
> The use of multiple white space characters is discouraged in URLs
> to be printed or sent by electronic mail. This is because of the
> frequent introduction of extraneous white space when lines are
> wrapped by systems such as mail, or sheer necessity of narrow
> column width, and because of the inter-conversion of various forms
>
>
>
> Berners-Lee 8
>
> of white space which occurs during character code conversion and
> the transfer of text between applications.
>
72c481
< URL SYNTAX
---
> FULL FORM
82,90c491,492
< PrePrefix
<
< To be a Uniform Resource Locator as currently defined by the URI
< working group, the whole string must start with a constant prefix
< "URL:". Note that to save space in this document, URLs have been
< quoted throughout without this preprefix.
<
< Scheme
<
---
> SCHEME
>
97,99c499,501
< Those schemes which refer to internet protocols mostly have a
< common syntax for the rest of the object name. This starts with a
< double slash "//" to indicate its presence, and continues until the
---
> Those schemes which refer to internet protocols have a common
> syntax for the rest of the object name. This starts with a double
> slash "//" to indicate its presence, and continues until the
112,116d513
<
<
<
< Berners-Lee 2
<
121c518,522
<
---
>
>
>
> Berners-Lee 9
>
156c557
< the syntax shall not be used unencoded in a URL.
---
> the syntax shall not be used in a URL.
162,167c563,566
< awkward in a given environment. Because a % sign always indicates
< an encoded character, a URL may be made safer simply by encoding
< any characters considered unsafe, while leaving already encoded
< characters still encoded. Similarly, in cases where a larger set
< of characters is acceptable, % signs can be selectively and
< reversibly expanded.
---
> awkward in a given environment. As a % sign always indicates an
> encoded character, a URL may be made safer simply by encoding any
> characters considered unsafe, while leaving already encoded
> characters still encoded.
170,174d568
<
<
<
< Berners-Lee 3
<
176c570
< hexadecimal or base 64 would be more appropriate.)
---
> hex or base 64 would be more appropriate.)
177a572,574
> The same considerations apply to mapping local fragment identifiers
> onto the fragmentid part of a URL.
>
179a577,580
>
>
> Berners-Lee 10
>
182c583
< protocols follow. The schemes covered are
---
> protocols follow.
184,208c585,593
< http Hypertext Transfer Protocol
<
< ftp File Transfer protocol
<
< gopher The Gopher protocol
<
< mailto Electronic mail address
<
< mid Message identifiers for electroni mail
<
< cid Content identifiers for MIME body part
<
< news Usenet news
<
< nntp Usenet news for local NNTP access only
<
< prospero Access using the prospero protocols
<
< telnet , rlogin and tn3270
< Reference to interactive sessions
<
< wais Wide Area Information Servers
<
< The schemes for x.500, network management database and whois++ have
< not been specified and may be the subject of futher study.
---
> HTTP
>
> The HTTP protocol specifies that the path is handled transparently
> by those who handle URLs, except for the servers which de-reference
> them. The path is passed by the client to the server with any
> request, but is not otherwise understood by the client. The
> fragmentid part is not sent with the request. The search part, if
> present, is sent. Spaces in URLs should be escaped for transmission
> in HTTP.
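
[Illustration only, not part of either draft or of the diff.] The HTTP
paragraph in the hunk above says the path and search part are sent to
the server, the fragmentid is not, and spaces are escaped. A minimal
Python sketch of that behaviour follows, assuming the standard-library
urllib.parse module; the function name http_request_target and the host
example.org are hypothetical, chosen only for this note. The "%" sign is
treated as safe so that already encoded characters stay encoded.

```python
from urllib.parse import quote, urlsplit


def http_request_target(url: str) -> str:
    """Return what a client would pass to the server for an HTTP URL.

    Hypothetical helper for illustration: keeps the path and the search
    part, drops the fragment, and escapes spaces (and other characters
    outside the chosen safe sets) as %XX.
    """
    parts = urlsplit(url)
    # "%" is listed as safe so existing %XX escapes are left untouched.
    target = quote(parts.path or "/", safe="/%")
    if parts.query:
        # The search part, if present, is sent along with the path.
        target += "?" + quote(parts.query, safe="=&+%")
    # parts.fragment is deliberately ignored: it stays with the client.
    return target


if __name__ == "__main__":
    print(http_request_target("http://example.org/a b/c?x y#frag"))
```

The safe-character sets here are an assumption made for the sketch; the drafts define their own lists of characters that must be encoded.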
210,214d594
< The url: prefix is reserved for use in encoding a Uniform Resource
< Name when that has been developed by the IETF working group.
<
< New schemes may be registered at a later time.
<
218,223c598,603
< file system of the given host. The FTP protocol is used, as defined
< in RFC957 or any successor. The port number, if present, gives the
< port of the FTP server if not the FTP default. (A client may in
< practice use local file access to retrieve objects which are
< available though more efficient means such as local file open or
< NFS mounting, where this is available and equivalent).
---
> file system of the given host. The FTP protocol is used. The port
> number if given gives the port of the FTP server if not the FTP
> default. (A client may in practice use local file access to
> retrieve objects which are available though more efficient means
> such as local file open or NFS mounting, where this is available
> and equivalent).
225,232c605
< User name and password
<
< The syntax allows for the inclusion of a user name and even a
<
<
<
< Berners-Lee 4
<
---
> The syntax allows for the inclusion of a user name and even a
236,237c609
< is "anonymous" and the password the user's Internet-style mail
< address .
---
> is "anonymous" and the password the user's mail address.
239,242c611,620
< Where possible, this mail address should correspond to a usable
< mail address for the user, and preferably give a DNS host name
< which resolves to the IP address of the client. Note that servers
< currently vary in their treatment of the anonymous password.
---
> The adoption of a unix-style syntax involves the conversion into
> non-unix local forms by either the client or server. Some non-unix
> servers do this, but clients wishing to access sites which do not
> have unix-style naming will need certain algorithms to enable
> other file systems to be identified and treated. Client software
> may also have to be flexible in terms of the sequence of FTP
> commands used with different varieties of server. In view of a
> tendency for file systems to look increasingly similar, it was felt
> that the URL convention should not be weighed down by extra
> mechanisms for identifying these cases.
244,296d621
< Path
<
< The FTP protocol allows for a sequence of CWD commands (change
< working directory) prior to a RETR (retrieve) which actually
< accesses a file. The arguments of any CWD commands are successive
< segment parts of the URL, and the filename argument to the RETR
< command is the final segment of the URL path.
<
< Note
<
< In the case in which the file system of the server is known or
< guessed by the client, the path may possibly converted into a
< filename. This may (in some cases) allow the file to be retrieved
< in one RETR command with no CWD command. In the case of unix, the
< filename will in fact look the same as the URI path. This must NOT
< be taken to indicate that the URL is a unix filename. In
< practice, as many FTP servers in fact have or emulate unix file
< systems, it may in fact be time-efficient to attempt first a direct
< retrieval guessing unix syntax, and, if that fails, to attempt the
< official sequence of succession of directory changes followed by a
< RETR command.
<
< There is no common hierarchical model to the FTP protocol, so if a
< directory change command has been given, it is impossible in
< general to deduce what sequence should be given to navigate to
< another directory for a second retrieval, if the paths are
< different. The only reliable algorithm is to disconnect and
< reestablish the control connection. However, if no directory
< changes have been made, but direct retrieval has been done, then
< the control connection may be kept. Another possible
< uninvestigated method is to use CDUP on the trial assumption of a
< hierarchical structure to return a point in common between the
< first and second URLs.
<
< (This note previously read: "The adoption of a unix-style syntax
< involves the conversion into non-unix local forms by either the
< client or server. Some non-unix servers do this, but clients
< wishing to access sites which do not have unix-style naming will
< need certain algorithms to enable other file systems to be
< identified and treated. Client software may also have to be
< flexible in terms of the sequence of FTP commands used with
< different varieties of server. In view of a tendency for file
<
<
<
< Berners-Lee 5
<
< systems to look increasingly similar, it was felt that the URL
< convention should not be weighed down by extra mechanisms for
< identifying these cases." )
<
< Data type
<
303c628
< but it is outside the scope of this paper.
---
> but it outside the scope of this paper.
305,328c630
< An FTP URL may specify the method by which an object is to be
< retrieved. Two of the modes correspond to the FTP "Data Types"
< ASCII and IMAGE for the retrieval of a document, as specified in
< FTP by the TYPE command. One mode indicates directory access.
<
< The data type is specified by a suffix to the URL separated by an
< unencoded exclamation mark (ASCII 21 hex). Possible suffixes are:
<
< !I Use FTP image (I) mode to perform data
< transfer.
<
< !A Use FTP ASCII (A) mode to perform data
< transfer
<
< !D Use FTP directory list commands to read
< directory
<
< [suggestion: tenex. reference?]
<
< Transfer Mode
<
< Stream Mode is always used.
<
< HTTP
---
> NEWS
330,343c632,633
< The HTTP protocol specifies that the path is handled transparently
< by those who handle URLs, except for the servers which de-reference
< them. The path is passed by the client to the server with any
< request, but is not otherwise understood by the client. The
< fragmentid part is not sent with the request. The search part, if
< present, is sent. Spaces and control characters in URLs must be
< escaped for transmission in HTTP.
<
< GOPHER
<
< Gopher selector strings may contain any characters other than tab,
< return, or linefeed, so it is important to encode all disallowed
< characters and encode any space characters so these characters are
< not altered during transport of the URL. Note that since gopher
---
> The news locators refer to either news group names or article
> message identifiers which must conform to the rules of RFC 850. A
347c637
< Berners-Lee 6
---
> Berners-Lee 11
349,357c639,642
< selector string are opaque and in many cases map to native file
< system of the gopher server, so encoding of disallowed characters
< in the selector string is to map to binary codes rather than ISO
< character sets. In other words, the "%" character followed by two
< hexadecimal digits is used to encode binary data. Clients shall
< not interpret gopher selector strings. While many Gopher servers
< map to Unix file systems, you cannot assume that "/" characters
< imply a heirarchy since Gopher servers on non-Unix file systems may
< use the "/" as part of a file name.
---
> message identifier may be distinguished from a news group name by
> the presence of the commercial at "@" character. These rules imply
> that within an article, a reference to a news group or to another
> article will be a valid URL (in the partial form).
359,361c644,645
<
<
< The format of a gopher URL is:
---
> A news URL may be dereferenced using NNTP or using any other
> protocol for the conveyance of usenet news articles.
363,508c647
< 1. A single-character field to denote the Gopher type of the
< resource to which the URL refers.
<
< 2. The gopher selector string. Note that some gopher selector
< strings begin with a copy of the gopher type character, in which
< case that character will occur twice consecutively. Also note
< that the gopher selector string may be an empty string since
< this is how gopher clients refer to the top-level directory on
< a gopher server.
<
< 3. An encoded tab character (%09) to seperate the gopher
< selector string from the optional search string (see 4 below).
<
< 4. If the URL does not refer to a Gopher+ item and if there is
< no gopher search string then parts 3, 4, 5, and 6 of the URL
< are optional
<
< 4.) The gopher search string. If the URL refers to a search to
< be submitted to a gopher search engine, the search string is
< required. Otherwise this is an empty string.
<
< 5.) A question mark [suggestion: an encoded tab character
< (%09)] to seperate the gopher search string from the optional
< gopher+ string (see 6 below). [suggestion: Note that if the URL
< refers to a gopher+ item and does not have a gopher search
< string, there will be two encoded tab characters in a row.]
<
< 6.) The Gopher+ string. Gopher+ strings consist of a one or more
< characters and are used to represent information required for
< retrieval of the Gopher+ item. Gopher+ items may have alternate
< views, arbitrary sets of attributes, and may have electronic
< forms associated with them. To accomodate the various Gopher+
< objects, the Gopher+ string in the URL must accomodate a
< mapping of the information a Gopher+ client sends to the server.
< This makes this section a bit long since we basically cover the
< entire Gopher+ protocol here.
<
< When a Gopher server returns a directory listing to a client,
< Gopher+ items are tagged with either a "+" (denoting gopher+ items)
<
<
<
< Berners-Lee 7
<
< or a "?" (denoting items which have a +ASK form associated with
< them). A Gopher+ string which is only a "+" refers to the default
< view (data representation) of the item. To retrieve this item a
< gopher+ client should send
<
< a_gopher_selector<tab>+<cr><lf>
<
< to the gopher+ server.
<
< Note that items which have a +ASK asssociated with them (ie.
< Gopher+ items tagged with a "?") require the client to fetch the < item's +ASK attribute to get the form definition, and then ask the < user to fill out the form and return the user's responces along < with the selector string to retrieve the item. Gopher+ clients < know how to do this but depend on the "?" tag in the gopher+ item < description to know when to handle this case. The "?" is used in < the Gopher+ string to be consistent with Gopher+ protocol's use of < this symbol. < < To refer to the Gopher+ attributes of an item, the Gopher+ string < might consist of "!" or "$". "!" refers to the all of a gopher+ < item's attributes. "$" refers to all the item attributes for all < items in a Gopher directory. To retrieve an item or directory's < attributes, a gopher client will send: < < a_gopher_selector<tab>!<cr><lf> < < for items or < < a_gopher_selector<tab>$<cr><lf> < < for directories to the gopher+ server. < < To refer to specific attributes, the Gopher+ string is < "!attribute_name" or "$attribute_name". For example, to refer to < the attribute containing the abstract of an item, the Gopher+ < string would be "!+ABSTRACT". To refer to several attributes, < clients send the server the attribute names seperated by spaces so < it is neccesary to seperate the attribute names with coded spaces. < To retrieve a collection of item attributes specified with a < gopher+ string of "!+ABSTRACT%20+SMELL" a gopher client would send < < a_gopher_selector<tab>!+ABSTRACT +SMELL<cr><lf> < < to the gopher server. < < Gopher+ allows for optional alternate data representations < (alternate views) of items. To retrieve a Gopher+ alternate view, < the gopher+ client sends the appropriate view and language < identifier (found in the item's +VIEW attribute). To refer to a < specific Gopher+ alternate view, the URL's Gopher+ string would be < in the form "+view_name%20language_name". 
For example, a gopher+ < string of "+application/postscript%20Es_ES" refers to the spanish < < < < Berners-Lee 8 < < language postscript alternate view of a gopher+ item. To retrieve < this alternate view the client would send < < a_gopher_selector<tab>+application/postscript Es_ES<cr><lf> < < to the gopher server. < < The gopher+ string for a URL that refers to an item referenced by < an ASK form filled out with specific values is essentially a coded < version of what the client sends to the server. The gopher+ string < will be of the form < < +%091%0D%0A+-1%0D%0Aask_item1_value%0D%0Aask_item2_value%0D%0A.%0D%0 < A < < To retrieve this item, the gopher client sends: < < a_gopher_selector<tab>+<tab>1<cr><lf> < +-1<cr><lf> < ask_item1_value<cr><lf> < ask_item2_value<cr><lf> < .<cr><lf> < < to the gopher server. < < For a really complex example, consider a URL that refers to an < alternate view of an item that is referenced with a filled-out < Gopher +ASK form. The gopher+ string will be of the form: < < < +view_name%20language_name%091%0D%0A+-1%0D%0Aask_item1_value%0D%0A < ask_item2_value%0D%0A.%0D%0A < < To retrieve this item, the gopher client sends: < < a_gopher_selector<tab>+view_name language_name<tab>1<cr><lf> < +-1<cr><lf> < ask_item1_value<cr><lf> < ask_item2_value<cr><lf> < .<cr><lf> < < to the gopher server. < < Summary: gopher+ string part of Gopher URL --- > Note1: 510,621c649 < < < To refer to an item which has an ASK form associated with it where < the intent is to allow the user to enter values into the form as < part of the retrieval process: < < %3F [was: ?] < < < < < Berners-Lee 9 < < To refer to all or specific attributes of a gopher item: < < ![attribute_name][%20attribute_name][%20attribute_name]... < < < To refer to all or specific attributes of a gopher directory: < < $[attribute_name][%20attribute_name][%20attribute_name]... 
< < < To refer to the content of a gopher+ item (including an item < referred to by specific values in a filled-out ASK form): < < +[view_name[%20language_name]] < [%091%0D%0A+-1%0D%0Aask_item1_value%0D%0Aask_item2_value...%0D%0A. < %0D%0A] < < < < Overall summary and examples < < < The general format of a Gopher URL path refering to a gopher type < "T" item is: < < gopher://host [port]/T[gopher_selector]%09[search_string]?[gopher+_s < tring] < < < Examples: < < An example of a URL pointing to a gopher type 0 item (a document) < is: < < gopher://host [port]/0a_gopher_selector < < < An example of a URL pointing to a gopher type 7 item (a search < engine) where the string foobar is to be submitted to the search < engine is: < < gopher://host [port]/7a_gopher_selector%09foobar < < < An example of a URL pointing to a Gopher+ type 0 item (a document) < is: < < gopher://host [port]/0a_gopher_selector%09%09some_gplus_stuff < < < An example of a URL pointing to a Gopher+ type 0 (document) item's < attribute information is: < < < < < Berners-Lee 10 < < gopher://host [port]/0a_gopher_selector%09%09! < < < An example of a URL pointing to a Gopher+ document's spanish < postscript representation is: < < gopher://host [port]/0a_gopher_selector%09%09+application/postscript < %20Es_ES < < . < < MAILTO < < This allows a URL to specify an RFC822 addr-spec mail address. < Note that use of % , for example as used in forming a gatewayed < mail address, requires conversion to %25 in a URL. < < This semantics may be considered to be that the object referred to < by the mailto: URL is the set of messages sent to or from that < address. There is no algorithm to retrieve this set, but the SMTP < protocol allows messages to be added to it, and any given user may < be aware of a subset of its members. < < NEWS < < The news locators refer to either news group names or article < message identifiers which must conform to the rules for a < Message-Idof RFC 1036 (Horton 1987). 
A message identifier may be < distinguished from a news group name by the presence of the < commercial at "@" character. These rules imply that within an < article, a reference to a news group or to another article will be < a valid URL (in the partial form). < < A news URL may be dereferenced using NNTP (RFC977, Kantor 86) (The < ARTICLE by message-id command ) or using any other protocol for the < conveyance of usenet news articles, or by reference to a body of < news articles already received. < < Note1: < < Among URLs the "news" URLs are anomalous in that they are --- > Among URLs the news: URLs are anomalous in that they are 629,630c657,658 < Note 2: < --- > Note 2: > 634,638d661 < < < < Berners-Lee 11 < 641,643c664,666 < Suggested subject of study in conjunction with NNTP working group. < Further extension possible may be to allow the naming of subject < threads as addressable objects. --- > Suggested subject of study in conjunction with NNTP WG. Further > extension possible may be to allow the naming of subject threads as > addressable objects. 645,646c668,669 < NNTP < --- > NNTP > 650,651c673 < message identifier. In all other cases the "news" scheme should be < used. --- > message identifier. 655d676 < The NNTP protocol must be used. 657,661c678,684 < Note1. < < This form of URL is not of global accessability, as typically NNTP < servers only allow access from local clients. Note that the < article numbers within groups vary from server to server. --- > Note1. > > This form of URL is not of global accessiablity, as typically NNTP > servers only allow access from local clients. This form or URL > should not be quoted outside this local area. It should not be > used within news articles for wider circulation than the one > server. 663,668c686,699 < This form or URL should not be quoted outside this local area. It < should not be used within news articles for wider circulation than < the one server. 
This is a local identifier for a resource which is < often available globally, and so is not recommended except in the < case in which incomplete NNTP implementations on the local server < force its adoption. --- > WAIS > > The current WAIS implementation public domain requires that a > client know the "type" of a object prior to retrieval. This value > is returned along with the internal object identifier in the search > response. It has been encoded into the path part of the URL in > > > > Berners-Lee 12 > > order to make the URL sufficient for the retrieval of the object. > Within the WAIS world, names do not of course not need to be > prefixed by "wais:" (by the partial form rules). 679c710 < version number. If present, the version number is separated from --- > version number. If present, the version number is seperated from 681c712 < zero zero), this being an escaped string terminator (null). --- > zero zero), this being an escaped string terminator (null). 683c714 < access method and are not represented as Prospero URLs. --- > access method and are not represented as Prospero URLs. 684a716,740 > GOPHER > > The first character of the URL path part (after the initial single > slash) is a single-character "type" field which is that used by the > Gopher protocol. The rest of the path is the "selector string", > with disallowed characters encoded. Note that some selector strings > begin with a copy of the gopher type character, in which case that > character will occur twice consecutively in the URL. If the type > character and selector are omitted, the type defaults to "1". > Gopher links which refer to non-Gopher protocols are represented > directly as URLs of the underlying access method and are not > represented as Gopher URLs. > > MAILTO > > This allows a URL to specify an RFC822 addr-spec mail address. > Note that use of % , for example as used in forming a gatewayed > mail address, requires conversion to %25 in a URL. 
> > This semantics may be considered to be that the object referred to > by the mailto: URL is the set of messages sent to or from that > address. There is no algorithm to retrieve this set, but the SMTP > protocol allows messages to be added to it, and any given user may > be aware of a subset of its members. > 691a748,749 > this is a less desirable, though currently common, solution. > 695c753 < Berners-Lee 12 --- > Berners-Lee 13 697c755,762 < this is a less desirable, though currently common, solution. --- > X500 > > The mapping of x500 names onto URLs is not defined here. A decision > is required as to whether "distinguished names" or "user friendly > names" (ufn), or both, should be allowed. If any punctuation > conversions are needed from the adopted x500 representation (such > as the use of slashes between parts of a ufn) they must be defined. > This is a subject for study. 699c764 < WAIS --- > WHOIS 701,707c766,770 < The current WAIS implementation public domain requires that a < client know the "type" of a object prior to retrieval. This value < is returned along with the internal object identifier in the search < response. It has been encoded into the path part of the URL in < order to make the URL sufficient for the retrieval of the object. < Within the WAIS world, names do not of course need to be prefixed < by "wais:" (by the partial form rules). --- > This prefix describes the access using the "whois++" scheme in the > process of definition. The host name part is the same as for other > IP based schemes. The path part can be either a whois handle for a > whois object, or it can be a valid whois query string. This is a > subject for further study. 708a772,775 > NETWORK MANAGEMENT DATABASE > > This is a subject for study. > 712,715c779,785 < conforming URL syntax, using a new prefix. Experimental prefixes < may be used by mutual agreement between parties, and must start < with the characters "x-". 
The scheme name "urn:" is reserved for < the work in progress on a scheme for more persistent names. --- > conforming URL syntax, using a new scheme identifier. Experimental > scheme identifiers may be used by mutual agreement between parties, > and must start with the characters "x-". The scheme name "urn:" is > reserved for the work in progress on a scheme for more persistent > names. Therefore URNs (Names) and URLs (Locators) be > distinguishable. An object which is either a URL or a URN is known > as a URI (Identifier). 731c801 < retrieval by URL, that the client software have provision for being --- > retrieval by URI, that the client software have provision for being 735c805 < BNF for specific URL schemes --- > BNF syntax 737a808,812 > > > > Berners-Lee 14 > 739,742c814,817 < [brackets] indicate optional parts. Spaces are represented by the < word "space", and the vertical line character by "vline". Single < letters stand for single letters. All words of more than one letter < below are entities described somewhere in this description. --- > [brackets] indicate optional parts. Spaces are representated by > the word "space", and the vertical line character by "vline". > Single letters stand for single letters. All words of more than one > letter below are entities described somewhere in this description. 744,745c819,820 < The current IETF URI working group preference is for the < prefixedurl production. (Nov 1993. July 93: url). --- > The current IETF URI working group prefereence is for the > prefiexedurl production. (Nov 1993. July 93: url). 749,754c824 < characters do not appear in any productions and therefore may not < < < < Berners-Lee 13 < --- > characters fo not appear in any productions and therefore may not 769c839 < | mailtoaddress | midaddress | cidaddress --- > | mailtoaddress 778c848 < ftpaddress f t p : / / login / path [ ! 
ftptype ] --- > ftpaddress f t p : / / login / path 786,789d855 < midaddress m i d : addr-spec < < cidaddress c i d : content-identifier < 799a866,870 > > > > Berners-Lee 15 > 808,812d878 < < < < Berners-Lee 14 < 839,840d904 < ftptype A | I | D < 851c915 < path void | segment [ / path ] --- > path void | xpalphas [ / path ] 853,854d916 < segment xpalphas < 862,865d923 < < gtype xalpha < < xalpha alpha | digit | safe | extra | escape 869c927 < Berners-Lee 15 --- > Berners-Lee 16 870a929,932 > gtype xalpha > > xalpha alpha | digit | safe | extra | escape > 885c947 < digit 0 |1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 --- > 0 |1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 889c951 < extra " | ' | ( | ) | : | ; | , | space --- > extra ! | * | " | ' | ( | ) | : | ; | , | space 891,892d952 < reserved ! | * < 910,911d969 < (end of URL BNF) < 920,923c978,980 < A URL-related security threat is that it is sometimes possible to < construct a URL such that an attempt to perform a harmless < idempotent operation such as the retrieval of the object will in < fact cause a possibly damaging remote operation to occur. The --- > The use of URLs containing passwords is clearly unwise. > > Conclusion 927c984,985 < Berners-Lee 16 --- > > Berners-Lee 17 929,938c987,994 < unsafe URL is typically constructed by specifying a port number < other than that reserved for the network protocol in question. The < client unwittingly contacts a server which is in fact running a < different protocol. The content of the URL contains instructions < which when interpreted according to this other protocol cause an < unexpected ooperation. An example has been the use of gopher URLs < to cause a rude message to be sent via a SMTP server. Caution < should be used when using any URL which specifies a port number < other than the default for the protocol, especially when it is a < number within the reserved space. 
--- > A need has been demonstrated, and a number of requirements have > been stated for uniform resource locators (URLs). A scheme has been > proposed which builds on existing conventions to define a syntax > for URLs. This scheme has been in serious use by World-Wide Web > (W3) initiative since 1991. Adoption of the scheme in > correspondence, standards and software will ease the use of > references to on-line information in a flexible way as the coming > information age arrives. 940,948d995 < Care should be taken when URLs contain embedded encoded delimiters < for a given protocol (for example, CR and LF characters for telnet < protocols) that these are not unencoded before transmission. This < would violate the protocol but could be used to simulate an extra < operation or parameter, again causing an unexpected and possible < harmful remote operation to be performed. < < The use of URLs containing passwords is clearly unwise. < 968c1015 < Amsterdam IETF and refined in net discussion. --- > Amsterdam IETF and refined in net discussion. 970,972d1016 < The draft 03 includes changes made at Houston in Nov 93, and on the < net before Seattle March 1994. < 977c1021 < Wrappers for URIs in plain text --- > Fragment-id 979c1023,1027 < This section does not formally form part of the URL specification . --- > This represents a part of, fragment of, or a sub-function within, > an object or object. Its syntax and semantics are defined by the > application responsible for the object, or the specification of the > content type of the object. The only definition here is of the > allowed characters by which it may be represented in a URL. 981c1029,1039 < URIs, including URLs, will ideally be transmitted though protocols --- > The fragment-id follows the URL of the whole object from which it > is separated by a hash sign (#). 
If the fragment-id is void, the > hash sign may be omitted: A void fragment-id with or without the > hash sign means that the URL refers to the whole object. > > While this hook is allowed for identification of fragments, the > question of addressing of parts of objects, or of the grouping of > objects and relationship between contined and containing objects, > is not addressed by this object. > > This object does not address the question of objects which are 985c1043 < Berners-Lee 17 --- > Berners-Lee 18 986a1045,1111 > different versions of a "living" object, nor of expressing the > relationships between different versions and the living object. > > Partial form > > In a certain limited set of cases, generally within a certain > application, it may be useful to pass only a section of the URL. > Within a object whose URL is well defined, the URL of another > object may be given in abbreviated form, where parts of the two > URLs are the same. This allows objects within a group to refer to > each other without requiring the space for a complete reference, > and it incidentally allows the group of objects to be moved > without changing any references. This is not discussed in detail > here, it is only mentioned so that the characters required by the > technique be reserved for that purpose. It must be emphasised that > when a reference is passed in anything other than a well controlled > context, the full form must always be used. > > The partial form relies on a property of the URL syntax that > certain characters ("/") and certain path elements ("..", ".") have > a significance reserved for representing a hierarchical space, and > must be recognised as such by both clients and servers. > > A partial form can be distinguished from a full form in that a full > form must have a colon and that colon must occur before any slash > characters. 
> > The rules for the use of a partial name are: > > If the scheme parts are different, the whole absolute locator > must be given. Otherwise, the scheme is omitted, and: > > If the host and/or port parts are the different, the host, port > name and all the rest of the locator must be given. > > If the access and host parts are the same, then the path may be > given in absolute (fully qualified) or relative form. Within the > path: > > If a leading slash is present, the path is absolute. Otherwise, > a relative path is interpreted as follows: > > The last part of the path of the context locator (anything > following the rightmost slash) is removed, and the given partial > URL appended in its place. > > Within the result, all occurrences of "xxx/../" or "/." are > recursively removed, where xxx, ".." and "." are complete path > elements. > > Note: If a path of the context locator end in slash, partial URLs > will be treated differently to their treatment with respect to the > same path without a slash. Using a trailing slash on a directory > > > > Berners-Lee 19 > > name is not therefore recommended. The signifcance of a trailing > slash may be considered as that of the locator of a file with void > name within that directory. > > Wrappers for URIs in plain text > > This section does not formally form part of the URL specification. > > URIs, including URLs, will ideally be transmitted though protocols 1005,1006c1130,1133 < Yes, Jim, I found it under <ftp://info.cern.ch/pub/www/doc> but < you can probably pick it up from <ftp://ds.internic.net/rfc>. --- > Yes, Jim, I found it under <ftp://info.cern.ch/pub> bu > t > you can probably pick it up from <ftp://ds.internic.ne > t/rfc>. 
1009d1135 < 1022,1024c1148,1150 < December 1991, as updated from time to time, < <ftp://info.cern.ch/pub/www/doc/http-spec.txt < > --- > December 1991, > <ftp://info.cer > n.ch/pub/www/doc/http-spec.txt> 1029a1156,1160 > > > > Berners-Lee 20 > 1040,1047d1170 < < < < Berners-Lee 18 < < Horton (1987) M. Horton, R. Adams, "Standard for < interchange of USENET messages", Internet RFC < 1036 , 12/01/1987. 1062c1185 < transmission of news" , Internet RFC-977, --- > transmission of news", Internet RFC-977, 1066,1068d1188 < Kunze, 1994 J. Kunze, Requirements for URLs, to be < published. < 1092,1094d1211 < Sollins 1994 K. Sollins and L. Masinter, Requiremnets for < URNs, to be published. < 1097d1213 < Performance Systems International, Inc. 1101c1217 < Berners-Lee 19 --- > Berners-Lee 21 1102a1219 > Performance Systems International, Inc. 1109,1112c1226,1228 < . < < AUTHOR'S ADDRESS < --- > Author's address > > 1122a1239 > 1126d1242 < 1160c1276 < Berners-Lee 20 --- > Berners-Lee 22
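For readers following the gopher hunks on the "<" side of the diff, here is a minimal sketch in Python of how a gopher URL of the form used in the draft's examples might be assembled. It rests on several assumptions: it uses the %09-separated form shown in the examples (not the "?" variant also mentioned in the text), it writes the port as ":port" (the draft only writes "host [port]"), and the function name gopher_url and its parameters are hypothetical, not taken from the draft.

```python
from urllib.parse import quote


def _encode(part):
    # Percent-encode tab, CR, LF, space and other unsafe characters.
    # Keeping "/", "+", "!" and "$" literal mirrors the draft's examples;
    # that choice is an assumption of this sketch.
    return quote(part, safe="/+!$")


def gopher_url(host, gtype, selector="", search="", gplus="", port=None):
    """Assemble gopher://host[:port]/T<selector>%09<search>%09<gopher+ string>."""
    hostport = host if port is None else f"{host}:{port}"
    path = gtype + _encode(selector)
    if search or gplus:
        # Encoded tab separates the selector from the (possibly empty) search string.
        path += "%09" + _encode(search)
    if gplus:
        # A second encoded tab separates the search string from the gopher+ string.
        path += "%09" + _encode(gplus)
    return f"gopher://{hostport}/{path}"


if __name__ == "__main__":
    print(gopher_url("host", "7", "a_gopher_selector", search="foobar"))
    print(gopher_url("host", "0", "a_gopher_selector",
                     gplus="+application/postscript Es_ES"))
```

With an empty search string and a non-empty gopher+ string, the result contains two encoded tabs in a row, which matches the draft's Gopher+ examples.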