Re: revised "generic syntax" internet draft
"Martin J. Duerst" <mduerst@ifi.unizh.ch> Tue, 22 April 1997 17:27 UTC
Received: from cnri by ietf.org id aa05720; 22 Apr 97 13:27 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa15684; 22 Apr 97 13:27 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id LAA02665 for uri-out; Tue, 22 Apr 1997 11:19:30 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with SMTP id LAA02660 for <uri@services.bunyip.com>; Tue, 22 Apr 1997 11:19:23 -0400 (EDT)
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA05015 (mail destined for uri@services.bunyip.com); Tue, 22 Apr 97 11:19:04 -0400
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <23849-0@josef.ifi.unizh.ch>; Tue, 22 Apr 1997 17:13:43 +0200
Date: Tue, 22 Apr 1997 17:13:42 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Keld J|rn Simonsen <keld@dkuug.dk>
Cc: John C Klensin <klensin@mci.net>, Dan Oscarsson <Dan.Oscarsson@trab.se>, Harald.T.Alvestrand@uninett.no, uri@bunyip.com, fielding@kiwi.ics.uci.edu
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <199704221106.NAA15049@dkuug.dk>
Message-Id: <Pine.SUN.3.96.970422164228.245X-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk
On Tue, 22 Apr 1997, Keld J|rn Simonsen wrote:

> "Martin J. Duerst" writes:
>
> > In particular, the "FORM-UTF8: Yes" I proposed is very similar
> > to your proposal. To be able to label arbitrary "charset"s is
> > an extension, but I don't think it is needed at this stage of
> > ISO 10646 and Internet development. The way I put it usually
> > is that currently, we have "chaos". There is no need to proceed
> > to "labeled chaos" when we can proceed to "order" directly.
> > The Universal Character Set really shows off its strength most
> > directly for short and widely used strings such as URLs.
>
> My "URL-Charset:" header also goes along the "labelled chaos" that
> we already have with HTML,

Yes, it is similar to what we have with HTML. But there are
significant differences in the properties of HTML and URLs that
suggest that using different approaches might be a good idea:

- Length: HTML is much longer than URLs, and tagging is therefore
  less of a burden.
- Length again: HTML can benefit from using different "charset"s
  as a kind of "compression"; this is less of an issue for URLs.
- Round-trip vs. one way: URLs make a round trip from the
  originator and back to it, and they have to arrive there safely.
  HTML travels downstream only, and never needs an exact match
  after many transformations.
- Transcription by paper: URLs are transcribed on paper. Adding
  a charset tag on paper is very clumsy (think about
  http:[us-ascii]//www.ibm.com printed in a newspaper). It may
  look like we don't need that tag, because the characters are all
  identified, yet if we want to use the current URL software, which
  compares URLs using octet identity, we have to transform the
  characters back into the octets that they originated from.

> and then the coding of URLs in
> anchors etc in the HTML markup. The natural thing there is that URLs
> are encoded in the charset of the HTML document.
> So a request
> for the URL would then have a header with the URL and then the
> "URL-charset" of the HTML document. Straightforward. And we could
> use equivalent mechanisms whether the URL was typed in or came from
> a HTML document.

This is indeed true, and part of our proposal. But this solves
only part of the problem, namely the question: What characters
do the octets you are currently manipulating in your computer
actually represent?

So for example if I type an "o" with a "/" on a Mac, it will be
represented as 0xBF internally, and it is (implicitly and naturally)
tagged Mac-Roman. And if I cut-copy-paste that character into
another document, it will keep its identity, but because it is in
a web page editor, it might change its representation, e.g. to 0xF8,
and be (implicitly and naturally) tagged Latin-1. Printed on paper,
it will still be an "o with /", but it doesn't need any tagging
for representation because the tagging is automatically and
implicitly added when it is input again.

The problem here starts when this "o with /" is converted to
%HH, and when it is sent to a server. Now I am no longer
interested in the encoding in which I currently keep the
character (which is usually not too difficult to know), but
in the encoding that the server is assuming the character
will arrive in. And if I don't take special measures, I have
absolutely no idea what that could be.

Now there are several possibilities:

1) Add another tag, this time explicit, that has to be carried
   around *all the time* and separately from the information
   that might be around implicitly. As I said above, this is
   very ugly, and no current software is prepared for it.
   Also, it introduces the problem that the browser (which
   strips the tag and converts) has to know about a large
   number of charsets, more than just those of the pages it
   usually displays and whatever else it otherwise handles.
   Knowledge that could be centralized has to be widely
   distributed.
2) Send the URL as is, with "charset" information. The server
   would get URLs in all kinds of charsets, and would have to
   take care on its own of converting them to the charset it is
   using. Also, we can't freely convert to %HH, because then
   we need to add a tag saying what we used when converting
   to %HH.

3) Define a single encoding (this obviously is UTF-8). This means
   that when you see a URL with beyond-ASCII characters in it,
   you will know that to convert it to %HH and send it to the
   browser, you have to use UTF-8. It's like the tag above,
   except that there is only one possibility, which therefore
   doesn't have to be specified.

4) Have a knowledge database about different protocols/schemes
   and the encoding they use (if they use a single one). This
   makes it very clumsy to write general URL software with a
   nice interface.

5) Have a way to ask the server what charset it accepts. Again,
   this needs a new protocol; the tag, instead of making a round
   trip, is served by the server on demand. This gets difficult
   especially if you have various encodings in various areas
   of the same server. Also, the client needs to know about
   lots of encodings.

> Also the responsibility of handling the character
> encoding incl conversion would be at the server side, which normally
> would be the "offender" allowing strange things like non-ASCII URLs.

Your proposal is probably very close to 2) above. I think it
could probably be deployed for HTTP, but it would put heavier
burdens on the server than UTF-8 would (where the server
just has to know UTF-8 and whatever it wants to use locally).
Also, it would need to add a tag for when we convert to %HH.

Regards, Martin.
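
The "o with /" example and the difference between an untagged %HH
escape and possibility 3) can be sketched as follows. This is only an
illustration, assuming Python's standard codecs "mac_roman" and
"latin-1" as stand-ins for the Mac-Roman and Latin-1 encodings named
in the message; the helper name percent_encode is hypothetical.

```python
from urllib.parse import quote, unquote

# The "o with /" character from the message (U+00F8).
CHAR = "\u00f8"


def percent_encode(text, charset):
    """Convert text to %HH form using the given charset (hypothetical helper)."""
    return quote(text, safe="", encoding=charset)


# The same character yields different octets, and so different %HH forms,
# depending on the charset the client happens to be using.
escaped = {cs: percent_encode(CHAR, cs) for cs in ("mac_roman", "latin-1", "utf-8")}
print(escaped["mac_roman"])  # %BF
print(escaped["latin-1"])    # %F8
print(escaped["utf-8"])      # %C3%B8

# A server that assumes Latin-1 but receives the untagged Mac-Roman escape
# decodes a different character (an inverted question mark, U+00BF).
misread = unquote(escaped["mac_roman"], encoding="latin-1")
print(misread == CHAR)       # False

# With a single agreed encoding (UTF-8), no tag is needed for the round trip.
print(unquote(escaped["utf-8"], encoding="utf-8") == CHAR)  # True
```

The point of the sketch is that the %HH form alone carries no hint
of which charset produced it, so either a tag travels with it or
everyone agrees on one encoding.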