Re: Using UTF-8 for non-ASCII Characters in URLs
"Martin J. Duerst" <mduerst@ifi.unizh.ch> Fri, 16 May 1997 21:39 UTC
Received: from cnri by ietf.org id aa21626; 16 May 97 17:39 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa26961; 16 May 97 17:37 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id RAA16928 for uri-out; Fri, 16 May 1997 17:11:24 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with ESMTP id RAA16923 for <uri@services.bunyip.com>; Fri, 16 May 1997 17:11:08 -0400 (EDT)
Received: from josef.ifi.unizh.ch (josef.ifi.unizh.ch [130.60.48.10]) by mocha.bunyip.com (8.8.5/8.8.5) with SMTP id RAA08589 for <uri@bunyip.com>; Fri, 16 May 1997 17:10:59 -0400 (EDT)
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <00917-0@josef.ifi.unizh.ch>; Fri, 16 May 1997 23:10:48 +0200
Date: Fri, 16 May 1997 23:10:23 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
cc: masinter@parc.xerox.com, Gary.Adams@east.sun.com, uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <199705020952.LAA10593@valinor.malmo.trab.se>
Message-ID: <Pine.SUN.3.96.970516224529.6801j-100000@enoshima>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk
On Fri, 2 May 1997, Dan Oscarsson wrote:

> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > > >
> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > > no matter what the user sees.
> > > >
> > >
> > > If you use hex-encoding, yes. But NOT if you use the native character set
> > > of the document. In that case, the 'this-is-the-URL' part must
> > > use the same character set as the rest of the html document. Raw UTF-8
> > > may only be used in a UTF-8 encoded html document, not in a iso 8859-1
> > > encoded document.
> >
> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> > The document character set for HTML 4.0 and XML will be iso 10646.
>
> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
> will also handle iso 8859-1 encoded documents, otherwise it will break
> a lot of html pages and software of today.

The document character set for HTML, or XML, is not usually identical
to the encoding ("charset") that is used for transmitting or storing
the document. Please see the reference processing model in RFC 2070
for explanations.

The use of raw encoding of some type inside a text document encoded
with some other "charset" is in all cases very ill-advised.

In fully implemented URLs including internationalization, there
would be three possibilities for transmitting the characters of
an URL (each of which could be used alternately for characters
in the same URL):

1) Character encoded as UTF-8 and then encoded with %HH.
2) Character encoded as numeric character reference: &#nnnn;,
   where nnnn is the decimal number of the character in
   ISO 10646/Unicode. [In XML, and possibly also in future
   versions of HTML, there will be a variant of this, namely
   &#xhhhh;, where hhhh is the hexadecimal representation
   of the same character, in the same standards.]
3) Character encoded in the "charset" of the document.

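The three possibilities above can be sketched in a few lines of Python. This is only an illustration, not part of any draft: the function names (percent_encoded_utf8, numeric_char_ref, raw_in_charset) are made up for this example, and the sketch handles one character at a time.

```python
def percent_encoded_utf8(ch: str) -> str:
    """Possibility 1: encode as UTF-8, then write each octet as %HH."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def numeric_char_ref(ch: str, hexadecimal: bool = False) -> str:
    """Possibility 2: numeric character reference, &#nnnn; or &#xhhhh;."""
    code_point = ord(ch)  # number of the character in ISO 10646/Unicode
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


def raw_in_charset(ch: str, charset: str) -> bytes:
    """Possibility 3: the octets of the character in the document "charset".

    Raises UnicodeEncodeError if the charset cannot represent the character.
    """
    return ch.encode(charset)


if __name__ == "__main__":
    for ch in ("w", "\u00fc"):  # "w" and u-umlaut
        print(
            percent_encoded_utf8(ch),
            numeric_char_ref(ch),
            numeric_char_ref(ch, hexadecimal=True),
            raw_in_charset(ch, "iso-8859-1"),
        )
```

Only possibility 3 depends on the "charset" of the document; the other two use nothing but US-ASCII characters, which is why they survive in any document encoding.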
Some examples:

a) The letter "w":
   1) "%77"; 2) "&#119;" [or "&#x77;"]; 3) "w"

b) The letter u-umlaut:
   1) "%C3%BC"; 2) "&#252;" [or "&#xFC;"]; 3) if the "charset" is
   iso-8859-1, then an octet 0xFC, not representable here.
   Alternatively, "&uuml;", available only for certain characters.

I guess this could go into Larry's draft more or less directly.
Of course, we can add advice about preferred representations and
deployment (for the moment, %HH is more stable than the others,
except in trivial cases such as the "w" above).

But people will start to type the characters into HTML URLs when
they see them e.g. in their file system viewers and can type them
e.g. into their browsers. That will happen just naturally. And we
had better make sure that things work instead of trying to rule it out.
And they definitely work better with 3) than with raw UTF-8 in a
document that is not encoded as UTF-8. This has been explained by
Francois on his page quite some time ago :-).

Regards,	Martin.
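To illustrate why raw UTF-8 fares worse than possibility 3) in a document that is not encoded as UTF-8, here is a small Python sketch. It assumes a receiver that decodes every octet of the document according to the declared "charset"; the helper name read_in_document is hypothetical.

```python
def read_in_document(octets: bytes, document_charset: str) -> str:
    """Decode octets the way a receiver would, using the document charset."""
    return octets.decode(document_charset)


U_UMLAUT = "\u00fc"

# Raw UTF-8 octets placed into a document labelled iso-8859-1:
# the two octets 0xC3 0xBC are read as two separate characters.
misread = read_in_document(U_UMLAUT.encode("utf-8"), "iso-8859-1")

# Possibility 3): the octet 0xFC in the document's own charset
# is read back as the intended single character.
native = read_in_document(U_UMLAUT.encode("iso-8859-1"), "iso-8859-1")

if __name__ == "__main__":
    print(len(misread), len(native))
```

The misread result is two characters (A-tilde followed by the one-quarter sign) instead of one u-umlaut, so a URL transcribed from that document would no longer match the intended one.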