Re: Using UTF-8 for non-ASCII Characters in URLs
Alain LaBont/e'/ <alb@riq.qc.ca> Sat, 17 May 1997 15:47 UTC
Message-Id: <3.0.1.32.19970517105734.006c3860@riq.qc.ca>
Date: Sat, 17 May 1997 10:57:34 -0400
To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>, Dan Oscarsson <Dan.Oscarsson@trab.se>
From: Alain LaBont/e'/ <alb@riq.qc.ca>
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Cc: masinter@parc.xerox.com, Gary.Adams@east.sun.com, uri@bunyip.com
In-Reply-To: <Pine.SUN.3.96.970516224529.6801j-100000@enoshima>
References: <199705020952.LAA10593@valinor.malmo.trab.se>

At 23:10 97-05-16 +0200, Martin J. Duerst wrote:
>On Fri, 2 May 1997, Dan Oscarsson wrote:
>
>> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
>> > > >
>> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
>> > > > no matter what the user sees.
>> > > >
>> > >
>> > > If you use hex-encoding, yes. But NOT if you use the native character set
>> > > of the document. In that case, the 'this-is-the-URL' part must
>> > > use the same character set as the rest of the html document. Raw UTF-8
>> > > may only be used in a UTF-8 encoded html document, not in a iso 8859-1
>> > > encoded document.
>> >
>> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
>> > The document character set for HTML 4.0 and XML will be iso 10646.
>> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
>> will also handle iso 8859-1 encoded documents, otherwise it will break
>> a lot of html pages and software of today.

[Martin]:
>The document character set for HTML, or XML, is not usually identical
>to the encoding ("charset") that is used for transmitting or storing
>the document.
>Please see the reference processing model in RFC 2070 for explanations.
>The use of raw encoding of some type inside a text document encoded
>with some other "charset" is in all cases very ill-advised.
>
>In fully implemented URLs including internationalization, there would
>be three possibilities for transmitting the characters of an URL
>(each of which could be used alternately for characters in the same URL):
>
>1) Character encoded as UTF-8 and then encoded with %HH.
>
>2) Character encoded as numeric character reference: &#nnnn;,
>   where nnnn is the decimal number of the character in
>   ISO 10646/Unicode. [In XML, and possibly also in future
>   versions of HTML, there will be a variant of this,
>   namely &#xhhhh;, where hhhh is the hexadecimal representation
>   of the same character, in the same standards.]
>
>3) Character encoded in the "charset" of the document.
>
>
>Some examples:
>
>a) The letter "w": 1) "%77"; 2) "&#119;" [or "&#x77;"]; 3) "w"
>
>b) The letter u-umlaut: 1) "%C3%BC"; 2) "&#252;" [or "&#xFC;"];
>   3) if the "charset" is iso-8859-1, then an octet 0xFC, not
>   representable here. Alternatively, "&uuml;", available
>   only for certain characters.
>
>
>I guess this could go into Larry's draft more or less directly.
>Of course, we can add advice about preferred representations
>and deployment (for the moment, %HH is more stable than the others,
>except in trivial cases such as the "w" above).
>
>But people will start to type the characters into HTML URLs when
>they see them e.g. in their file system viewers and can type them
>e.g. into their browsers. That will happen just naturally. And
>we better made sure that things work instead of trying to
>rule it out. And they definitely work better with 3) than with
>raw UTF-8 in a document that is not encoded as UTF-8. This has
>been explained by François on his page quite some time ago :-).

I could not agree more with what Martin says. I am very pleased:
it describes reality in a very concise way.

Alain LaBonté
Québec
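
For readers who want to check the examples in Martin's list, here is a minimal Python sketch that produces the three representations for a single character. It is an illustration only, not part of the original message: the function name `url_char_forms` and the dictionary keys are hypothetical, and the default charset of iso-8859-1 is an assumption taken from the u-umlaut example.

```python
from html.entities import codepoint2name


def url_char_forms(ch: str, charset: str = "iso-8859-1") -> dict:
    """Return the representations of one character discussed in the message.

    Hypothetical helper for illustration; `charset` stands in for the
    "charset" of the enclosing HTML document.
    """
    cp = ord(ch)  # ISO 10646/Unicode code point

    # 1) Encode as UTF-8, then write every octet as %HH
    #    (done by hand so that even ASCII letters such as "w" are escaped).
    percent_utf8 = "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))

    # 2) Numeric character reference, decimal and hexadecimal variants.
    ncr_decimal = f"&#{cp};"
    ncr_hex = f"&#x{cp:X};"

    # 3) Raw octets in the document's charset
    #    (raises UnicodeEncodeError if the charset cannot represent `ch`).
    raw_octets = ch.encode(charset)

    # Named entity, available only for certain characters (None otherwise).
    name = codepoint2name.get(cp)
    named_entity = f"&{name};" if name else None

    return {
        "percent_utf8": percent_utf8,
        "ncr_decimal": ncr_decimal,
        "ncr_hex": ncr_hex,
        "raw_octets": raw_octets,
        "named_entity": named_entity,
    }


if __name__ == "__main__":
    for ch in ("w", "\u00fc"):  # "w" and u-umlaut, as in the examples
        print(repr(ch), url_char_forms(ch))
```

For "w" this yields "%77", "&#119;", "&#x77;" and the single octet 0x77; for u-umlaut it yields "%C3%BC", "&#252;", "&#xFC;", the octet 0xFC in iso-8859-1, and the named entity "&uuml;".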