Re: Using UTF-8 for non-ASCII Characters in URLs
Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com> Wed, 30 April 1997 13:10 UTC
Date: Wed, 30 Apr 1997 08:37:40 -0400
From: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>
Message-Id: <199704301237.IAA25725@zeppo.East.Sun.COM>
To: Dan.Oscarsson@trab.se, masinter@parc.xerox.com
Cc: uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Sender: owner-uri@bunyip.com
Precedence: bulk
> From: Dan Oscarsson <Dan.Oscarsson@trab.se>
...
> >
> > Dan, for each item in a directory listing, there are two entries.
> >
> > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> >
> > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > no matter what the user sees.
> >
>
> If you use hex-encoding, yes. But NOT if you use the native character set
> of the document. In that case, the 'this-is-the-URL' part must
> use the same character set as the rest of the html document. Raw UTF-8
> may only be used in a UTF-8 encoded html document, not in a iso 8859-1
> encoded document.

The document character set for HTML 2.0 and 3.2 was iso 8859-1.
The document character set for HTML 4.0 and XML will be iso 10646.
From what little I know about SGML, the document must be converted
to a single document character set before the SGML parser is
allowed to operate on the markup.

  http://www.w3.org/pub/WWW/MarkUp/Cougar/
  http://www.w3.org/pub/WWW/TR/WD-xml-961114.html#sec2.2
  http://www.w3.org/pub/WWW/TR/WD-xml-961114.html#sec4.2.3

If I use a multilingual text editor to create my *ML documents and
"paste" a raw UTF8 url into the href field, the editor either
'negotiates for the encoding information' from the desktop clipboard
service or it assumes the sending application is using the same
encoding that it needs. So when I cut the EUC-jp URL from my browser
"location" window and paste it into my editor it may just assume
the bits are iso8859-1 characters.

For experimenting with combined document authoring/browsing functions
the "w3 for emacs" browser and the "psgml-mode" editor in the
Xemacs 20.0 (with MULE support) provide a good platform for
experimentation.

  http://www.xemacs.org/faq/xemacs-faq.html#internationalization

>
> A large amount of html documents are hand written in a text editor. A user
> can not be expected to use a different encoding when typing the URLs
> in a document.
But they might have to use a different encoding when saving the file
to disk. And the document itself might be converted as it is saved to
disk. These are common functions in a multibyte plain text editor, just
as intelligent cut and paste functions are needed in a shared desktop
environment.

I think your point about "authoring URLs" within HTML documents with
a "plain text editor" is that the user will have a local input method
for entering native characters (e.g., compose key sequences, virtual
keyboard, radical composition, etc.) which will be operating in the
same manner for document text and for URL characters.

Since the authoring tools did not offer a means of recording the
character encoding information, it is not possible for a web server to
make a distinction when a document is transmitted on the wire.

\ /gra
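
To make the two cases in this message concrete, here is a minimal Python
sketch. It shows a hex-encoded-UTF8 form of a URL path, and what happens
when EUC-jp bytes are pasted into a tool that assumes iso 8859-1. The
sample path, the sample text and both helper names are hypothetical and
do not come from the thread; it illustrates the problem and is not a
recommended implementation.

```python
from urllib.parse import quote, unquote


def hex_encode_utf8(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a URL path (hypothetical helper)."""
    return quote(path, safe="/", encoding="utf-8")


def misread_as_latin1(text: str, actual_encoding: str) -> str:
    """Return what a tool displays if it assumes iso 8859-1 for bytes
    that were really produced in `actual_encoding` (hypothetical helper)."""
    return text.encode(actual_encoding).decode("iso-8859-1")


# Hypothetical path containing a non-ASCII character.
path = "/docs/résumé.html"

# The hex-encoded form is pure ASCII, so it is the same string whatever
# character set the surrounding html document is stored in.
encoded = hex_encode_utf8(path)
print(encoded)  # /docs/r%C3%A9sum%C3%A9.html
round_trip = unquote(encoded, encoding="utf-8")

# The raw (unencoded) form depends on the document's character set:
# the same characters give different bytes on the wire.
raw_latin1 = path.encode("iso-8859-1")
raw_utf8 = path.encode("utf-8")

# EUC-jp bytes pasted into an editor that assumes iso 8859-1 come out
# as unrelated Latin-1 characters, one per byte.
pasted = misread_as_latin1("日本", "euc_jp")
print(pasted)
```

Without a recorded encoding label, nothing in the bytes of `raw_latin1`
or `pasted` tells a receiver which interpretation was intended, which is
the distinction the server cannot make.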