Re: Using UTF-8 for non-ASCII Characters in URLs
Dan Oscarsson <Dan.Oscarsson@trab.se> Fri, 02 May 1997 10:25 UTC
Received: from cnri by ietf.org id aa16841; 2 May 97 6:25 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa06814; 2 May 97 6:25 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id FAA22173 for uri-out; Fri, 2 May 1997 05:53:19 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with ESMTP id FAA22168 for <uri@services.bunyip.com>; Fri, 2 May 1997 05:53:12 -0400 (EDT)
Received: from malmo.trab.se (malmo.trab.se [131.115.48.10]) by mocha.bunyip.com (8.8.5/8.8.5) with ESMTP id FAA18521 for <uri@bunyip.com>; Fri, 2 May 1997 05:53:09 -0400 (EDT)
Received: from valinor.malmo.trab.se (valinor.malmo.trab.se [131.115.48.20]) by malmo.trab.se (8.7.5/TRAB-primary-2) with ESMTP id LAA08361; Fri, 2 May 1997 11:52:32 +0200 (MET DST)
Received: by valinor.malmo.trab.se (8.7.5/TRM-1-KLIENT); Fri, 2 May 1997 11:52:32 +0200 (MET DST) (MET)
Date: Fri, 02 May 1997 11:52:32 +0200
From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Message-Id: <199705020952.LAA10593@valinor.malmo.trab.se>
To: masinter@parc.xerox.com, Gary.Adams@east.sun.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Cc: uri@bunyip.com
Mime-Version: 1.0
Content-MD5: YCI3+LHsEXDIkjo5SqdD5Q==
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-uri@bunyip.com
Precedence: bulk
> > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > >
> > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > no matter what the user sees.
> > >
> >
> > If you use hex-encoding, yes. But NOT if you use the native character set
> > of the document. In that case, the 'this-is-the-URL' part must
> > use the same character set as the rest of the html document. Raw UTF-8
> > may only be used in a UTF-8 encoded html document, not in a iso 8859-1
> > encoded document.
>
> The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> The document character set for HTML 4.0 and XML will be iso 10646.

As iso 8859-1 is a true subset of iso 10646, I assume that html 4.0 will
also handle iso 8859-1 encoded documents; otherwise it will break a lot of
today's html pages and software.

> > A large amount of html documents are hand written in a text editor. A user
> > can not be expected to use a different encoding when typing the URLs
> > in a document.
>
> But they might have to use a different encoding when saving the file
> to disk. And the document itself might be converted as it is saved
> to disk. These are common functions in a multibyte plain text editor,
> just as intelligent cut and paste functions are needed in a shared
> desktop environment.
>
> I think your point about "authoring URLs" within HTML documents with
> a "plain text editor" is that the user will have a local input
> method for entering native characters (e.g., compose key sequences,
> virtual keyboard, radical composition, etc.) which will be operating
> in the same manner for document text and for URL characters. Since the
> authoring tools did not offer a means of recording the character encoding
> information, it is not possible for a web server to make a distinction
> when a document is transmitted on the wire.

From another mail:

>>
>> In general, URLs used without a context that defines the characters used,
>> should be encoded using UTF-8. URLs used within a context where the
>> meaning of the characters is defined should use the character encoding
>> of the context.
>
>I'm not sure that it is a good idea to tie the URL encoding
>interpretation to its immediate context. If I attempt to
>"Save" a document from the browser (or a spider agent is gathering
>documents automatically from the web), then the characters of the URL
>are often used to form a local file system name for the fetched object.
>So I fetch a SJIS named file via an http server and save it in my
>~/public_html EUC-jp file system. Using UTF8 on the wire (if
>prearranged) allows both sites to use meaningful names for their local
>resources and to safely share the public handles for the information.

Maybe I was unclear.

Text that is handled on a system normally has a defined character set.
If I cut or copy text, the copied text has a known character set and will
be converted into a new one if it is pasted into a document of a different
character set (on a system that handles different character sets at the
same time). If an editor edits the characters in ISO 10646, it can save
them in a totally different character set by converting the characters to
the other character set.

When I edit an html document with a text editor, it is just text. URLs
embedded in the text are written using the same character set as all the
other text. If I paste a filename from a file listing in another tool,
the filename will end up in the same character set as all other characters
in the text. URLs I write will contain 8-bit characters using the same
character set as the rest of the text.

When I use a web browser, it will fetch html documents containing URLs.
If I click on a link, the browser needs to extract the URL from the text,
translate it into UTF-8 and send it to a web server. If I "Save" a document
I fetched, the filename proposed will be in the character set of my
filesystem. All this holds if the browser is international UTF-8 URL aware.
Otherwise only %XX encoded URLs will work for sure.

UTF-8 should be used on the wire when the protocol says: here is a URL.
If the protocol says: here is a html document, the document need not be
in UTF-8; it may be in iso 8859-1, UCS-2 or UCS-4, and embedded URLs will
be in the same character set. It is a simple matter for a web browser to
extract the embedded URLs and translate them into UTF-8 for the wire. It
is a very heavy burden for a web server to parse every html document and
translate the embedded URLs into UTF-8.

I think it is important that the document text and the text of the URLs
embedded in the document are in the same character set.

If you have a system with SJIS encoded documents and EUC-jp for file names,
I assume that editors on that system know that when you save a document
to a file, the filename will use EUC-jp, and that if you copy a piece of
the SJIS text into the file name dialog field, the text will be converted
from SJIS to EUC-jp. There is no problem, then, with extracting text from
a document and using it for something with a different character set.

Is it clear now that URLs (and file names) typed in a document need to be
in the same character set as the document?

   Dan
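
The browser-side step described in the message above (take a URL embedded
in a document, interpret it in the document's character set, translate it
to UTF-8 and %XX-encode it for the wire) could look roughly like the
following Python sketch. This is only an illustration under stated
assumptions: the names HrefCollector and urls_for_wire, the example.com
URL, and the choice of which delimiters to leave unescaped are all
hypothetical and not taken from the message or from any particular browser.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class HrefCollector(HTMLParser):
    """Collects the HREF value of every <A> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def urls_for_wire(document_bytes, document_charset):
    """Return the document's embedded URLs as hex-encoded UTF-8."""
    # The embedded URLs share the document's character set, so decoding
    # the whole document once also decodes every URL in it.
    text = document_bytes.decode(document_charset)
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    # quote() converts non-ASCII characters to UTF-8 and %XX-escapes the
    # resulting bytes. Assumption: URL delimiters and existing %XX escapes
    # are left untouched via the "safe" set.
    safe = "/:?#[]@!$&'()*+,;=%"
    return [quote(href, safe=safe, encoding="utf-8") for href in collector.hrefs]


# An iso 8859-1 document with 8-bit characters typed directly into the URL.
page = '<A HREF="http://example.com/räksmörgås.html">smörgås</A>'.encode("iso-8859-1")
print(urls_for_wire(page, "iso-8859-1"))
# prints ['http://example.com/r%C3%A4ksm%C3%B6rg%C3%A5s.html']
```

The same document saved as UTF-8 and decoded as UTF-8 yields the same
wire form, which is the point of translating on the client: the server
only ever sees one encoding, whatever the document was written in.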