http charset labelling

Keld Jørn Simonsen <keld@dkuug.dk> Wed, 31 January 1996 19:20 UTC

Message-Id: <199601311655.RAA19593@dkuug.dk>
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Keld Jørn Simonsen <keld@dkuug.dk>
Date: Wed, 31 Jan 1996 17:55:50 +0100
X-Charset: ISO-8859-1
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: uri@bunyip.com
Subject: http charset labelling
X-Orig-Sender: owner-uri@bunyip.com
Precedence: bulk

Glenn Adams writes:

>     The problem still
>     exists that not all characters of the world are in ISO 10646.
> 
> Yes, I agree.  That is why I proposed a charset tagging syntax to be
> intrinsic to URLs.

I think there is agreement that we need some charset tagging on URLs.

The problem is how to tag it.

1. I am not sure I understand what Glenn means here. Would
"intrinsic" be in the sense that MIME uses for its headers,
the =?charset?encoding?text?= thing? Something along these lines:
in the first part of the URL, where you have the protocol
specification, userid, password, port number and domain, there could
be a field for a charset. This would then be an extension of the
URL syntax. An example could be after the port number:

      http://www.dkuug.dk:80:utf-8/maits/

2. Another place is at the end of the GET / POST request line in
HTTP. An example:
  
      GET http://www.dkuug.dk/maits/ HTTP/1.1 utf-8

3. Yet another place could be in the headers of the GET request:

      GET http://www.dkuug.dk/maits/ HTTP/1.1
      Url-Charset: utf-8
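
To make the three placements concrete, here is a minimal sketch in
Python that builds each form from the same URL and charset. The
function names are hypothetical, the default port of 80 is an
assumption that only holds for http URLs, and none of these syntaxes
is part of any published URL or HTTP specification; they are only the
proposals listed above.

```python
from urllib.parse import urlsplit


def tag_in_url(url: str, charset: str) -> str:
    """Option 1: charset as an extra field after the port number."""
    parts = urlsplit(url)
    port = parts.port or 80  # assumption: http default when no port is given
    return f"{parts.scheme}://{parts.hostname}:{port}:{charset}{parts.path}"


def tag_on_request_line(url: str, charset: str) -> str:
    """Option 2: charset appended to the end of the request line."""
    return f"GET {url} HTTP/1.1 {charset}"


def tag_in_header(url: str, charset: str) -> str:
    """Option 3: charset carried in a separate Url-Charset header."""
    return f"GET {url} HTTP/1.1\r\nUrl-Charset: {charset}\r\n\r\n"


if __name__ == "__main__":
    url = "http://www.dkuug.dk/maits/"
    print(tag_in_url(url, "utf-8"))
    print(tag_on_request_line(url, "utf-8"))
    print(tag_in_header(url, "utf-8"))
```

Only the first form changes the URL itself; the other two leave the
URL untouched and put the tag in the HTTP request.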

Discussion:

1. is general to all URL usage, so there would be no need
to update protocols. However, a server using HTTP/1.0 would not
understand this notion, and so it would create havoc (I think).
The other thing is that a URL is not the right place to specify a
charset: it should not be necessary to specify the charsets of URLs
in newspapers and on business cards, as we agreed that URLs are
coding-independent information.

2. is HTTP specific. It may cause some HTTP/1.0 servers to
goof, as there is a parameter that they do not expect.

3. should be backwards compatible, as servers may ignore
headers they don't understand (as per the HTTP/1.0 spec),
and they have a good chance of understanding the URL that is there,
possibly in semi-official ISO-8859-1 anyway (URLs are 7-bit,
HTTP is 8-bit ISO-8859-1 by default).
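
As a rough illustration of the compatibility argument, the sketch
below shows a toy server-side routine that decodes the escaped path
of a request. It is an assumption-laden example, not how any real
server works: the function name decode_request_path is hypothetical,
the Url-Charset header is only the proposal from option 3, and the
ISO-8859-1 fallback mirrors the default mentioned above. Unknown
headers are skipped, while a request line with an extra token (as in
option 2) fails the strict three-way split.

```python
from urllib.parse import unquote


def decode_request_path(request_text: str,
                        default_charset: str = "iso-8859-1") -> str:
    """Decode the %-escaped request target, honouring Url-Charset if present."""
    lines = request_text.split("\r\n")
    # Strict request line: exactly method, target and version.
    # A fourth token (option 2) raises ValueError here.
    method, target, version = lines[0].split(" ")
    charset = default_charset
    for line in lines[1:]:
        if not line:  # blank line ends the header section
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "url-charset":
            charset = value.strip()
        # Every other header is ignored, as an older server would do.
    return unquote(target, encoding=charset)


if __name__ == "__main__":
    tagged = "GET /k%C3%B8ge/ HTTP/1.1\r\nUrl-Charset: utf-8\r\n\r\n"
    untagged = "GET /k%F8ge/ HTTP/1.0\r\nX-Unknown: 1\r\n\r\n"
    print(decode_request_path(tagged))
    print(decode_request_path(untagged))
```

A server that does not know the header simply falls back to its
default, so the request still goes through, although a UTF-8 path
would then be misread as ISO-8859-1.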

So basically there is not much difference between 2. and 3.:
both are protocol specific and do not touch URL syntax.
I dislike 1. as it implies writing the encoding in the URL.

keld