Re: html, http, urls and internationalisation

Keld Jørn Simonsen <keld@dkuug.dk> Wed, 31 January 1996 12:24 UTC

Message-Id: <199601311115.MAA11079@dkuug.dk>
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Keld Jørn Simonsen <keld@dkuug.dk>
Date: Wed, 31 Jan 1996 12:15:39 +0100
In-Reply-To: Larry Masinter <masinter@parc.xerox.com> "Re: html, http, urls and internationalisation" (Jan 31, 9:47)
X-Charset: ISO-8859-1
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: Larry Masinter <masinter@parc.xerox.com>, borka@e5.ijs.si
Subject: Re: html, http, urls and internationalisation
Cc: yergeau@alis.ca, Dan.Oscarsson@malmo.trab.se, maits@dkuug.dk, uri@bunyip.com
X-Orig-Sender: owner-uri@bunyip.com
Precedence: bulk

Larry Masinter writes:

> > What Keld said is sound and could be worked further. The major
> > restriction is the DNS part and this should be kept as it is
> > (character < 127). The same applies to the syntax characters.
> 
> No, "what Keld said" isn't "sound" it is just "sounds nice".

Glad you like the sound effects, Larry!

> Keld said, for example,
> 
> > 1. URLs themselves.
> 
> > These are at an abstract character level: as Larry and François
> > correctly point out, you cannot see what the charset is
> > when you look at a business card or a URL in the newspaper.
> 
> > I propose that any character here be allowed, except for the 
> > URL syntax characters, (things like < / : ) - in the non-DNS
> > part of the URL. Remember these are abstract characters, and
> > there is no binding to for example ISO 10646 in the sense
> > of a character repertoire, or to any encoding (charset).
> 
> However, this nice-sounding proposal contained no solution to the
> following questions:
> 
> 1)how do these abstract characters subsequently get turned
>   into octets that are employed in real protocols in general
>   and http and ftp in particular?
>   (The current URL specification gives an algorithm.)

From glyphs on paper to a computer system, e.g. a browser:
by having the human recognise (aka "read") the characters and enter
them, as is normally done.

From an html doc into an http request: the html doc has a
charset, and the url in the http request is represented in a charset.
So the html string with the URL is converted into the http
charset, and the URL is then sent with octets above 127 encoded
in %xx notation, as the url specifications require. I found no way
of specifying a charset in the current rfcs on URLs.
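
The conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name url_for_request and its parameters are hypothetical, the request charset is passed in by the caller because the URL rfcs give no way to declare it, and only octets above 127 are escaped, as in the description.

```python
def url_for_request(url_octets: bytes, doc_charset: str, request_charset: str) -> str:
    """Hypothetical helper: carry a URL from an html doc into an http request."""
    # Step 1: the octets found in the html doc become abstract characters.
    characters = url_octets.decode(doc_charset)
    # Step 2: the abstract characters are converted into the request charset.
    octets = characters.encode(request_charset)
    # Step 3: octets above 127 are written in %xx notation; the rest pass through.
    return "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in octets)


if __name__ == "__main__":
    # The same URL text, stored as ISO-8859-1 octets in the html doc.
    stored = b"/bl\xe5b\xe6r.html"
    print(url_for_request(stored, "iso-8859-1", "iso-8859-1"))
    print(url_for_request(stored, "iso-8859-1", "utf-8"))
```

The two calls give different escaped forms for the same abstract characters, which is why the receiving end needs to know which charset the request used.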

I did specify the transformations and encodings in earlier mail.
> 
> 2)how does one translate a URL that uses a large character
>   repertoire so that it might be written in a context with 
>   a small repertoire? E.g., a URL with chinese characters
>   in an ASCII email message.
>   (The current URL specification manages this by limiting
>   the repertoire.)

That was also described in the previous mailing. About the html I said:

> >Here it should be possible to write a HTML document in a given
> >charset, and then reference the (abstract) characters in the URL, just
> >like it is possible to write characters in the rest of the HTML document.
> >That is, the normal characters of the document charset can be used,
> >like full iso-8859-1 in normal HTML docs, and full Unicode in 
> >Unicode docs. Also the way of generating out-of-band characters
> >should be allowed in HTML URL strings, like &a-ring and &#xxxx;
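
As a rough illustration of the out-of-band notation quoted above, the sketch below resolves html character references inside a URL string back into abstract characters. It assumes Python's standard html.unescape, which uses the current html entity names (so the a-ring reference is spelled &aring;); the function name url_characters is hypothetical.

```python
from html import unescape


def url_characters(href_value: str) -> str:
    """Hypothetical helper: resolve character references in an html URL string."""
    # Named references (&aring;) and numeric ones (&#230; or &#xE6;) all map
    # to abstract characters, independent of the charset the doc is stored in.
    return unescape(href_value)


if __name__ == "__main__":
    # An ASCII-only html doc can still reference a URL with non-ASCII characters.
    print(url_characters("/bl&aring;b&#230;r.html"))
```

This is how a URL with a large repertoire could be written in a context with a small one: the document stays in ASCII, and the references carry the extra characters.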

> I don't think these problems are unsolvable, but I think in the course
> of making a "sound" proposal you'll find that it starts "sounding"
> less and less like something that you'd want to implement.

I think most of the concerns have been addressed in what I wrote,
but there may still be finer details in it that need to be sharpened,
and it needs to be cast in concrete specs.

I think most of the specs are already there and ready to be employed
in an implementation.

> So, I'll ask again, PLEASE stop cross-posting this discussion to three
> separate mailing lists.

OK, taken ad notam.

Keld