Re: I18N Concensus - Generic Syntax Document

"Roy T. Fielding" <fielding@kiwi.ics.uci.edu> Fri, 07 March 1997 10:06 UTC

To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
Cc: URI List <uri@bunyip.com>
Subject: Re: I18N Concensus - Generic Syntax Document
In-Reply-To: Your message of "Thu, 06 Mar 1997 20:40:08 +0100." <Pine.SUN.3.95q.970306203216.245a-100000@enoshima>
Date: Fri, 07 Mar 1997 01:37:25 -0800
From: "Roy T. Fielding" <fielding@kiwi.ics.uci.edu>
Message-Id: <9703070137.aa29868@paris.ics.uci.edu>
Sender: owner-uri@bunyip.com
Precedence: bulk

>+ It is recommended that UTF-8 [RFC 2044] be used to represent characters
>+ with octets in URLs, wherever possible.
>
>+ For schemes where no single character->octet encoding is specified,
>+ a gradual transition to UTF-8 can be made by servers making resources
>+ available with UTF-8 names on their own, on a per-server or a
>+ per-resource basis. Schemes and mechanisms that use a well-
>+ defined character->octet encoding which is however not UTF-8 should
>+ define the mapping between this encoding and UTF-8, because generic
>+ URL software is unlikely to be aware of and to be able to handle
>+ such specific conventions.

Here is where you lose me.  I have no desire to add a UTF-8 character
mapping table to our server.  An HTTP server doesn't need one -- its URLs are
either composed by computation (in which case knowing the charset is not
possible) or by derivation from the filesystem (in which case it will use
whatever charset the filesystem uses, and in any case has no way of
determining whether or not that charset is UTF-8).  The server doesn't care
and should not care.  It is therefore inappropriate to suggest that it should
add such a table when doing so would only bloat the server and slow down
the URL<->resource mapping process.
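
A minimal sketch of the point above, in Python. The helper name
escape_octets and the deliberately conservative safe set (ASCII letters,
digits, "-", "_", ".") are assumptions for illustration, not taken from the
draft. It shows a server deriving a URL path segment from a filesystem name
purely by escaping octets, with no charset table involved:

```python
from urllib.parse import unquote_to_bytes

# Assumed safe set for illustration only: ASCII letters, digits, "-", "_", ".".
SAFE_OCTETS = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_."
)


def escape_octets(name: bytes) -> str:
    """Escape a raw filesystem name as %XX without interpreting its charset."""
    return "".join(
        chr(octet) if octet in SAFE_OCTETS else "%{:02X}".format(octet)
        for octet in name
    )


# The same visible name stored under two different filesystem charsets.
latin1_name = b"caf\xe9.html"        # "e acute" as one Latin-1 octet
utf8_name = b"caf\xc3\xa9.html"      # "e acute" as two UTF-8 octets

print(escape_octets(latin1_name))    # caf%E9.html
print(escape_octets(utf8_name))      # caf%C3%A9.html

# Mapping the URL back to the resource is a plain octet round trip.
assert unquote_to_bytes(escape_octets(latin1_name)) == latin1_name
```

The two names escape differently because the function never looks at what
the octets mean; it only maps URL to resource and back.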

>>    Data corresponding to excluded characters must be escaped in order
>>    to be properly represented within a URL.  However, there do exist
>>    some systems that allow characters from the "unwise" and "national"
>>    sets to be used in URL references (section 3); a robust
>>    implementation should be prepared to handle those characters when
>>    it is possible to do so.
>
>Change to:
>
>There exist some systems that allow characters/octets from the
>"unwise" and "others" sets to be used in URL references (section 3).
>Until a uniform representation for characters within URLs is firmly
>established, such practice is not stable with respect to transcoding
>and therefore should be avoided.
>However, robust implementations should be prepared to handle those
>octet values when it is possible to do so.

No thanks -- the existing paragraph is far better.  Transcoding is
not an issue unless they are already violating the specification,
in which case they are prepared to suffer the consequences.
The purpose of the paragraph is to prevent an implementer from
interpreting the spec too literally and crashing on a non-urlc
character.
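
One way a tolerant implementation might behave, sketched in Python. The
function name escape_stray_octets and the URLC set below are assumptions
that only approximate the draft's allowed characters. Rather than failing
on an octet outside the set, it escapes that octet and carries on:

```python
# Assumed approximation of the allowed URL characters, for illustration only.
URLC = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"     # marks
    b";/?:@&=+$,"    # reserved
    b"%#"            # escape and fragment delimiters
)


def escape_stray_octets(raw: bytes) -> str:
    """Return a URL reference with every octet outside URLC escaped as %XX."""
    return "".join(
        chr(octet) if octet in URLC else "%{:02X}".format(octet)
        for octet in raw
    )


# "Unwise" characters and a high-bit octet are escaped, not rejected.
print(escape_stray_octets(b"/a{b}|c"))       # /a%7Bb%7D%7Cc
print(escape_stray_octets(b"/caf\xe9?x=1"))  # /caf%E9?x=1
```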

.....Roy