URL syntax: characters and octets
"Martin J. Duerst" <mduerst@ifi.unizh.ch> Thu, 19 December 1996 21:21 UTC
Date: Thu, 19 Dec 1996 20:21:54 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: uri@bunyip.com
Subject: URL syntax: characters and octets
Message-Id: <Pine.SUN.3.95.961219194325.245V-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-uri@bunyip.com
Precedence: bulk
Hello everybody,
In my series of mails regarding the URL syntax draft
(draft-fielding-url-syntax-02.txt) I come to the most
serious issue. As far as I can see, it is the issue that will
determine how early (or late) the draft can be advanced.
The draft writes:
> F.2. Modifications from both RFC 1738 and RFC 1808
>
> Confusion regarding the terms "character encoding", the URL
> "character set", and the escaping of characters with %<hex><hex>
> equivalents has (hopefully) been reduced.
Unfortunately, I have to say that these hopes are not met.
The draft uses the term "character" indiscriminately for both
the characters that represent a URL (on paper or wherever) and
the characters that are (may be!) represented by a URL.
This is very annoying for somebody who understands these things,
and extremely confusing to somebody who does not understand them.
Let's have a look at some examples:
The introduction says:
> Unlike many specifications which use a BNF-like grammar to define the
> bytes (octets) allowed by a protocol, the URL grammar is defined in
> terms of characters. Each literal in the grammar corresponds to the
> character it represents, rather than to the octet encoding of that
> character in any particular coded character set.
One would assume that these are the characters A-Z, a-z, 0-9, and some
more, in particular the "%", as we will find them on that famous napkin.
But later, the draft says:
> The set of characters allowed for use within URLs can be described in
> three categories: reserved, unreserved, and escaped.
>
> urlchar = reserved | unreserved | escaped
and defines "escaped" as:
> 2.3.1. Escaped Encoding
>
> An escaped character is encoded as a character triplet, consisting of
> the percent character "%" followed by the two hexadecimal digits
> representing the character's octet code in an 8-bit coded character
> set. For example, "%20" is the escaped encoding for the space
> character.
>
> escaped = "%" hex hex
One wonders: on my napkin, is "%20" one character, or is it three
characters? The confusion is complete.
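The napkin question can be made concrete with a small sketch in Python. It is only an illustration, not anything taken from the draft: the sample string "a%20b" and the variable names are assumptions chosen for the example. It counts the characters as written, and then the octets obtained by undoing the %HH escaping.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sample: what is literally written on the napkin.
url_text = "a%20b"

# Level of URL characters: "%", "2" and "0" are three separate characters.
url_chars = list(url_text)

# Level of represented octets: "%20" collapses into the single octet 0x20.
octets = unquote_to_bytes(url_text)

print(url_chars)       # ['a', '%', '2', '0', 'b']
print(len(url_chars))  # 5
print(octets)          # b'a b'
print(len(octets))     # 3
```

Both counts are defensible, which is why the text needs to say which level it means each time it uses the word "character".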
I will refrain from giving more unnecessarily confusing examples;
I can only say that the draft is full of them. What are the possible
solutions?
(1) Make a clear distinction between URL characters (these would be
the "%", the "2", and the "0", and not the "%20" as currently)
and represented characters (could also be called encoded
characters, scheme characters, or something else),
which may include SPACE, control characters, dangerous
characters, and so on.
(2) Go back to the terminology of RFC 1738 and speak about *octets*
encoded as characters. This has its advantages, but is rather
too abstract and far from reality (where ASCII==ASCII).
(3) Work with three levels:

        represented characters
                  |
                  v
                octets
                  |
                  v
            URL characters

This looks more complicated at first glance, but is in many cases
closer to reality, and less confusing. It allows different aspects
of the problem to be clearly separated.
(4) Add cautions: URLs do not always represent characters, and/or do
not always represent octets that are encoded directly (with %HH).
A classic example is the data: URL. It encodes raw octets;
it does not ultimately represent characters. But the octets
are not encoded as %HH; they are encoded with BASE64 into
a set of characters/octets that don't need %HH.
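The three levels in (3) and the caution in (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the draft's method: the helper name to_url_chars, the sample character, the two charsets, and the raw octets are all hypothetical and chosen only for the example.

```python
import base64
from urllib.parse import quote_from_bytes, unquote_to_bytes


def to_url_chars(text: str, charset: str) -> str:
    """Hypothetical helper: represented characters -> octets -> URL characters."""
    octets = text.encode(charset)             # represented characters -> octets
    return quote_from_bytes(octets, safe="")  # octets -> URL characters (%HH)


# The same represented character (u with diaeresis) gives different URL
# characters depending on the charset assumed at the octet level.
latin1_url = to_url_chars("\u00fc", "iso-8859-1")
utf8_url = to_url_chars("\u00fc", "utf-8")
print(latin1_url)  # %FC
print(utf8_url)    # %C3%BC

# Going back, the URL characters only yield octets; which characters they
# represent must be known from somewhere else.
octets = unquote_to_bytes(latin1_url)
print(octets)  # b'\xfc'

# Caution (4): raw octets that represent no characters at all, encoded with
# BASE64 rather than with %HH.
raw = bytes([0x00, 0xFF, 0x10])
payload = base64.b64encode(raw).decode("ascii")
print(payload)  # AP8Q
```

Keeping the octet level explicit makes it visible that the charset is a separate choice, and that %HH is only one of several ways of getting octets into URL characters.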
In my opinion, the best solution would combine (1), (3), and (4).
I am willing to rewrite the text once the general direction to solve
these issues is found (and after I come back from vacation over
the holidays :-).
Regards, Martin.
----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
----