Re: draft-fielding-url-syntax-05.txt

Chris Newman <> Fri, 02 May 1997 18:32 UTC

Date: Fri, 02 May 1997 11:07:04 -0700
From: Chris Newman <>
Subject: Re: draft-fielding-url-syntax-05.txt
In-reply-to: <>
To: Larry Masinter <>
Cc: IETF URI list <>
Message-id: <>
MIME-version: 1.0
Content-type: TEXT/PLAIN; charset="US-ASCII"
Originator-Info: login-id=chris;
Precedence: bulk

On Fri, 2 May 1997, Larry Masinter wrote:

> 2. URL Characters and Escape Sequences
>    URLs consist of a restricted set of characters, primarily chosen to
>    aid transcribability and usability both in computer systems and in
>    non-computer communications. Characters used conventionally as
>    delimiters around URLs were excluded.  The restricted set of
>    characters consists of digits, letters, and a few graphic symbols
>    chosen from those common to most of the character encodings
>    and input facilities available to Internet users.
>    Within a URL, characters are either used as delimiters, or to
>    represent strings of data (octets) within the delimited portions.
>    Octets are either represented directly by a character (using the
>    US-ASCII character for that octet) or by an escape encoding.  This
>    representation is elaborated below.
> 2.1 URLs and non-ASCII characters   
>    While URLs are sequences of characters and those characters are
>    used (within delimited sections) to represent sequences of octets,
>    in some cases those sequences of octets are used (via a 'charset'
>    or character encoding scheme) to represent sequences of characters:
>    URL char. sequence <-> octet sequence <-> original char. sequence
>    In cases where the original character sequence contains characters
>    that are strictly within the set of characters defined in the
>    US-ASCII character set, the mapping is simple: each original
>    character is translated into the US-ASCII code for it, and
>    subsequently represented either as the same character, or as an
>    escape sequence.
>    In general practice, many different character encoding schemes are
>    used in the second mapping (between sequences of represented
>    characters and sequences of octets) and there is generally no
>    representation in the URL itself of which mapping was used. While
>    there is a strong desire to provide for a general and uniform
>    mapping between more general scripts and URLs, the standard for
>    such use is outside of the scope of this document.

I find this much too wishy-washy.  I think we should explicitly forbid the
use of 8-bit characters and hex-encoded 8-bit characters, except as
defined by the future I18N URL standard.  We need to make it very clear
that programs sending 8-bit URLs over the wire are broken (unless they use
UTF-8 according to the future standard).
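
To make the problem concrete, here is a rough sketch in Python. It is purely
illustrative: the names has_8bit and escape_octets are mine, nothing here comes
from the draft, and escape_octets is simplified to escape only octets of 0x80
and above (a real encoder would also escape reserved and unsafe ASCII
characters). It shows the kind of check I would want a strict implementation
to make, and why the current wording leaves the receiver guessing about the
charset.

```python
import re

# Matches a %HH escape whose value is 0x80-0xFF (first hex digit is 8-F).
_HIGH_ESCAPE = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def has_8bit(url: str) -> bool:
    """True if the URL carries a raw non-ASCII character
    or an escaped octet of 0x80 or above."""
    raw = any(ord(ch) > 0x7F for ch in url)
    escaped = _HIGH_ESCAPE.search(url) is not None
    return raw or escaped


def escape_octets(octets: bytes) -> str:
    """Simplified escaping: octets >= 0x80 become %HH,
    ASCII octets are kept as the corresponding character."""
    return "".join(chr(b) if b < 0x80 else "%%%02X" % b for b in octets)


# The same original character (e with acute accent) yields different URL
# text depending on the charset, and the URL does not say which was used.
latin1_url = escape_octets("\u00e9".encode("latin-1"))  # one octet, 0xE9
utf8_url = escape_octets("\u00e9".encode("utf-8"))      # two octets, 0xC3 0xA9
```

Under the rule I am proposing, has_8bit returning true would mean the URL is
broken unless the future I18N standard says otherwise; a receiver looking at
latin1_url or utf8_url alone has no way to tell which charset produced it.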