Re: revised "generic syntax" internet draft

"Martin J. Duerst" <> Mon, 14 April 1997 16:07 UTC

Date: Mon, 14 Apr 1997 17:22:17 +0200
From: "Martin J. Duerst" <>
To: "Roy T. Fielding" <>
Cc: Francois Yergeau <>,
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970414164518.245E-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Sun, 13 Apr 1997, Roy T. Fielding wrote:

> >>The only question that matters is whether or not the draft as it
> >>currently exists is a valid representation of what the existing
> >>practice is
> >
> >The current spec doesn't do that.  Non-ASCII characters are routinely
> >rolled into URLs, yet the spec doesn't define the mapping.  IMHO, the spec
> >is not worthy of becoming a Draft Standard, in fact it doesn't even meet
> >one of the requirements for Proposed Standard (from RFC 2026):
> >
> >   A Proposed Standard should have no known technical omissions
> >   with respect to the requirements placed upon it.
> Draft 04 does not have any such omission.  Non-ASCII characters are not
> allowed in URLs, period.  Any application that transmits a URL in
> non-ASCII characters is declared non-compliant.  There is no grey area.

The draft contains the following text (at the end of 2.4.3.):

   Data corresponding to excluded characters must be escaped in order
   to be properly represented within a URL.  However, there do exist
   some systems that allow characters from the "unwise" and "others"
   sets to be used in URL references (section 3); a robust
   implementation should be prepared to handle those characters when
   it is possible to do so.

I don't know if this is dark grey or light grey, but it is definitely
not black or white.
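
To make the escaping rule in the quoted paragraph concrete, here is a
minimal sketch. It assumes Python's standard urllib.parse, which follows
later URI specifications rather than draft 04, so the exact membership of
the draft's "unwise" and "others" sets is not reproduced; the braces and
vertical bar are used only as illustrative characters, and the function
name escape_component is hypothetical.

```python
from urllib.parse import quote, unquote


def escape_component(data: str) -> str:
    """Percent-escape everything except unreserved characters.

    safe="" means even "/" is escaped, which is what one wants for a
    single path segment or query value (an assumption of this sketch).
    """
    return quote(data, safe="")


# Illustrative input containing characters that must not appear raw.
raw = "a{b}|c"
escaped = escape_component(raw)

# Escaping is reversible, so no information is lost.
restored = unquote(escaped)
```

A receiver that wants to be "robust" in the sense of the draft text could
apply the same escaping to raw characters it finds in a URL reference
before processing it further.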

> >> and what the vendor community agrees is needed in the
> >>future to support interoperability.
> >
> >I'm not aware that the Internet standards process excludes non-vendors.
> It doesn't exclude anybody.  However, it isn't possible to claim
> "rough consensus" for any feature that nobody wants to implement.

As I have shown, implementation is trivial on some systems and
configurations. On others (such as an arbitrary UNIX with Apache),
the structure needed to get started is already there.
And I am sure that with your excellent help, I could write a module that
handled at least the main cases within a few weeks or less. Do you
want to give it a try, or can you suggest somebody else from the
Apache group that I should contact?

> In the Internet standards process, "vendors" means people and organizations
> intending to ship implementations of the specified protocol.
> >>Since it is my opinion that it is NEVER desirable
> >>to show a URL in the unencoded form given in Francois' examples,
> >>you cannot claim to hold anything even remotely like consensus.
> >
> >A bit preposterous, isn't it?  *Your* opinion alone is enough to break any
> >consensus?
> Yes, it is.  That is the difference between "consensus" (what Martin
> was claiming) and "rough consensus".

In the Internet standards process, "consensus" means rough
consensus. I may not always have used the qualifiers, but
I of course meant "rough consensus".

> As it states quite clearly in the draft,
>    These design concerns are not always in alignment.  For example, it
>    is often the case that the most meaningful name for a URL component
>    would require characters which cannot be typed on most keyboards.
>    The ability to transcribe the resource location from one medium to
>    another was considered more important than having its URL consist
>    of the most meaningful of components.  In local and regional
>    contexts and with improving technology, users might benefit from
>    being able to use a wider range of characters.  However, such use
>    is not guaranteed to work, and should therefore be avoided.
> Your comments have done nothing to change the conclusions already
> represented within the draft.

The comments in the recent discussion have added two main new things:

- The use of a wider range of characters may easily be possible
	without any serious impact on transcribability (because
	we have to consider the actual users of a URL, not the
	"potential" users).
- The sentence "is not guaranteed to work" is less and less true.
	The main obstacle is the lack of agreement on a
	recommended character encoding.
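
The second point can be illustrated with a short sketch. It assumes
Python's standard urllib.parse and uses "café" purely as an example name;
ISO-8859-1 and UTF-8 stand in for any two encodings that a sender and a
receiver might pick independently.

```python
from urllib.parse import quote, unquote

name = "caf\u00e9"  # "café": one non-ASCII character, e with acute accent

# The same character escapes to different octets under different encodings.
as_latin1 = quote(name, encoding="iso-8859-1")  # one octet for the e-acute
as_utf8 = quote(name, encoding="utf-8")         # two octets for the e-acute

# A receiver that guesses the wrong encoding gets a different string back.
misread = unquote(as_utf8, encoding="iso-8859-1")

# A receiver that uses the sender's encoding recovers the original name.
reread = unquote(as_utf8, encoding="utf-8")
```

The escaped forms are both plain ASCII and both transcribable. What is
missing is only the agreement on which octets a given character maps to.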

> >But these three also transmit documents in the charset that is found in the
> >document (transparency, no transcoding), yet you claimed loudly in the HTTP
> >WG that they somehow defaulted to ISO 8859-1, and insisted strongly that
> >this fictitious default charset remain in the HTTP/1.1 spec.

> They do default to ISO-8859-1.  Look at the Apache code.  Look at the
> NCSA code.  Look at the CERN code.  As one of the Apache developers,
> I can say unequivocally that any text/* response lacking an explicit
> charset means that the server intends the file to be treated as
> charset=iso-8859-1.  For example, that is how Apache determines which
> variant is best when negotiating on Accept-Charset.

That's a default for server-side configuration. Apache is free to
introduce whatever defaults it finds convenient for its users or
for its implementation. It's a local issue. As somebody experienced
with internationalization and character coding issues, I would
suggest considering having no default at all, so that webmasters are
forced to think explicitly about what data they serve and wrong
tagging is avoided. But this is not the issue here.

HTTP is dealing with what's going on on the wire. The "default"
of iso-8859-1 for text/* documents *on the wire* is not supported
by actual practice.
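
A small sketch of what goes wrong when a receiver applies that default to
an untagged document. It assumes a document stored in EUC-JP and served
without a charset parameter; the function name decode_body and the sample
text are hypothetical and chosen only for illustration.

```python
from typing import Optional


def decode_body(body: bytes, declared_charset: Optional[str] = None,
                default: str = "iso-8859-1") -> str:
    """Decode a text/* body, falling back to a default when untagged."""
    return body.decode(declared_charset or default)


text = "\u65e5\u672c\u8a9e"          # three kanji, "Japanese language"
wire_bytes = text.encode("euc_jp")   # what an untranscoding server sends

# No charset on the wire: the receiver applies the iso-8859-1 default.
assumed = decode_body(wire_bytes)

# Correctly tagged: the receiver recovers the original text.
correct = decode_body(wire_bytes, "euc_jp")
```

Decoding as iso-8859-1 never fails, because every octet maps to some
character, so the receiver gets no hint that the result is wrong.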

Regards,	Martin.