Re: [ietf-types] The application/www-form-urlencoded format

"Stephen D. Williams" <> Mon, 27 September 2010 07:46 UTC

  On 9/26/10 1:32 PM, Bjoern Hoehrmann wrote:
> * Stephen D. Williams wrote:
>>   On 9/26/10 2:22 AM, Anne van Kesteren wrote:
>>> On Sat, 25 Sep 2010 23:14:39 +0200, Bjoern Hoehrmann<>  wrote:
>>>> -- the draft de-
>>>> scribes the application/www-form-urlencoded format, a variant of the
>>>> application/x-www-form-urlencoded format first described in RFC 1866.
>> What the character encoding errors are at the end of section 5 needs to
>> be explained, at least for 2 cases.
> I believe the first encodes a pair of surrogate code points, the second
> a character beyond U+10FFFF, the first is an overlong sequence, then it
> is a truncated sequence, and finally you have an illegal starter with
> ASCII after it. They are all prohibited by the definition of UTF-8. I do
> not think that kind of technical detail would be well-placed there, but
> I could rephrase so it refers to UTF-8 directly instead of calling it
> "character encoding errors" though that's the same thing here.

In practice, say in Java, processing such data with a strict decoder would throw an exception in the string decoding methods.  That's fine, 
but it is not really about the www-form-urlencoded level of format/parsing; it is about the next level down, UTF-8.  When I read that 
section, I expected to see data malformed at the form-urlencoded syntax level: "%HA%##", etc.  You can simply say that data that is not 
valid UTF-8 produces undefined results (e.g., exceptions).  That's true of the format itself.
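
To make the two levels concrete, here is a rough Java sketch. It is only an illustration, not anything taken from the draft: the class and method names (FormDecodeLevels, hasValidEscapes, percentDecode, isValidUtf8) are made up for this message, and the input is assumed to be ASCII-only. The first check looks only at percent-escape syntax; the second decodes the resulting bytes with a strict UTF-8 decoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
class FormDecodeLevels {

    // Level 1: form-urlencoded syntax. Every '%' must be followed by two hex digits.
    static boolean hasValidEscapes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '%') {
                if (i + 2 >= s.length()) {
                    return false; // truncated escape
                }
                if (Character.digit(s.charAt(i + 1), 16) < 0
                        || Character.digit(s.charAt(i + 2), 16) < 0) {
                    return false; // non-hex digit, e.g. "%HA" or "%##"
                }
                i += 2;
            }
        }
        return true;
    }

    // Turns a syntactically valid, ASCII-only string into raw bytes.
    // Callers are assumed to have checked hasValidEscapes first.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Level 2: UTF-8. A strict decoder reports malformed input instead of
    // silently substituting replacement characters.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this, "%HA%##" fails at the first level, while something like "%C0%80" (an overlong sequence) passes the syntax check and only fails once the bytes are decoded as UTF-8, which is the distinction I was expecting section 5 to draw.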