Re: [ietf-types] The application/www-form-urlencoded format

"Stephen D. Williams" <sdw@lig.net> Sun, 26 September 2010 19:38 UTC

Message-ID: <4C9FA0BB.8090106@lig.net>
Date: Sun, 26 Sep 2010 12:36:27 -0700
From: "Stephen D. Williams" <sdw@lig.net>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: ietf-types@ietf.org
References: <k1os96p03o78p78490hei104biadpiepit@hive.bjoern.hoehrmann.de> <op.vjmuz10364w2qv@anne-van-kesterens-macbook-pro.local>
In-Reply-To: <op.vjmuz10364w2qv@anne-van-kesterens-macbook-pro.local>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Subject: Re: [ietf-types] The application/www-form-urlencoded format

  On 9/26/10 2:22 AM, Anne van Kesteren wrote:
> On Sat, 25 Sep 2010 23:14:39 +0200, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>>   http://tools.ietf.org/html/draft-hoehrmann-urlencoded -- the draft de-
>> scribes the application/www-form-urlencoded format, a variant of the
>> application/x-www-form-urlencoded format first described in RFC 1866.

The draft needs to explain what the character encoding errors mentioned at the end of section 5 are, at least for 2 cases.
>
> I think it is unfortunate it still allows encoding in various ways. So while things could be more readable as you pointed out in 
> the past user agents are still allowed to obscure most everything.

Only what has to be encoded should be encoded.  Any signature canonicalization requires this anyway.  A more restricted 
situation could exist, although I don't see one for form submission.
> ...
>>
>> The draft probably still has some rough edges in the prose but the
>> format is not going to change. I believe it addresses the feedback I
>> got since the first draft published four years ago; public feedback at
>> http://lists.w3.org/Archives/Public/www-archive/2006Sep/thread.html#msg30
>
> The bug about + seems to be still be there. Escapes are first decoded and then + is replaced with U+0020. Also 
> application/x-www-form-urlencoded is on its way of being standardized as part of HTML5 now.

I was just debugging an OAuth implementation (a customized RFC 5849) that has this kind of mistake; it also used the older set of 
unreserved characters rather than the one defined in RFC 3986:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

Some specs allowed more:
RFC 2396, Section 2.3:
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Java's URLEncoder seems broken: its safe set is almost that of RFC 3986, but it leaves '*' unencoded instead of '~':
http://download.oracle.com/javase/1.4.2/docs/api/java/net/URLEncoder.html
  * The special characters ".", "-", "*", and "_" remain the same.
  * The space character " " is converted into a plus sign "+".
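
To make the differences concrete, here is a minimal sketch that compares the three sets of characters left unescaped, as listed above. The helper java_to_rfc3986 is a hypothetical name of mine, not part of any library; it assumes its input is URLEncoder-style output with uppercase hex escapes.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Characters each scheme leaves unescaped, as listed above.
RFC3986_UNRESERVED = ALNUM | set("-._~")
RFC2396_UNRESERVED = ALNUM | set("-_.!~*'()")
JAVA_URLENCODER_SAFE = ALNUM | set(".-*_")


def java_to_rfc3986(encoded: str) -> str:
    """Hypothetical helper: rewrite URLEncoder-style output into RFC 3986 form."""
    # A literal '+' is already "%2B" in such output, so every remaining '+'
    # stands for a space. '*' was left bare and '~' was escaped.
    return encoded.replace("+", "%20").replace("*", "%2A").replace("%7E", "~")


extra_in_2396 = sorted(RFC2396_UNRESERVED - RFC3986_UNRESERVED)
java_only = sorted(JAVA_URLENCODER_SAFE - RFC3986_UNRESERVED)
rfc3986_only = sorted(RFC3986_UNRESERVED - JAVA_URLENCODER_SAFE)

print(extra_in_2396, java_only, rfc3986_only)
```

The three replacements do not interfere with each other, because none of them produces a '+', a '*', or the sequence "%7E".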

For canonicalization purposes, RFC 5849 says in section 3.6:
This method is different from the encoding scheme used by the
"application/x-www-form-urlencoded" content-type (for example, it
encodes space characters as "%20" and not using the "+" character).
It MAY be different from the percent-encoding functions provided by
web-development frameworks (e.g., encode different characters, use
lowercase hexadecimal characters).

They use uppercase hex and ' '->'%20' rather than the option of ' '->'+'.
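
As an illustration only, the encoding described here (RFC 3986 unreserved characters left alone, everything else escaped per UTF-8 byte with uppercase hex, space as "%20") could be sketched as follows. The function name oauth_percent_encode is my own, and this is a sketch under those assumptions rather than a reference implementation.

```python
# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def oauth_percent_encode(text: str) -> str:
    """Escape every UTF-8 byte that is not an unreserved character."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Uppercase hex; space becomes "%20", never "+".
            out.append("%{:02X}".format(byte))
    return "".join(out)


print(oauth_percent_encode("a b+c~*"))
```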

If ' '->'+' is to be used, it has to be done after percent encoding, with space itself left unescaped by that first pass, so that a 
literal '+' is already encoded as '%2B' before any space becomes '+'.  Decoding has to run in the reverse order: '+' becomes space 
first, then the percent escapes are decoded.  At this point, however, most people can read "%20" as a space almost as fast as '+'.  
On the other hand, '+' is more compact and is still more readable.  A spec should deterministically choose one or the other so that 
signatures can be computed without further agreement.  Signing a form is a common need.
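
A minimal sketch of that ordering, assuming the RFC 3986 unreserved set and UTF-8. The names form_encode, form_decode and form_decode_buggy are hypothetical; the last one shows the mistake of decoding escapes before handling '+'.

```python
from urllib.parse import unquote

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def form_encode(text: str) -> str:
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED or ch == " ":
            out.append(ch)  # space is kept for the second step
        else:
            out.append("%{:02X}".format(byte))  # a literal '+' becomes %2B here
    # Second step: only now does space become '+'.
    return "".join(out).replace(" ", "+")


def form_decode(encoded: str) -> str:
    # Reverse order: '+' to space first, then percent escapes.
    return unquote(encoded.replace("+", " "))


def form_decode_buggy(encoded: str) -> str:
    # Wrong order: %2B is decoded to '+', which is then turned into a space.
    return unquote(encoded).replace("+", " ")


encoded = form_encode("1 + 1")
print(encoded, repr(form_decode(encoded)), repr(form_decode_buggy(encoded)))
```

With the input "1 + 1", the correct decoder round-trips the value while the buggy one loses the plus sign.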

While these specs have different purposes, it is often the case (as with OAuth) that key/value pairs show up in HTTP headers, in 
POST bodies that may be www-form-urlencoded or something like XML, and in the URL as a query string, while also being canonicalized 
for an HMAC signature.  Unifying the operations reduces confusion and coding effort.
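
To show why a single deterministic encoding matters for signing, here is a rough sketch that canonicalizes key/value pairs and computes an HMAC over the result. It is not the RFC 5849 signature base string; the function names, the sort-then-join layout and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import hmac

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text: str) -> str:
    return "".join(
        chr(b) if chr(b) in UNRESERVED else "%{:02X}".format(b)
        for b in text.encode("utf-8")
    )


def canonical_pairs(pairs) -> str:
    # Encode first, then sort, so the order never depends on the raw input.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in pairs)
    return "&".join(k + "=" + v for k, v in encoded)


def sign(pairs, key: bytes) -> str:
    message = canonical_pairs(pairs).encode("ascii")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


pairs = [("b", "2 3"), ("a", "x+y")]
print(canonical_pairs(pairs))
print(sign(pairs, b"secret") == sign(list(reversed(pairs)), b"secret"))
```

Both sides get the same signature only if they agree on every encoding choice, which is the argument for fixing ' '->'%20' or ' '->'+' in the spec.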

>
>
>>       Note: The media type does not have a 'charset' parameter, it
>>       is incorrect specify one and to associate any significance to
>>       it if specified. The character encoding is always UTF-8. The
>>       Unicode encoding form signature is not supported; a leading
>>       U+FEFF character will be considered part of a <name>.
>
> Most other such formats ignore a leading U+FEFF.

U+FEFF, the BOM, is only needed where the encoding can be something other than UTF-8.  If there is no choice, it isn't needed and 
probably shouldn't be present unless the situation might change.  If UTF-16 can be used, then it is needed.
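
For illustration, this small sketch shows the two behaviours being discussed for a UTF-8 payload that starts with a BOM: a plain UTF-8 decode keeps U+FEFF as part of the first name, while Python's "utf-8-sig" codec drops it. The payload is made up for the example.

```python
import codecs

payload = "name=value".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # b"\xef\xbb\xbf" followed by the data

# Plain UTF-8: the BOM survives as U+FEFF at the start of the first name.
kept = with_bom.decode("utf-8")
first_name_kept = kept.split("=")[0]

# "utf-8-sig": a leading BOM is ignored.
stripped = with_bom.decode("utf-8-sig")

print(repr(first_name_kept), repr(stripped))
```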

>
>
>> [1] The regular expression to match both names is slightly simpler if
>>     they only differ in the "x-", and it seems fitting to standardize
>>     an "x-" type by removing the "x-", but if there is a good argument
>>     why using similar names is a bad idea, I am also quite open to name
>>     it application/name-value-pairs or something like that instead.
>
> Well, application/x-www-form-urlencoded is not going away anytime soon. It is what you get for <form> by default. That some 
> consider it non-standard is just semantics. Having said that I do not mind dropping the x- for an improved version of that format.
>
>

sdw
