Re: [http-state] Ticket 11: Character encoding for non-ASCII cookies values

Adam Barth <> Wed, 03 March 2010 14:44 UTC

From: Adam Barth <>
Date: Wed, 3 Mar 2010 06:44:05 -0800
Message-ID: <>
To: Achim Hoffmann <>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: "Roy T. Fielding" <>, http-state <>
Subject: Re: [http-state] Ticket 11: Character encoding for non-ASCII cookies values

On Wed, Mar 3, 2010 at 6:32 AM, Achim Hoffmann <> wrote:
> Adam Barth wrote on 03.03.2010 06:46:
>> On Tue, Mar 2, 2010 at 5:08 PM, Roy T. Fielding <> wrote:
>>> On Mar 2, 2010, at 4:24 PM, Adam Barth wrote:
>>>> The draft treats the cookie values as opaque octets throughout for use
>>>> on the wire.  I've added a SHOULD-level requirement to use UTF-8 when
>>>> converting the octets to characters (e.g., for use in the user agent's
>>>> user interface).
>>>> Given that the encoding issue doesn't appear to affect
>>>> interoperability on the wire, I think a SHOULD-level recommendation is
>>>> appropriate here.  If specific APIs (e.g., document.cookie) have more
>>>> specific needs, they can add additional requirements.
>>>> Thoughts?
>>> I think that is fine if it is made clear that UTF-8 is only applicable
>>> after the field value is extracted from the rest of the message.  I.e.,
>>> the HTTP parser must be ASCII-based and thus not vulnerable to
>>> invalid Unicode byte sequences.
>> Hopefully that should be clear in the draft.  The encoding is mentioned
>> at the end of the serialization section (which is two sections after
>> the parsing section).
> IIRC previous discussions revealed that some browsers allow arbitrary data
> for the cookie.

That's correct.

> If the draft now (phase 1) recommends a special encoding, i.e. UTF-8, then
> it violates the status quo. Does it?

Not quite.  The draft still allows arbitrary data for the on-the-wire
protocol, which is what I meant when I wrote: "The draft treats the
cookie values as opaque octets throughout for use on the wire."  The
UTF-8 recommendation is for other uses of cookie data, such as display
in the user agent's user interface.
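
To illustrate the distinction, here is a minimal sketch in Python.  The
function names are hypothetical and are not taken from the draft; the
sketch only assumes that the wire value is kept as raw octets and that
UTF-8 decoding happens afterwards, purely for display.

```python
from typing import Tuple


def split_cookie_pair(pair: bytes) -> Tuple[bytes, bytes]:
    """Split a single name=value pair on the first '=' without decoding.

    Hypothetical helper: it works on octets only, so invalid UTF-8 in the
    value cannot affect how the pair is split.
    """
    name, _, value = pair.partition(b"=")
    return name.strip(), value.strip()


def value_for_display(value: bytes) -> str:
    """Decode opaque cookie octets as UTF-8 for a user interface.

    Invalid byte sequences become U+FFFD instead of raising an error.
    The original octets are left untouched for use on the wire.
    """
    return value.decode("utf-8", errors="replace")


name, value = split_cookie_pair(b"greeting=caf\xc3\xa9")
print(name, value, value_for_display(value))
```

The octets in `value` are what would be sent back in the Cookie header;
only the string returned by `value_for_display` involves UTF-8.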

> If a coding like UTF-8 is recommended in phase 2, then this may result in
> a transforming/canonicalisation/best-fit-mapping nightmare again.
> Think of browser APIs (like JavaScript) which might use a different
> encoding (i.e. UCS-2 as in ECMA-262).

We can worry about phase 2 issues in phase 2.

> Do I miss something here? If not, all cookie data (key=value) SHOULD (phase 1)
> or MUST (phase 2) be URL encoded.

No one is saying anything about URL encoding.

> This leaves the final data format open to whatever the application and/or
> the browser wants but is transparent and secure on protocol level.

The server is free to use URL encoding.  In fact, the draft recommends
that the server encode whatever data it wants to store using a
server-selected printable ASCII encoding (URL encoding, base64, etc.).
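
As a rough sketch of what that could look like on the server side, here
are two options in Python using only the standard library.  The helper
names are hypothetical, and choosing UTF-8 as the character encoding
underneath the percent-encoding is an assumption of this example, not
something the draft requires.

```python
import base64
from urllib.parse import quote, unquote


def to_cookie_percent(text: str) -> str:
    """Percent-encode text (as UTF-8) so the cookie value is printable ASCII."""
    return quote(text, safe="")


def from_cookie_percent(value: str) -> str:
    """Reverse to_cookie_percent."""
    return unquote(value)


def to_cookie_base64(data: bytes) -> str:
    """Encode arbitrary octets with URL-safe base64."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def from_cookie_base64(value: str) -> bytes:
    """Reverse to_cookie_base64."""
    return base64.urlsafe_b64decode(value.encode("ascii"))


print(to_cookie_percent("caf\u00e9; x"))
print(to_cookie_base64(b"\xff\x00 arbitrary octets"))
```

Either way, the value that travels in Set-Cookie and Cookie is plain
ASCII, and only the server needs to know which encoding it picked.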