[hybi] Framing and design philosophy

Ian Hickson <ian@hixie.ch> Thu, 29 July 2010 23:26 UTC

It has been suggested that I should try to explain the reasoning behind 
the current design of the framing in the WebSocket specification.

As background, the idea behind WebSockets is to provide TCP for scripts 
running in Web browsers. There are several issues with doing that that 
prevent us from literally just using TCP; most are security-related and 
are dealt with by the handshake, which I won't discuss here.

There is one other issue, though, with exposing raw TCP to scripts running 
in Web browsers: TCP provides a stream, whereas the Web browser platform 
operates on an event-based system with discrete units of content. Browsers 
need to know when to notify the script that more content is available.

We could simply expose a stream interface to the scripts, in particular 
with a "read" method that returns all data collected so far. However, this 
is the kind of model that leads to subtle bugs: in testing (on reliable 
networks), one will find packets do not get split and take minimal time, 
making it likely that authors will rely on a "read" method always 
providing a single complete packet of information. In the real world, 
packets get fragmented, packets arrive out of order, packets get delayed, 
and thus a "read" can easily include either only part of a packet, or all 
of a packet plus more data from another.
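
To make the failure mode concrete, here is a minimal sketch in Python. 
It is not from the spec: the newline delimiter and the names 
naive_handler and LineBuffer are assumptions chosen purely for 
illustration. It contrasts code that treats each read as one message 
with code that buffers until a complete message has arrived.

```python
from __future__ import annotations


def naive_handler(chunk: bytes) -> list[bytes]:
    """Buggy assumption: every read() result is exactly one message."""
    return [chunk]


class LineBuffer:
    """Accumulate stream data; emit only complete newline-terminated messages."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list[bytes]:
        self._buf += chunk
        # Everything before the last delimiter is complete; the tail is kept.
        *complete, rest = bytes(self._buf).split(b"\n")
        self._buf = bytearray(rest)
        return complete


# Two messages ("hello", "world") as they might be split up by the network.
reads = [b"hel", b"lo\nwor", b"ld\n"]

naive = [m for r in reads for m in naive_handler(r)]

buf = LineBuffer()
buffered = [m for r in reads for m in buf.feed(r)]
```

On a local test network the three reads would most likely have been two, 
one per message, and the naive version would have appeared to work.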

As part of making the Web platform a reliable experience for authors, we 
have to try to avoid exposing authors to this kind of issue. Web 
developers have repeatedly shown that they are likely to depend on the 
weirdest of quirks; this is not a minor problem. This suggests that we 
should have a market in the stream to indicate when the UA is to notify 
the script that data is available; such a marker would also allow the data 
to be provided in predictable units selected by the server.

If the data is UTF-8, an easy way to demark different events of data is to 
use a 0xFF byte between each event's worth of data.
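
This works because the byte 0xFF never occurs in well-formed UTF-8. A 
minimal Python sketch of the idea follows. The function names are 
hypothetical, and it writes the marker after every event rather than 
strictly between them, which is an assumption made to keep the decoder 
simple.

```python
from __future__ import annotations

DELIM = b"\xff"  # never appears in well-formed UTF-8


def encode_events(events: list[str]) -> bytes:
    """Terminate each event's UTF-8 bytes with the 0xFF marker."""
    return b"".join(e.encode("utf-8") + DELIM for e in events)


def decode_events(stream: bytes) -> tuple[list[str], bytes]:
    """Return the complete events plus any trailing partial event."""
    *complete, rest = stream.split(DELIM)
    return [c.decode("utf-8") for c in complete], rest


wire = encode_events(["héllo", "☺"])
events, leftover = decode_events(wire)

# If the last byte has not arrived yet, the second event is held back.
partial_events, partial_rest = decode_events(wire[:-1])
```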

On its face, this seems fine, but going forward we will need to support 
binary data, and in binary data you can't reserve a particular byte in 
that way; instead we need length-prefixed frames. They can be chunked or 
fixed size or any number of other similar mechanisms, but they are based 
on size and not on delimiters.

Now we could use this for the text data too, but this leads to two 
unrelated problems.

The first is that the browser will need to distinguish text data and 
binary data, because they need to be exposed to scripts differently. This 
is easily solved; simply prefix each event's data with a byte saying what 
kind of data it is.

The second is that if we use length prefixing for text data, it is very 
easy for authors to end up measuring the length of their ASCII strings in 
characters and outputting that as the byte length of the data, which ends 
up leading to very subtle bugs when non-ASCII UTF-8 characters are used. 
This kind of bug is very hard to spot if you don't think anyone is going 
to be trying to mess around with your data, and yet it can lead to very 
drastic security problems: one can effectively smuggle in new frames by 
tricking the server into sending the wrong length.
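
A sketch of how such smuggling could look, in Python. The two-byte 
big-endian length prefix and all function names here are assumptions 
for illustration only; they are not the framing of any WebSocket draft.

```python
from __future__ import annotations

import struct


def frame_buggy(text: str) -> bytes:
    # BUG: len(text) counts characters, not UTF-8 bytes.
    return struct.pack(">H", len(text)) + text.encode("utf-8")


def frame_correct(text: str) -> bytes:
    payload = text.encode("utf-8")
    return struct.pack(">H", len(payload)) + payload


def read_frames(stream: bytes) -> list[bytes]:
    """Split a stream of [2-byte length][payload] frames."""
    frames, pos = [], 0
    while pos + 2 <= len(stream):
        (n,) = struct.unpack_from(">H", stream, pos)
        frames.append(stream[pos + 2 : pos + 2 + n])
        pos += 2 + n
    return frames


# Attacker-supplied text: five two-byte characters followed by what looks
# like a frame header (length 3) and a payload. The five extra bytes that the
# character count misses are exactly the five bytes of the forged frame.
hostile = "ééééé" + "\x00\x03BAD"

seen_buggy = read_frames(frame_buggy(hostile))
seen_correct = read_frames(frame_correct(hostile))
```

With the buggy framer the receiver sees two frames, the second of which 
("BAD") was never sent as a frame by the server. With the correct framer 
it sees one.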

This latter problem is not present when we simply use 0xFF markers to 
delimit frames, because it is much harder to smuggle in a 0xFF byte (you 
can't do it using Web Sockets, you have to use another vector), and 
there's no length to confuse.

Thus, the framing in the current spec. Each frame is prefixed by a byte 
saying whether it's a text frame or a binary frame; text frames are 
marker-delimited, and binary frames are length-based.
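
The shape of such a decoder can be sketched as follows. This is a 
simplified illustration and not the spec's wire format: the type byte 
values (0x00 for text, 0x80 for binary) and the fixed four-byte length 
field are assumptions made for the example.

```python
from __future__ import annotations

import struct

TEXT, BINARY = 0x00, 0x80  # hypothetical type bytes for this sketch


def encode_text(s: str) -> bytes:
    return bytes([TEXT]) + s.encode("utf-8") + b"\xff"


def encode_binary(data: bytes) -> bytes:
    return bytes([BINARY]) + struct.pack(">I", len(data)) + data


def decode_frames(buf: bytes) -> tuple[list, bytes]:
    """Return (complete frames, unconsumed tail). Text -> str, binary -> bytes."""
    frames, pos = [], 0
    while pos < len(buf):
        kind = buf[pos]
        if kind == TEXT:
            end = buf.find(b"\xff", pos + 1)
            if end == -1:
                break  # marker not seen yet; wait for more data
            frames.append(buf[pos + 1 : end].decode("utf-8"))
            pos = end + 1
        elif kind == BINARY:
            if len(buf) - pos < 5:
                break  # length field incomplete
            (n,) = struct.unpack_from(">I", buf, pos + 1)
            if len(buf) - pos - 5 < n:
                break  # payload incomplete
            frames.append(buf[pos + 5 : pos + 5 + n])
            pos += 5 + n
        else:
            raise ValueError(f"unknown frame type 0x{kind:02x}")
    return frames, buf[pos:]


wire = encode_text("hi ☺") + encode_binary(b"\xff\x00\xff") + encode_text("end")
frames, tail = decode_frames(wire)
early_frames, early_tail = decode_frames(wire[:-1])
```

Note that the binary payload is free to contain 0xFF bytes, since the 
decoder never scans it for a marker.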

If we could assume that programmers don't make silly mistakes, we wouldn't 
need all of this: we would just use raw TCP after the handshake and not 
have framing at all. Simply expose the stream. But we can't assume that. 
Programmers make mistakes, and we have to take that into account and 
design around it.

This isn't hypothetical. Consider the postMessage() API. It is designed on 
the assumption that the authors using the API will not make mistakes. It 
is a completely secure API in principle, but it relies on the author 
carefully checking the origin of incoming messages and carefully setting 
the target of outgoing messages. Usage of this API was examined by some 
security researchers at Berkeley:


They studied two systems in particular, Google Friend Connect and Facebook 
Connect. Both were found to have all kinds of security problems. The study 
concludes that these problems are in part to blame on the design of the 
API -- it did not "minimize the liability that the user undertakes to 
ensure application security".

We have to realise, as protocol designers, that this applies to us too. We 
have to design our protocols so that it is harder to implement them 
insecurely than to implement them securely. I don't pretend to think that Web 
Sockets is perfect in this regard; but I think it behooves us to start 
from that point of view.

For example, it would be easy to suggest that we should simplify the 
protocol and not have the marker-delimited frames. However, if we did this 
we would have to find some other way to mitigate the issue with string 
measurements. (I've considered several, but I haven't found one that would 
be as effective as a marker. For example, requiring all strings to have a 
three-byte UTF-8 character at the start would simply get implemented by 
hardcoding those bytes and outputting the length+3, which doesn't help.)

In closing, here are some examples of people writing Web Socket code today 
that are running afoul of these kinds of problems. Here is someone's 
attempt at writing a server using read() to read the whole handshake at 
once into a buffer and then perform simplified parsing on that buffer:


Notice how the code assumes that the handshake isn't fragmented into two 
packets that arrive with a delay.

Here is another example of a server, this one written in JavaScript:


Notice how it assumes the client is trustworthy and happily "eval"s code 
sent from the client without any input verification.

Here is another, written in PHP:


This one uses regular expressions to parse the headers. If we had just one 
key in the handshake ("Sec-WebSocket-Key" instead of "Sec-WebSocket-Key1" 
and "Sec-WebSocket-Key2") and if we didn't have the crazy "count spaces 
and divide by the number of spaces" thing, it would be vulnerable to the 
key being stuffed into the resource name in an XHR request, a 
cross-protocol attack from HTTP to WebSockets.
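
For readers unfamiliar with the "count spaces" step, here is a rough 
Python sketch of the per-key computation. It is written from memory of 
the draft rather than quoted from it, so treat the details as 
assumptions: the digits in the key value are concatenated into a number, 
which must divide evenly by the number of spaces in the value. The 
function name key_number is hypothetical.

```python
def key_number(key: str) -> int:
    """Derive the number carried by one handshake key (sketch, see lead-in)."""
    digits = "".join(ch for ch in key if ch in "0123456789")
    spaces = key.count(" ")
    if not digits or spaces == 0:
        # A value with no spaces cannot be a valid key.
        raise ValueError("key must contain digits and at least one space")
    number = int(digits)
    if number % spaces != 0:
        raise ValueError("digits are not a multiple of the space count")
    return number // spaces


ok = key_number("1 0 0")  # digits "100", two spaces


def rejected(value: str) -> bool:
    try:
        key_number(value)
    except ValueError:
        return True
    return False


# A value with no spaces, such as one stuffed into a resource name, fails.
stuffed = rejected("/chat/100")
```

The spaces are what make this awkward to forge through a request's 
resource name, which cannot carry literal spaces.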

I don't see any way to avoid the first two bugs by protocol design (I'm 
glad the design we have does catch the third). But we need to think about 
this kind of thing when designing our protocol.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'