Re: [hybi] Framing Take VI (a compromise proposal)

John Tamplin <jat@google.com> Sat, 14 August 2010 02:47 UTC


On Fri, Aug 13, 2010 at 10:24 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:

> * Ian Fette wrote:
> >> -- having a single opcode to start a fragmented message and separate
> >> opcodes to determine if it is a text or binary message means you can't
> >> start to decode UTF8 text until you receive the entire message, which
> >> means you add a buffering requirement of the undecoded message
>
> The formatting of your mail and its HTML attachment is somewhat broken
> so I am not sure what I am responding to here, but the observation seems
> incorrect; http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for instance
> makes it rather easy to decode the bytes as you receive them, so long as
> they are not delivered out of order.


Dave's original proposal had one opcode to start a fragmented message, and
two separate opcodes to end it -- one for text, and one for binary.  That
meant that until you read the entire message, you didn't know whether it was
going to be UTF8 or not.  That is what the quoted paragraph is referring to
-- you need to know the type of message you are receiving at the first frame
so you can decode it as you receive each fragment.
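
To make that concrete, here is a minimal sketch of decoding a text message
fragment by fragment, assuming the receiver already knows from the first
frame that the payload is UTF8.  The function name decode_text_fragments and
the list-of-bytes input are my own illustration, not anything from the
proposals; the decoder is the incremental UTF-8 codec from Python's standard
library.

```python
import codecs


def decode_text_fragments(fragments):
    """Yield decoded text as each fragment arrives.

    `fragments` is a hypothetical iterable of bytes objects, one per
    frame payload, in the order they were received.
    """
    # The incremental decoder keeps any trailing partial character
    # between calls, so a fragment may end in the middle of a character.
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    for chunk in fragments:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Signal end of message; raises UnicodeDecodeError if the message
    # stopped in the middle of a multi-byte character.
    decoder.decode(b"", final=True)


if __name__ == "__main__":
    # Two multi-byte characters are deliberately split across fragments.
    parts = [b"h\xc3", b"\xa9llo \xe2\x82", b"\xac"]
    for piece in decode_text_fragments(parts):
        print(repr(piece))
```

Only the bytes of an incomplete trailing character are held back between
fragments, so the undecoded message never has to be buffered as a whole.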


> >> - Question: are endpoints likely enough to use UTF16 for internal
> >> representation of text that it would make sense to send the number of
> >> UTF16 characters instead of bytes as the message length or as an
> >> additional field on text frames?
>
> Sending it .instead. is probably not an option as that would encourage
> some implementers to take shortcuts like sending just twice the number
> of bytes if they expect to only ever send US-ASCII. And sending it in
> addition would still mean the number could be wrong, and there are many
> unknowns (length of strings, which code points are in the strings, how
> scripts, in case of web browsers receiving text, use the text, how the
> recipient implements strings, and so on).
>
> The computer I am using right now is an AMD Athlon II X2 215 with some
> very cheap main memory and it can transcode UTF-8 to UTF-16 at about
> 500 KB per millisecond (using the latest version of my decoder, which
> is about the fastest I know of), that's three orders of magnitude
> removed from the computer's Internet connection's bandwidth. I don't see
> a particular indication that knowing the length of the UTF-16 buffer
> in advance would have a noticeable effect on my browsing experience.
>

So imagine you are writing the code to receive a text WebSocket message.
Ultimately, you want to pass some UTF16-based string to the client code.
The total message length in bytes is available, but each UTF8 character of
1-4 bytes will convert to 1-2 UTF16 code units.  So, that means that (given
message length of n bytes from the first frame) you need to allocate
wchar_t[n] (or char in Java, etc) in case each character in the message is
US-ASCII, and possibly waste storage when some non-ASCII characters are
included.  Another alternative is to allocate a smaller buffer and then
resize it in the event that it is not large enough.  If instead the number
of UTF16 code units is known from the first fragment of the message, you can
simply allocate the correct size and never have to reallocate.  So, it isn't
about the processing speed of converting UTF8->UTF16, but rather buffer
management.
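
As a rough illustration of the sizes involved, the sketch below compares the
worst-case allocation (one UTF16 code unit per byte) with the exact count a
sender could compute up front.  The function names are hypothetical, the
input is assumed to be well-formed UTF8, and the "number of UTF16 code units"
field is only the idea under discussion here, not part of any draft.

```python
def utf16_units_upper_bound(n_bytes):
    """Worst case: every byte is a US-ASCII character (one unit each)."""
    return n_bytes


def utf16_units_exact(data):
    """Count UTF16 code units needed for well-formed UTF8 bytes.

    Assumes `data` is valid UTF8; no validation is performed here.
    """
    units = 0
    for b in data:
        if b & 0xC0 == 0x80:
            # Continuation byte: belongs to a character already counted.
            continue
        # Lead bytes 0xF0 and above start 4-byte sequences, which need a
        # surrogate pair (two units); every other character needs one.
        units += 2 if b >= 0xF0 else 1
    return units


if __name__ == "__main__":
    # One character each of 1, 2, 3 and 4 bytes in UTF8.
    payload = "a\u00e9\u20ac\U0001F600".encode("utf-8")
    print(len(payload), utf16_units_upper_bound(len(payload)),
          utf16_units_exact(payload))
```

For this sample the byte length is 10 while only 5 code units are needed, so
a receiver sizing its buffer from the byte length alone would allocate twice
what it uses.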

The downside is that some implementations may not want to use a UTF16
representation of the text data, in which case the value is useless.  So, I
think if it were useful, it would have to be in addition to the overall
message length.

-- 
John A. Tamplin
Software Engineer (GWT), Google