Re: [hybi] Fragmented text message

Piotr Kulaga <piotrku@microsoft.com> Thu, 21 July 2011 16:04 UTC

From: Piotr Kulaga <piotrku@microsoft.com>
To: Takeshi Yoshino <tyoshino@google.com>, Brian <theturtle32@gmail.com>
Date: Thu, 21 Jul 2011 16:01:02 +0000
Message-ID: <ED13A76FCE9E96498B049688227AEA29388ADF4D@TK5EX14MBXC206.redmond.corp.microsoft.com>
References: <EC24CA2C319E8D47ACA5E181ABEC3E7B13BA5205BB@MCHP058A.global-ad.net> <CAE8AN_UmK-r2OskQG+QuRPgAWOg7S0BN6vfKLyDPPp2fAFDReQ@mail.gmail.com> <CAH9hSJYMJbswzpsnEmDz1CLF6bAKQQ954xyzrJ6=T1t4DoW4uw@mail.gmail.com>
In-Reply-To: <CAH9hSJYMJbswzpsnEmDz1CLF6bAKQQ954xyzrJ6=T1t4DoW4uw@mail.gmail.com>
Cc: "hybi@ietf.org" <hybi@ietf.org>
Subject: Re: [hybi] Fragmented text message

My understanding is that each WebSocket entity that uses UTF-8 must carry a valid UTF-8 stream as its payload. This covers all UTF-8 frames, including continuation frames.

I slightly prefer the approach where the UTF-8 message as a whole must contain a valid sequence, rather than each continuation frame (this is simpler for intermediaries; endpoints that stream data to the application must still handle a partially encoded UTF-8 code point). I am fine with either approach as long as it is well defined.

From: hybi-bounces@ietf.org [mailto:hybi-bounces@ietf.org] On Behalf Of Takeshi Yoshino
Sent: Thursday, July 21, 2011 2:22 AM
To: Brian
Cc: hybi@ietf.org
Subject: Re: [hybi] Fragmented text message

Agreed. Nothing in the spec disallows splitting a UTF-8 byte sequence across separate frames.

Impatient receivers must use an incremental UTF-8 decoder, as Brian explained, to decode code points as the fragments arrive. For example, in Python, codecs.getincrementaldecoder('utf-8')() does this. Most major platforms probably have a similar library.
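
As a minimal sketch of the incremental decoding described above: the function name decode_text_message and the (payload, fin) pair representation of frames are assumptions made for illustration, not part of the spec or of any WebSocket library. Only codecs.getincrementaldecoder comes from the message itself.

```python
import codecs


def decode_text_message(frames):
    """Yield decoded text for each fragment of one text message.

    `frames` is a hypothetical iterable of (payload_bytes, fin) pairs,
    where `fin` is True only for the last fragment of the message.
    """
    # A fresh decoder per message; strict mode raises UnicodeDecodeError
    # on invalid UTF-8.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for payload, fin in frames:
        # The decoder keeps any trailing partial byte sequence internally
        # and completes it with the next fragment. final=True makes it
        # raise if the message ends in the middle of a sequence.
        yield decoder.decode(payload, final=fin)


if __name__ == "__main__":
    data = "h\u00e9llo \u20ac".encode("utf-8")
    # Cut inside the two-byte and the three-byte sequences on purpose.
    frames = [(data[:2], False), (data[2:8], False), (data[8:], True)]
    for text in decode_text_message(frames):
        print(repr(text))
```

Each fragment yields only the code points that are complete so far, so the application can consume text before the whole message has arrived.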

We could also add a constraint that a text message must be fragmented at UTF-8 byte sequence boundaries, but that complicates the fragmentation code. I'm not in favor of that.
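
To give a rough idea of the extra work such a constraint would put on the sender, here is a sketch of a splitter that only cuts between code points. The function name split_on_code_points and the max_len parameter are hypothetical, and the input is assumed to be valid UTF-8 already.

```python
def split_on_code_points(payload: bytes, max_len: int) -> list:
    """Split valid UTF-8 bytes into fragments of at most max_len bytes,
    never cutting inside a multi-byte sequence."""
    if max_len < 4:
        # A UTF-8 sequence can be up to 4 bytes long.
        raise ValueError("max_len must be at least 4")
    fragments = []
    start = 0
    while start < len(payload):
        end = min(start + max_len, len(payload))
        # If the byte just after the cut is a continuation byte (10xxxxxx),
        # the cut is inside a sequence, so move it back to the lead byte.
        while end < len(payload) and (payload[end] & 0xC0) == 0x80:
            end -= 1
            if end == start:
                raise ValueError("payload is not valid UTF-8")
        fragments.append(payload[start:end])
        start = end
    return fragments
```

Without the constraint, the sender can simply slice the payload at fixed offsets, which is the simplicity being argued for here.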

On Thu, Jul 21, 2011 at 17:47, Brian <theturtle32@gmail.com> wrote:
Unless I'm mistaken, the fragmentation may occur in the middle of a multi-byte character sequence. Your code should be aware of that when decoding. My initial implementation buffers all fragments and then decodes the whole message into a string at once. I imagine you could probably inspect the last four bytes of a fragment to determine whether there's a partial UTF-8 character. If there is, you could buffer just those few bytes and decode the rest of the fragment. Then, when the next fragment comes in, prepend those bytes to the new payload and continue. Depending on your use case and what you're optimizing for, it may be more efficient to just buffer the whole message and then decode.

Brian
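
The tail-inspection idea above could look roughly like the following sketch. The names incomplete_tail_length and FragmentDecoder are invented for illustration, and this is one possible reading of the description, not the implementation referred to in the message.

```python
def incomplete_tail_length(data: bytes) -> int:
    """Return how many trailing bytes start a UTF-8 sequence that is not
    yet complete (0 if the data ends on a code point boundary)."""
    # Walk back over at most the last four bytes looking for a lead byte.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if (byte & 0xC0) == 0x80:
            continue  # continuation byte: keep looking for its lead byte
        if byte >= 0xF0:
            needed = 4
        elif byte >= 0xE0:
            needed = 3
        elif byte >= 0xC0:
            needed = 2
        else:
            needed = 1
        # The sequence is incomplete if fewer bytes are present than needed.
        return back if back < needed else 0
    # No lead byte found: leave it to the strict decode to reject the data.
    return 0


class FragmentDecoder:
    """Decode fragments as they arrive, holding back a partial tail."""

    def __init__(self):
        self._pending = b""

    def feed(self, payload: bytes, fin: bool) -> str:
        data = self._pending + payload
        # On the final fragment nothing may be held back, so a truncated
        # sequence raises UnicodeDecodeError in the decode below.
        keep = 0 if fin else incomplete_tail_length(data)
        cut = len(data) - keep
        self._pending = data[cut:]
        return data[:cut].decode("utf-8")
```

At most three bytes are ever held back between fragments, so memory use stays bounded regardless of message size, whereas buffering the whole message trades memory for simpler code.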

On Thu, Jul 21, 2011 at 1:28 AM, Kukosa, Tomas <tomas.kukosa@siemens-enterprise.com> wrote:
If a text message is fragmented, must each fragment be a valid UTF-8 string, or must only the complete defragmented message be a valid UTF-8 string?
I.e., while receiving, may I decode each fragment as UTF-8 and then join the strings, or do I need to receive all fragments and then decode only the defragmented message?