Re: [hybi] Fragmented text message

Piotr Kulaga <piotrku@microsoft.com> Thu, 21 July 2011 16:04 UTC

From: Piotr Kulaga <piotrku@microsoft.com>
To: Takeshi Yoshino <tyoshino@google.com>, Brian <theturtle32@gmail.com>
Date: Thu, 21 Jul 2011 16:01:02 +0000
Cc: "hybi@ietf.org" <hybi@ietf.org>

My understanding is that each WebSocket entity that uses UTF-8 must have a valid UTF-8 stream as its payload. This covers all UTF-8 frames, including continuation frames.

I slightly prefer the approach where the UTF-8 message as a whole must contain a valid sequence, rather than each continuation frame. That is simpler for intermediaries, and endpoints that stream data to the application must still handle a partially encoded UTF-8 code point anyway. I am fine with either approach as long as it is well defined.

From: hybi-bounces@ietf.org [mailto:hybi-bounces@ietf.org] On Behalf Of Takeshi Yoshino
Sent: Thursday, July 21, 2011 2:22 AM
To: Brian
Cc: hybi@ietf.org
Subject: Re: [hybi] Fragmented text message

Agreed. Nothing in the spec disallows splitting a UTF-8 byte sequence across separate frames.

Impatient receivers that want code points decoded before the whole message arrives must use an incremental UTF-8 decoder of the kind Brian described. In Python, for example, codecs.getincrementaldecoder('utf-8')() does this. Most major platforms probably have a similar library.
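
A minimal sketch of that incremental approach, using the codecs call named above. The function name decode_fragments and the sample payloads are illustrative assumptions, not anything from the spec or from an existing implementation.

```python
import codecs


def decode_fragments(fragments):
    """Yield the text decoded from each fragment payload as it arrives.

    `fragments` is any iterable of bytes objects (hypothetical frame payloads).
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for payload in fragments:
        # Bytes of an incomplete trailing sequence are held inside the
        # decoder and emitted once the following fragment completes them.
        yield decoder.decode(payload)
    # final=True raises UnicodeDecodeError if the message ended in the
    # middle of a multi-byte sequence.
    yield decoder.decode(b"", final=True)


# "héllo €" split in the middle of both multi-byte characters.
parts = [b"h\xc3", b"\xa9llo \xe2\x82", b"\xac"]
text = "".join(decode_fragments(parts))
```

Each fragment yields whatever complete code points are available so far, so the application can consume text without waiting for the final frame.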

We could also add a constraint that a text message must be fragmented at a UTF-8 byte sequence boundary, but that complicates the fragmentation code. I'm not in favor of that.
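
For illustration only, here is roughly what that extra constraint would ask of a sender. The function name fragment_at_boundaries and the max_size parameter are hypothetical, and the sketch assumes the payload is already valid UTF-8.

```python
def fragment_at_boundaries(payload: bytes, max_size: int):
    """Split `payload` into chunks of at most `max_size` bytes without
    cutting through a multi-byte UTF-8 sequence."""
    if max_size < 4:
        # A single code point can take up to 4 bytes.
        raise ValueError("max_size must be at least 4")
    fragments = []
    start = 0
    while start < len(payload):
        end = min(start + max_size, len(payload))
        # If the next fragment would begin with a continuation byte
        # (10xxxxxx), back up until it begins with a lead or ASCII byte.
        while end < len(payload) and end > start and payload[end] & 0xC0 == 0x80:
            end -= 1
        if end == start:
            raise ValueError("payload is not valid UTF-8")
        fragments.append(payload[start:end])
        start = end
    return fragments


chunks = fragment_at_boundaries("ab€€".encode("utf-8"), 4)
```

Without the constraint, the sender can simply slice the payload every max_size bytes, which is the simplicity being argued for here.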

On Thu, Jul 21, 2011 at 17:47, Brian <theturtle32@gmail.com> wrote:
Unless I'm mistaken, the fragmentation may occur in the middle of a multi-byte character sequence. Your code should be aware of that when decoding. My initial implementation buffers all fragments and then decodes the whole message into a string at once. I imagine you could inspect the last four bytes of a fragment to determine whether it ends with a partial UTF-8 character. If it does, you could buffer just those few bytes and decode the rest of the fragment. Then, when the next fragment comes in, prepend those bytes to the new payload and continue. Depending on your use case and what you're optimizing for, it may be more efficient to just buffer the whole message and then decode.
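
A rough sketch of the tail-inspection idea described above. The names split_incomplete_tail and FragmentDecoder, and the fin flag on feed, are assumptions made for illustration and do not reflect any particular implementation.

```python
def split_incomplete_tail(data: bytes):
    """Return (complete, tail), where `tail` is a trailing partial
    UTF-8 sequence (possibly empty)."""
    # Walk back over at most four bytes looking for the last lead byte.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte; keep looking for its lead byte
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1  # ASCII
        if need > back:
            return data[:-back], data[-back:]
        return data, b""
    # Empty input, or four continuation bytes in a row (invalid);
    # the strict decode below reports the latter.
    return data, b""


class FragmentDecoder:
    def __init__(self):
        self._pending = b""

    def feed(self, payload: bytes, fin: bool = False) -> str:
        data = self._pending + payload
        if fin:
            # Last fragment: everything must decode, including any tail.
            self._pending = b""
            return data.decode("utf-8")
        complete, self._pending = split_incomplete_tail(data)
        return complete.decode("utf-8")


d = FragmentDecoder()
out = [d.feed(b"h\xc3"), d.feed(b"\xa9llo \xe2\x82"), d.feed(b"\xac", fin=True)]
```

At most three bytes are ever held back between fragments, compared with the whole message in the buffer-everything approach.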

Brian

On Thu, Jul 21, 2011 at 1:28 AM, Kukosa, Tomas <tomas.kukosa@siemens-enterprise.com> wrote:
If a text message is fragmented, must each fragment be a valid UTF-8 string, or must only the complete defragmented message be a valid UTF-8 string?
I.e., while receiving, may I decode each fragment as UTF-8 and then join the strings, or do I need to receive all fragments and then decode only the defragmented message?
