Re: [hybi] Fragmented text message

Takeshi Yoshino <tyoshino@google.com> Thu, 21 July 2011 09:22 UTC

From: Takeshi Yoshino <tyoshino@google.com>
Date: Thu, 21 Jul 2011 18:22:25 +0900
Message-ID: <CAH9hSJYMJbswzpsnEmDz1CLF6bAKQQ954xyzrJ6=T1t4DoW4uw@mail.gmail.com>
To: Brian <theturtle32@gmail.com>
Cc: "hybi@ietf.org" <hybi@ietf.org>
Subject: Re: [hybi] Fragmented text message

Agreed. Nothing in the spec disallows splitting a UTF-8 byte sequence across
separate frames.

Receivers that want to decode text before the whole message has arrived must
use an incremental UTF-8 decoder, as Brian explained, to get code points
decoded. For example, in Python, codecs.getincrementaldecoder('utf-8')() does
this. Most major platforms probably have a similar library.
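
To make that concrete, here is a minimal sketch of decoding frame by frame
with the incremental decoder. The helper name decode_fragments and the sample
payloads are my own illustration, not anything from the spec, and I assume the
payloads are already unmasked and collected in a list.

```python
import codecs


def decode_fragments(fragments):
    """Yield the text that becomes decodable after each frame payload.

    `fragments` is a list of bytes objects, one per frame of a single
    text message (hypothetical input shape, chosen for illustration).
    """
    decoder = codecs.getincrementaldecoder('utf-8')()
    last = len(fragments) - 1
    for i, payload in enumerate(fragments):
        # final=True on the last frame makes the decoder raise
        # UnicodeDecodeError if a byte sequence is left unfinished.
        yield decoder.decode(payload, final=(i == last))


# "é" is the two bytes 0xC3 0xA9; here it is split across two frames.
frames = [b'caf\xc3', b'\xa9 ok']
print(list(decode_fragments(frames)))  # ['caf', 'é ok']
```

The decoder holds back the dangling 0xC3 from the first frame and emits the
complete character once the second frame arrives.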

We could also add a constraint that a text message must be fragmented at a
UTF-8 byte sequence boundary, but that complicates the fragmentation code. I'm
not in favor of it.
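
To illustrate the extra work such a constraint would put on senders, here is
a rough sketch of a splitter that never cuts a character. The function name
split_at_char_boundaries and the max_len parameter are hypothetical, and the
sketch assumes the input is already valid UTF-8.

```python
def split_at_char_boundaries(data: bytes, max_len: int):
    """Split valid UTF-8 bytes into chunks of at most max_len bytes,
    never cutting a multi-byte character in two."""
    if max_len < 4:
        # A UTF-8 sequence is at most 4 bytes long, so smaller chunks
        # could not hold every character.
        raise ValueError("max_len must be at least 4")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_len, len(data))
        # If the next chunk would start with a continuation byte
        # (bit pattern 10xxxxxx), back up to the lead byte.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks


# U+20AC (euro sign) is the three bytes 0xE2 0x82 0xAC.
print(split_at_char_boundaries('ab\u20ac\u20ac'.encode('utf-8'), 4))
# [b'ab', b'\xe2\x82\xac', b'\xe2\x82\xac']
```

A plain fixed-size split would need none of this backtracking, which is the
complication I mean.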

On Thu, Jul 21, 2011 at 17:47, Brian <theturtle32@gmail.com> wrote:

> Unless I'm mistaken, the fragmentation may occur in the middle of a
> multi-byte character sequence.  Your code should be aware of that when
> decoding.  My initial implementation buffers all fragments and then decodes
> the whole message into a string at once.  I imagine you could probably
> inspect the last four bytes of a fragment to determine whether there's a
> partial UTF-8 character.  If there is, you could buffer just those few bytes
> and decode the rest of the fragment.  Then when the next fragment comes in,
> prepend those bytes to the new payload and continue. Depending on your use
> case and what you're optimizing for, it may be more efficient to just buffer
> the whole message and then decode.
>
> Brian
>
>
> On Thu, Jul 21, 2011 at 1:28 AM, Kukosa, Tomas <
> tomas.kukosa@siemens-enterprise.com> wrote:
>
>> If a text message is fragmented, must each fragment be a valid UTF-8
>> string, or must only the complete defragmented message be a valid UTF-8
>> string? I.e., may I decode each fragment as UTF-8 while receiving and then
>> join the strings, or do I need to receive all fragments and then decode
>> only the defragmented message?
>>
>
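
For reference, the manual approach Brian describes in the quoted message
(inspect the tail of each fragment, hold back a partial character, prepend it
to the next payload) could look roughly like the sketch below. The names
incomplete_tail_length and FragmentDecoder are hypothetical, and this is only
one way to do it under the assumption that payloads arrive as bytes.

```python
def incomplete_tail_length(data: bytes) -> int:
    """Return how many trailing bytes form an unfinished UTF-8 sequence."""
    # Walk back over at most the last 4 bytes looking for a lead byte.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if (byte & 0xC0) == 0x80:
            continue  # continuation byte, keep looking for its lead byte
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1  # plain ASCII byte
        return back if back < need else 0
    # No lead byte found: leave it to the decoder to report the error.
    return 0


class FragmentDecoder:
    """Decode each fragment, buffering only a trailing partial character."""

    def __init__(self):
        self.pending = b''

    def feed(self, payload: bytes, final: bool = False) -> str:
        data = self.pending + payload
        # On the final fragment nothing is held back, so an unfinished
        # sequence raises UnicodeDecodeError in decode() below.
        keep = 0 if final else incomplete_tail_length(data)
        cut = len(data) - keep
        self.pending = data[cut:]
        return data[:cut].decode('utf-8')


decoder = FragmentDecoder()
print(decoder.feed(b'caf\xc3'))              # caf
print(decoder.feed(b'\xa9 ok', final=True))  # é ok
```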