Re: [Json] Call for Consensus: Proposed Text for "8.1 Character Encoding"

Peter Cordell <petejson@codalogic.com> Tue, 14 March 2017 08:46 UTC

To: Julian Reschke <julian.reschke@gmx.de>, Matthew Miller <linuxwolf+ietf@outer-planes.net>, "json@ietf.org" <json@ietf.org>
References: <1fb5849e-8dbf-835d-65b7-2403686248f9@outer-planes.net> <b3cb2651-2d9f-d68d-2191-814e8dd5f5e2@gmx.de> <1cae01bf-721c-1fe0-46c2-8e82b5a043a7@codalogic.com>
From: Peter Cordell <petejson@codalogic.com>
Message-ID: <76b9f10c-9599-93af-546b-5769b83bdc9b@codalogic.com>
Date: Tue, 14 Mar 2017 08:46:28 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/JOEaX7yMOX7Kswlm4VEXUEhBt0c>
Cc: draft-ietf-jsonbis-rfc7159bis.all@ietf.org
Subject: Re: [Json] Call for Consensus: Proposed Text for "8.1 Character Encoding"

On 13/03/2017 22:44, Peter Cordell wrote:
> On 13/03/2017 21:50, Julian Reschke wrote:
>> I personally would prefer the table that I posted, as it directly
>> identifies byte ordering as well:
>>
>>>            00 00 00 xx  UTF-32BE
>>>            00 xx 00 xx  UTF-16BE
>>>            xx 00 00 00  UTF-32LE
>>>            xx 00 xx 00  UTF-16LE
>>>            xx xx xx xx  UTF-8
>
>
> However, the table is not correct because the second 00 in the UTF-16
> cases could be non-zero when we have a string that starts with a
> non-ASCII range character.
>
> Maybe a table in which the first n bytes match the following can be used:
>
>     00 00        UTF-32BE
>     00 xx        UTF-16BE
>     xx 00 00     UTF-32LE
>     xx 00 xx     UTF-16LE
>     xx xx        UTF-8

Seems I need to correct my correction to allow for U+0100 LATIN CAPITAL
LETTER A WITH MACRON and so on:

     00 00        UTF-32BE
     00 xx        UTF-16BE
     xx 00 00 00  UTF-32LE
     xx 00 00 xx  UTF-16LE
     xx 00 xx     UTF-16LE
     xx xx        UTF-8
     xx EOF       UTF-8 (For Carsten's recent comment)

I think that's roughly where we got last time, and sufficient doubt had
crept in that UTF-8 should be used unless the system knows the
encoding is otherwise.
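
For illustration only, the table above could be sketched in Python
roughly as follows. The function name sniff_json_encoding is
hypothetical, "xx" is read as "any non-zero byte", BOMs are not
considered, and the handling of inputs the table does not list (empty
input, a lone 00, exactly "xx 00", exactly "xx 00 00") is my own
assumption rather than anything agreed on the list:

```python
def sniff_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from its first few bytes.

    A sketch of the byte-pattern table, where "xx" means a non-zero
    byte. Returns a Python codec name. No BOM handling.
    """
    b = data[:4]
    n = len(b)

    # 00 00 -> UTF-32BE, 00 xx -> UTF-16BE
    if n >= 2 and b[0] == 0:
        return "utf-32-be" if b[1] == 0 else "utf-16-be"

    if n >= 2 and b[1] == 0:
        # xx 00 00 00 -> UTF-32LE, xx 00 00 xx -> UTF-16LE
        if n == 4 and b[2] == 0:
            return "utf-32-le" if b[3] == 0 else "utf-16-le"
        # xx 00 xx -> UTF-16LE
        if n >= 3 and b[2] != 0:
            return "utf-16-le"
        # Assumption (not in the table): exactly "xx 00" is taken to be
        # a single-character UTF-16LE text such as "1".
        if n == 2:
            return "utf-16-le"

    # xx xx, xx EOF, and anything the table does not cover
    # (assumption: fall back to UTF-8).
    return "utf-8"
```

A caller might then do data.decode(sniff_json_encoding(data)) before
parsing, though per the doubt noted above this guess should only be
used when the system has no better knowledge of the encoding.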

Pete Cordell
Codalogic Ltd
Rules for Describing JSON Content, http://json-content-rules.org