[Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)

"Pete Cordell" <petejson@codalogic.com> Thu, 14 November 2013 12:04 UTC

From: Pete Cordell <petejson@codalogic.com>
To: Paul Hoffman <paul.hoffman@vpnc.org>, Joe Hildebrand Hildebrand <jhildebr@cisco.com>
Cc: www-tag@w3.org, JSON WG <json@ietf.org>

Original Message From: "Joe Hildebrand" <hildjj@cursive.net>

> On 11/13/13 2:27 PM, "Paul Hoffman" <paul.hoffman@vpnc.org> wrote:
>
>><no hat>
>>
>>On Nov 13, 2013, at 12:24 PM, Joe Hildebrand (jhildebr)
>><jhildebr@cisco.com> wrote:
>>
>>> We would also need to change section 8.1 according to the mechanism that
>>> was previously proposed:
>>>
>>> 00 00 00 xx  UTF-32BE
>>> 00 xx ?? xx  UTF-16BE
>>> xx 00 00 00  UTF-32LE
>>> xx 00 xx ??  UTF-16LE
>>> xx xx ?? ??  UTF-8
>>>
>>>
>>> in order to account for strings at the top level whose first character
>>> has a codepoint greater than 127.
>>
>>A string at the top level of a JSON text still needs to start with an
>>ASCII " character, so the logic is still fine, I believe.
>
>
> Without top level strings, the first *two* characters of any JSON text are
> always ASCII.  This:
>
>
> "Ā"  (that's U+0022 U+0100 U+0022)
>
> ...
>
> So the JSON text above would not match any of the table entries, causing
> an error.


In http://www.ietf.org/mail-archive/web/json/current/msg00565.html I
mentioned that we also need to allow for characters such as U+2C00 to be the
first character in a quoted string. In UTF-16LE, a quote followed by U+2C00
is encoded as 22 00 00 2C.

This requires a pattern like:

    xx 00 00 xx  UTF-16LE

giving:

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   00 xx xx 00  UTF-16BE
   00 xx xx xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 00 xx  UTF-16LE
   xx 00 xx 00  UTF-16LE
   xx 00 xx xx  UTF-16LE
   xx xx xx xx  UTF-8
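
As a quick check of where these patterns come from, here is a minimal Python sketch that prints the zero/non-zero pattern of the first four bytes of a sample text in each encoding. The helper name signature and the sample text are my own choices for illustration, not anything from the draft, and it assumes the text is encoded without a byte order mark.

```python
def signature(text: str, encoding: str) -> str:
    """Return the zero/non-zero pattern of the first four encoded bytes."""
    first_four = text.encode(encoding)[:4]
    return " ".join("00" if b == 0 else "xx" for b in first_four)


# Sample JSON text: a quote, U+2C00, and a closing quote.
sample = '"\u2c00"'

# The explicit -be/-le codec names do not emit a byte order mark.
for enc in ("utf-32-be", "utf-16-be", "utf-32-le", "utf-16-le", "utf-8"):
    print(f"{signature(sample, enc)}  {enc}")
```

Swapping in other sample texts (for example one starting with U+0100) shows which rows of the table each case exercises.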

That can be reduced a bit if we use "--" to indicate "not-tested":

   00 00 -- --  UTF-32BE
   00 xx -- --  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 00 xx  UTF-16LE
   xx 00 xx --  UTF-16LE
   xx xx -- --  UTF-8
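
For what it's worth, the reduced table could be coded roughly as follows. This is only a sketch in Python under my own assumptions: the function name detect_encoding is hypothetical, the input has no byte order mark, and at least four bytes are available (shorter texts are not covered by the table and are rejected here).

```python
def detect_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from its first four bytes.

    Follows the reduced table above. Assumes no byte order mark and
    at least four bytes of input (an assumption of this sketch).
    """
    if len(data) < 4:
        raise ValueError("need at least four bytes")

    # zero[i] is True where byte i is 00, False where it is non-zero (xx).
    zero = [b == 0 for b in data[:4]]

    if zero[0]:
        # 00 00 -- --  UTF-32BE
        # 00 xx -- --  UTF-16BE
        return "utf-32-be" if zero[1] else "utf-16-be"
    if not zero[1]:
        # xx xx -- --  UTF-8
        return "utf-8"
    if zero[2] and zero[3]:
        # xx 00 00 00  UTF-32LE
        return "utf-32-le"
    # xx 00 00 xx  or  xx 00 xx --  UTF-16LE
    return "utf-16-le"


if __name__ == "__main__":
    text = '"\u2c00"'
    for enc in ("utf-32-be", "utf-16-be", "utf-32-le", "utf-16-le", "utf-8"):
        print(enc, "->", detect_encoding(text.encode(enc)))
```

Only the first two bytes need testing in four of the six rows, which is the point of the "--" entries.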


Pete Cordell
Codalogic Ltd
C++ tools for C++ programmers, http://codalogic.com
Read & write XML in C++, http://www.xml2cpp.com