Re: [Json] First two characters

Nico Williams <nico@cryptonector.com> Fri, 22 November 2013 22:22 UTC

Date: Fri, 22 Nov 2013 16:19:41 -0600
From: Nico Williams <nico@cryptonector.com>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: json@ietf.org
Subject: Re: [Json] First two characters

On Fri, Nov 22, 2013 at 07:59:36PM +0100, Bjoern Hoehrmann wrote:
>   Now that string literals can occur at the top level, the statement in
> 8.1 http://tools.ietf.org/html/draft-ietf-json-rfc4627bis-07 "the first
> two characters of a JSON text will always be ASCII characters" is
> incorrect and needs to be changed alongside changing the JSON-text
> grammar.

Good point.  Using xx to mean 0x01..0x7F, XX to mean 0x01..0xFF, ZZ to
mean 0x00..0xFF, and 00 to mean 0x00, the leading bytes of a BOM-less
JSON text can match these patterns:

   xx 00 00 00  UTF-32LE
   00 00 00 xx  UTF-32BE
   xx 00 ZZ ZZ  UTF-16LE
         but ZZ bytes cannot both be 00
   00 xx ZZ ZZ  UTF-16BE
         but ZZ bytes cannot both be 00
   xx XX        UTF-8
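
To make the notation concrete, here is a rough Python sketch of the
table as data.  The names CLASSES, PATTERNS, matches and
candidate_encodings are made up for illustration and are not from the
draft.  It assumes at least four bytes are available for the UTF-16 and
UTF-32 patterns, and it returns every pattern that fits so you can see
that the five cases do not overlap.

```python
# Byte classes from the notation above, as inclusive (low, high) ranges.
CLASSES = {
    "xx": (0x01, 0x7F),
    "XX": (0x01, 0xFF),
    "ZZ": (0x00, 0xFF),
    "00": (0x00, 0x00),
}

# The five BOM-less patterns, in the order they are listed above.
PATTERNS = [
    ("UTF-32LE", ["xx", "00", "00", "00"]),
    ("UTF-32BE", ["00", "00", "00", "xx"]),
    ("UTF-16LE", ["xx", "00", "ZZ", "ZZ"]),
    ("UTF-16BE", ["00", "xx", "ZZ", "ZZ"]),
    ("UTF-8", ["xx", "XX"]),
]


def matches(data, pattern):
    """Return True if the leading bytes of data fit the pattern."""
    if len(data) < len(pattern):
        return False
    for byte, cls in zip(data, pattern):
        low, high = CLASSES[cls]
        if not low <= byte <= high:
            return False
    return True


def candidate_encodings(data):
    """Return the name of every pattern that fits the start of data."""
    found = []
    for name, pattern in PATTERNS:
        if not matches(data, pattern):
            continue
        # The two ZZ bytes of a UTF-16 pattern cannot both be 00.
        if name.startswith("UTF-16") and data[2] == 0 and data[3] == 0:
            continue
        found.append(name)
    return found
```

For a top-level string such as "\u00e9" encoded five ways, each input
should produce a one-element list.  Inputs shorter than a pattern simply
do not match it in this sketch.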

Simplified, assuming the patterns are checked in this order and the
first match wins:

   00 00 00 xx  UTF-32BE
   00 xx        UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00        UTF-16LE
   xx           UTF-8
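
The sequential version might look roughly like this in Python.  This is
only a sketch: detect_sequential and is_xx are hypothetical names, the
input is assumed to be a bytes object, and returning None for anything
that fits no pattern is my own choice.

```python
def is_xx(byte):
    """True for the 'xx' class: a non-NUL ASCII byte (0x01..0x7F)."""
    return 0x01 <= byte <= 0x7F


def detect_sequential(data):
    """Guess the encoding of a BOM-less JSON text, first match wins."""
    n = len(data)
    # 00 00 00 xx  UTF-32BE
    if n >= 4 and data[:3] == b"\x00\x00\x00" and is_xx(data[3]):
        return "UTF-32BE"
    # 00 xx        UTF-16BE
    if n >= 2 and data[0] == 0 and is_xx(data[1]):
        return "UTF-16BE"
    # xx 00 00 00  UTF-32LE
    if n >= 4 and is_xx(data[0]) and data[1:4] == b"\x00\x00\x00":
        return "UTF-32LE"
    # xx 00        UTF-16LE
    if n >= 2 and is_xx(data[0]) and data[1] == 0:
        return "UTF-16LE"
    # xx           UTF-8
    if n >= 1 and is_xx(data[0]):
        return "UTF-8"
    # Empty input, or a first byte in 0x80..0xFF (e.g. a BOM).
    return None
```

Because the longer patterns are tested before their shorter prefixes,
no "both ZZ bytes are not 00" check is needed here.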

Optimized for the common cases (still checked in order, most likely
encoding first):

   xx XX        UTF-8    (XX > 0)
   xx 00 ZZ ZZ  UTF-16LE (ZZ can be any value, but both ZZs can't be 0)
   00 xx        UTF-16BE
   00 00 00 xx  UTF-32BE
   xx 00 00 00  UTF-32LE
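
A rough Python sketch of this ordering follows.  The names
detect_common_first and is_xx are made up, the input is assumed to be a
bytes object, and two things are my own assumptions rather than part of
the table: a lone xx byte is treated as UTF-8, and xx 00 with fewer than
four bytes available is treated as UTF-16LE.

```python
def is_xx(byte):
    """True for the 'xx' class: a non-NUL ASCII byte (0x01..0x7F)."""
    return 0x01 <= byte <= 0x7F


def detect_common_first(data):
    """Guess the encoding of a BOM-less JSON text, common cases first."""
    n = len(data)
    if n == 0:
        return None
    # xx XX        UTF-8 (XX > 0); a lone xx byte also lands here.
    if is_xx(data[0]) and (n == 1 or data[1] != 0):
        return "UTF-8"
    # xx 00 ZZ ZZ  UTF-16LE (both ZZs can't be 0).
    # If the first byte is xx, the second byte must be 00 at this point.
    if is_xx(data[0]) and (n < 4 or data[2] != 0 or data[3] != 0):
        return "UTF-16LE"
    # 00 xx        UTF-16BE
    if n >= 2 and data[0] == 0 and is_xx(data[1]):
        return "UTF-16BE"
    # 00 00 00 xx  UTF-32BE
    if n >= 4 and data[:3] == b"\x00\x00\x00" and is_xx(data[3]):
        return "UTF-32BE"
    # xx 00 00 00  UTF-32LE
    if n >= 4 and is_xx(data[0]) and data[1:4] == b"\x00\x00\x00":
        return "UTF-32LE"
    return None
```

A UTF-8 text is decided after looking at only two bytes, which is the
point of putting it first.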

Nico