Re: [Json] Call for Consensus: Proposed Text for "8.1 Character Encoding"

Carsten Bormann <cabo@tzi.org> Mon, 13 March 2017 21:12 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <1fb5849e-8dbf-835d-65b7-2403686248f9@outer-planes.net>
Date: Mon, 13 Mar 2017 22:12:09 +0100
Message-Id: <3B3F2181-6C5D-43C0-BCD9-8D4BA05E6C03@tzi.org>
References: <1fb5849e-8dbf-835d-65b7-2403686248f9@outer-planes.net>
To: Matthew Miller <linuxwolf+ietf@outer-planes.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/8eT6j9ffGvWO7YitYZtdmrwK2ho>
Cc: draft-ietf-jsonbis-rfc7159bis.all@ietf.org, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Call for Consensus: Proposed Text for "8.1 Character Encoding"

For the record…

> On 13 Mar 2017, at 22:06, Matthew Miller <linuxwolf+ietf@outer-planes.net> wrote:
> 
> Hello JSONbis,
> 
> The security directorate review discussion has raised the issue of
> encoding detection.  The original table from RFC 4627 was removed from
> RFC 7159 due to a lack of consensus.  In this latest round, a number of
> comments have been made supporting (and opposing) adding more guidance
> than is currently present.
> 
> The chair asks for a call on the following from the working group:
> 
> 1) Does the working group think adding any text on how to detect the
> encoding is worthwhile?

No, that would be a regression into maintaining the fiction that UTF-16 and UTF-32 versions of JSON are being used in interchange.

> 2a) If such text is worthwhile, is the following proposed text from Nico
> Williams acceptable (to be appended to Section 8.1)?
> 
> """
>   Implementors MAY count the number of ASCII NULs in the first four
>   bytes of any JSON text to detect which of UTF-8, UTF-16, or UTF-32
>   the text is encoded in:
> 
>    - if the count is zero, then the text is encoded in UTF-8
>    - if the count is one or two, then the text is encoded in UTF-16
>    - if the count is three, then the text is encoded in UTF-32
> 
>   This results from a) JSON texts having to start with an ASCII
>   character, b) no unescaped NULs being allowed in JSON strings, and c)
>   any type being allowed at the top-level, thus the first character may
>   be a double-quote and the second may be any permissible, unescaped
>   Unicode codepoint.  An ASCII character requires a NUL-valued byte in
>   UTF-16 encoding, three in UTF-32, and none in UTF-8.
> 
> """

Not sure if I’m allowed to note that after saying no above, but not all JSON documents have four bytes.
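
To make the quoted rule and the objection concrete, here is a minimal Python sketch of the NUL-counting heuristic. The function name `guess_json_encoding` is hypothetical, and returning `None` for inputs shorter than four bytes is an assumption: the proposed text does not say what to do in that case, which is the gap noted above.

```python
from typing import Optional


def guess_json_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding family from the NUL count in the first four bytes.

    Follows the quoted proposal; returns None where the proposal gives no answer.
    """
    if len(data) < 4:
        # Assumption: the quoted rule is silent on inputs under four bytes
        # (for example the one-byte UTF-8 JSON text b"1").
        return None
    nuls = data[:4].count(0)
    if nuls == 0:
        return "utf-8"
    if nuls in (1, 2):
        return "utf-16"
    if nuls == 3:
        return "utf-32"
    # Four NULs cannot start a JSON text, so no guess is made.
    return None
```

For example, `guess_json_encoding('{"a":1}'.encode("utf-16-le"))` sees the bytes `7b 00 22 00`, counts two NULs, and reports UTF-16, while `guess_json_encoding(b"1")` yields no answer at all.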

> 2b) If such text is worthwhile but Nico's proposal is not worthwhile,
> what would be acceptable?

Again, not worthwhile, but maybe it wouldn’t hurt to mention that implementations that want to guard against erroneously encoded input can detect ASCII NULs in the input and even use those to predict whether the encoder was using one of the UTF-16s or one of the UTF-32s.
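
As an illustration of that kind of guard, the following Python sketch treats any NUL byte as a sign of erroneously encoded input and then guesses which encoder produced it. The name `diagnose_nuls`, the `"unknown"` result, and the use of the first byte to guess byte order are assumptions made for this example; the byte-order guess relies on a JSON text starting with a non-NUL ASCII character.

```python
from typing import Optional


def diagnose_nuls(data: bytes) -> Optional[str]:
    """Return None if data contains no NUL byte, else a guess at the encoder."""
    if 0 not in data:
        return None  # no NULs: nothing suggests UTF-16 or UTF-32
    head = data[:4]
    nuls = head.count(0)
    if nuls == 3:
        width = "utf-32"
    elif nuls in (1, 2):
        width = "utf-16"
    else:
        # NULs appear only later in the input (or the head is all NULs),
        # so the leading bytes give no usable hint.
        return "unknown"
    # The first character is ASCII, so a leading NUL byte means the
    # high-order bytes come first (big-endian).
    order = "be" if data[0] == 0 else "le"
    return f"{width}-{order}"
```

A receiver that only accepts UTF-8 could call this before parsing and reject the input with a more helpful error whenever the result is not `None`.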

Grüße, Carsten