Re: [Json] In "praise" of UTF-16

Carsten Bormann <cabo@tzi.org> Tue, 03 September 2019 05:32 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAChr6SwLw9srC-9jNMp8frNbr9gSrTDDY8p-Nv9PTgQhHmTjnQ@mail.gmail.com>
Date: Tue, 03 Sep 2019 07:32:28 +0200
Cc: Nico Williams <nico@cryptonector.com>, Anders Rundgren <anders.rundgren.net@gmail.com>, "json@ietf.org" <json@ietf.org>
Message-Id: <3BD0DBAF-21DA-46D0-9BEB-0141FDBCCDF0@tzi.org>
References: <cc3dc24d-3e13-e319-e48f-7b52ddd017d0@gmail.com> <00231270-86DF-4AD2-949E-25B04D518577@tzi.org> <20190902211744.GA7920@localhost> <40386571-301A-47BD-937D-55666566CFB5@tzi.org> <20190902214047.GB7920@localhost> <E387B935-8AA9-41E3-87D1-4EE72BB34BAE@tzi.org> <CAChr6SwLw9srC-9jNMp8frNbr9gSrTDDY8p-Nv9PTgQhHmTjnQ@mail.gmail.com>
To: Rob Sayre <sayrer@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/GHKaC2XChQwFjksR2M7QoztGSBE>

On Sep 3, 2019, at 01:30, Rob Sayre <sayrer@gmail.com> wrote:
>
> On Mon, Sep 2, 2019 at 2:49 PM Carsten Bormann <cabo@tzi.org> wrote:
>> On Sep 2, 2019, at 23:40, Nico Williams <nico@cryptonector.com> wrote:
>>>
>>> Yes, I'm aware.  It's not that bad.  It's not that bad for the other
>>> camp either (since they must already have UTF-{8, 16} transliteration).
>>
>> It’s trivial for the UTF-16 side because JSON needs conversion to UTF-8 already.
>> Only if you then want to carry around the JSON as UTF-16 inside your program (which appears to be something that some Java people like) the whole thing becomes ugly.
>
> Doesn’t this argument miss the escaping syntax that JSON requires?

A (sanely designed) canonical encoding won’t ever use that, so you don’t have to implement it even at the decoder.

> 'To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair.  So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".'

Yes, but you never escape these.  The only ones that you do escape are the ones you MUST:

   … the characters that MUST be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).  [p. 8]

So, even if your string source is non-UTF-8, all escaping can be done *after* UTF-8 conversion, looking at the ASCII characters (high bit of the byte unset) only.
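
To make that concrete, here is a minimal sketch in Python of escaping after UTF-8 conversion. The function name escape_json_utf8 and the sample string are made up for illustration, and using the two-character forms \" and \\ (rather than \u0022 and \u005c) is my assumption about what a canonical form would pick. The loop only ever acts on bytes below 0x80; everything with the high bit set is copied through untouched.

```python
import json


def escape_json_utf8(raw: bytes) -> bytes:
    """Wrap UTF-8 bytes as a JSON string, escaping only what MUST be escaped.

    Hypothetical helper: it assumes `raw` is already valid UTF-8.
    """
    # Two-character escapes for quotation mark and reverse solidus (an assumption).
    short = {0x22: b'\\"', 0x5C: b'\\\\'}
    out = bytearray(b'"')
    for b in raw:
        if b in short:
            out += short[b]
        elif b < 0x20:
            # Control characters U+0000 through U+001F
            out += b'\\u%04x' % b
        else:
            # Other ASCII, plus every byte with the high bit set
            # (lead and continuation bytes of multi-byte sequences)
            out.append(b)
    out += b'"'
    return bytes(out)


# A string source that happens to be UTF-16: convert first, escape afterwards.
source_utf16 = 'G clef: \U0001D11E, "quoted"\n'.encode("utf-16-le")
as_utf8 = source_utf16.decode("utf-16-le").encode("utf-8")
encoded = escape_json_utf8(as_utf8)

# Round trip through a stock decoder gives back the original text.
assert json.loads(encoded) == 'G clef: \U0001D11E, "quoted"\n'
```

The escaper never needs to know about surrogates; the G clef goes out as its four UTF-8 bytes.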

(For those listening in and not understanding what the disagreement could possibly be here:
It is a common misunderstanding of the cited paragraph at the start of page 9 of RFC 8259 that this mandates escaping astral code points.  No, it just says how you do it if you want to.  But you don’t want to.)
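
A small illustration, using Python's standard json module as just one example of a conforming implementation: the escaped and the unescaped form of the G clef decode to the same string, and an encoder can be told to leave the character alone.

```python
import json

G_CLEF = "\U0001D11E"

# The 12-character surrogate-pair escape and the literal character
# are two spellings of the same JSON string.
escaped_text = '"\\uD834\\uDD1E"'
literal_text = '"' + G_CLEF + '"'
assert json.loads(escaped_text) == json.loads(literal_text) == G_CLEF

# With ensure_ascii=False the encoder emits the character itself, no escape.
unescaped_out = json.dumps(G_CLEF, ensure_ascii=False)
assert "\\u" not in unescaped_out

# The default (ensure_ascii=True) chooses to escape; that is an option,
# not something the quoted paragraph requires.
escaped_out = json.dumps(G_CLEF)
assert json.loads(escaped_out) == G_CLEF
```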

Grüße, Carsten