Re: [Json] In "praise" of UTF-16

Carsten Bormann <cabo@tzi.org> Mon, 02 September 2019 21:00 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <cc3dc24d-3e13-e319-e48f-7b52ddd017d0@gmail.com>
Date: Mon, 02 Sep 2019 22:59:59 +0200
Cc: "json@ietf.org" <json@ietf.org>
Message-Id: <00231270-86DF-4AD2-949E-25B04D518577@tzi.org>
References: <cc3dc24d-3e13-e319-e48f-7b52ddd017d0@gmail.com>
To: Anders Rundgren <anders.rundgren.net@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/atI1VncQ3iMYcSV0sKYAk1lPlWg>
Subject: Re: [Json] In "praise" of UTF-16

Hi Anders,

> On Aug 31, 2019, at 17:39, Anders Rundgren <anders.rundgren.net@gmail.com> wrote:
> 
> I-D dealing with canonical JSON serialization is currently in the IETF ISE queue

I don’t really care much about this draft, as I have pointed out.
(Mainly not because deterministic encoding is wrong, but because the reasons you want to define a “standard” for it are wrong.  But that has been said already in different messages.)

So please do go ahead and taint your scheme further with UTF-16 legacy stuff.

For those people who missed the early history of ISO 10646 (aka Unicode):
Unicode was supposed to be the 16-bit character set to end all 8-bit character sets.
We were all supposed to move to UCS-2 (a 2-byte, 16-bit representation where the standards bodies couldn’t even decide the endianness, so we actually have two of those).

One important language designed in the early-to-mid-1990s got stuck with this: Java.  As did Windows.
Pioneers get shot full of arrows.  
The .NET system designed for Windows as a competitor to the JVM (Java Virtual Machine), as well as JavaScript (which is not Java but usurped the name), also inherited 16-bit characters.

Then it dawned on people that 16 bits were not enough, and after a short excursion to a 31-bit system, UTF-16 was invented as a hack to accept more than 63488 [sic] codepoints, with the built-in problem that the “astral” codepoints (beyond 2**16) were infrequent enough that people built endless amounts of bugs into software that used it.
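
For readers who have not met the hack up close, here is a minimal Python sketch of how UTF-16 maps an astral codepoint onto two 16-bit code units (a surrogate pair). The function name is made up for illustration; the arithmetic is the standard UTF-16 encoding form. The figure 63488 is read here as the 65536 values of a 16-bit unit minus the 2048 values reserved for surrogates.

```python
def to_utf16_code_units(cp):
    """Encode one Unicode code point as a list of UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # Basic Multilingual Plane: one code unit, numerically equal to cp.
        return [cp]
    # Astral plane: split the 20-bit offset into two 10-bit halves.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


# 16-bit values left for real characters once the surrogates are set aside.
BMP_NON_SURROGATE = 0x10000 - (0xDFFF - 0xD800 + 1)

emoji = "\U0001F600"                        # one code point, beyond 2**16
units = to_utf16_code_units(ord(emoji))     # two code units
print([hex(u) for u in units], BMP_NON_SURROGATE)
```

Code that counts or indexes 16-bit units sees two "characters" where there is one codepoint, which is where the bugs tend to come from.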

The 16-bit legacy embedded into Java and its contemporaries has not been stamped out since.  The main reason is that while UTF-16 is a hack that hurts, it does not hurt enough to sustain deployment of a fix (remember that even Python is still struggling this very year with getting its Unicode fixes deployed).

JSON of course inherited the 16-bit legacy from JavaScript in its string escape scheme, but otherwise did come out pretty much unharmed and managed to focus on UTF-8, like everything else relevant to interchange.
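
A small sketch of that inheritance, using Python's standard `json` module: a codepoint beyond 2**16 can only be written in `\u` escape form as a UTF-16 surrogate pair, while the same string goes onto the wire unescaped as plain UTF-8. The sample string is arbitrary.

```python
import json

s = "\U0001F600"  # a single code point outside the 16-bit range

# ASCII-only output: the escape scheme forces a surrogate pair.
escaped = json.dumps(s, ensure_ascii=True)

# UTF-8 output: no escapes, no surrogates, just four bytes between quotes.
raw_utf8 = json.dumps(s, ensure_ascii=False).encode("utf-8")

# Both forms decode to the same one-code-point string.
roundtrip = json.loads(escaped)

print(escaped, raw_utf8, roundtrip == s)
```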

Defining a deterministic encoding scheme (“canonicalization”) for JSON in 2019 that needs a detour through UTF-16-land looks like a cruel joke.  If you only ever care about Java and its contemporaries, it may actually seem practical to you.  Given that most map keys will be ASCII anyway, people will certainly find shortcuts (also known as sleeping interoperability problems) around the issue.  A protocol designer would not touch this with a 16-foot pole.
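
To make the "sleeping interoperability problem" concrete, here is a sketch that assumes the detour in question is sorting map keys by their UTF-16 code units (my reading of the draft, not something spelled out in this message). An implementation that takes the shortcut of sorting by UTF-8 bytes or by codepoint agrees on ASCII keys and silently disagrees once a key mixes astral codepoints with BMP codepoints above the surrogate range. The sample keys are arbitrary.

```python
def utf16_key(s):
    # Big-endian bytes compare in the same order as the 16-bit code units.
    return s.encode("utf-16-be")


def utf8_key(s):
    # UTF-8 byte order is the same as code point order.
    return s.encode("utf-8")


keys = ["\U0001F600", "\uFB01", "b", "a"]

by_utf16 = sorted(keys, key=utf16_key)   # astral key sorts BEFORE U+FB01
by_utf8 = sorted(keys, key=utf8_key)     # astral key sorts AFTER U+FB01
by_code_point = sorted(keys)             # Python's native str ordering

# With ASCII-only keys the shortcut is invisible.
ascii_only = [k for k in keys if k.isascii()]
same_for_ascii = sorted(ascii_only, key=utf16_key) == sorted(ascii_only, key=utf8_key)

print(by_utf16 == by_utf8, same_for_ascii)
```

The two orderings differ only for inputs that most test suites never contain, which is what lets the problem sleep.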

Grüße, Carsten