Re: [Json] In "praise" of UTF-16

Carsten Bormann <cabo@tzi.org> Mon, 02 September 2019 21:34 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <20190902211744.GA7920@localhost>
Date: Mon, 02 Sep 2019 23:34:52 +0200
Cc: "json@ietf.org" <json@ietf.org>, Anders Rundgren <anders.rundgren.net@gmail.com>
Message-Id: <40386571-301A-47BD-937D-55666566CFB5@tzi.org>
References: <cc3dc24d-3e13-e319-e48f-7b52ddd017d0@gmail.com> <00231270-86DF-4AD2-949E-25B04D518577@tzi.org> <20190902211744.GA7920@localhost>
To: Nico Williams <nico@cryptonector.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/prMvcesyZd4rZ-k_GPrTeAZVEb8>
Subject: Re: [Json] In "praise" of UTF-16

On Sep 2, 2019, at 23:17, Nico Williams <nico@cryptonector.com> wrote:
> 
> I'm loather to have to add
> support for transliterating to UTF-16 code for canonicalization
> purposes.

It’s not that bad.

If you have the code point cp, you can translate this to a comparison value by doing

(cp < 0x10000 ? cp << 16 : (cp << 6) + 0xD7C00000)

(Untested code alert.
This does not give you valid UTF-16, but a value that compares the same way the UTF-16 encoding of cp would.
I think.)
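
For what it's worth, here is a sketch of such a comparison key in C. It uses a shift of 6 for supplementary code points, i.e. 0xD8000000 + ((cp - 0x10000) << 6), so that all of them land between the keys of U+D7FF and U+E000 and the result fits in 32 bits. The function name utf16_order_key is made up for illustration, and the input is assumed to be a Unicode scalar value (no surrogates).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a Unicode scalar value to a 32-bit key whose
 * unsigned order matches the order of its UTF-16 code units.
 * The key is NOT valid UTF-16; it only compares the same way. */
static uint32_t utf16_order_key(uint32_t cp)
{
    /* Assumption: cp is a scalar value (in range, not a surrogate). */
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    if (cp < 0x10000)
        return cp << 16;            /* BMP: the code unit in the high half */

    /* Supplementary: equals 0xD8000000 + ((cp - 0x10000) << 6), which
     * ranges over 0xD8000000..0xDBFFFFC0, below the key of U+E000. */
    return (cp << 6) + 0xD7C00000u;
}
```

Comparing utf16_order_key(a) and utf16_order_key(b) as unsigned integers then puts U+10000..U+10FFFF after U+D7FF and before U+E000, which is where the surrogate pairs sit in UTF-16.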

A similar transform can be applied to UTF-8 with a little state machine, but I’m too tired to code this up right now.
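
A sketch of one way to get UTF-16 ordering straight from UTF-8, without going through code points. It is not a state machine: assuming both inputs are valid UTF-8, the first differing byte is either a continuation byte in both strings or a lead byte in both, so it is enough to rank the lead bytes 0xEE and 0xEF (U+E000..U+FFFF) above 0xF0..0xF4 (supplementary planes). The names utf8_byte_rank and utf8_cmp_utf16_order are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: rank a UTF-8 byte so that the lead bytes 0xEE and
 * 0xEF (U+E000..U+FFFF) sort above 0xF0..0xF4 (supplementary planes).
 * Continuation bytes (0x80..0xBF) are never remapped. */
static unsigned utf8_byte_rank(unsigned char b)
{
    return (b == 0xEE || b == 0xEF) ? b + 0x10u : b;
}

/* Compare two buffers in UTF-16 code unit order.
 * Assumption: both buffers hold valid UTF-8.
 * Returns <0, 0, or >0, like memcmp. */
static int utf8_cmp_utf16_order(const unsigned char *a, size_t alen,
                                const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;

    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            /* First difference decides; only lead bytes need the fix-up. */
            return utf8_byte_rank(a[i]) < utf8_byte_rank(b[i]) ? -1 : 1;
        }
    }
    /* One buffer is a prefix of the other: the shorter one sorts first. */
    return (alen > blen) - (alen < blen);
}
```

Everything below U+E000 already compares correctly bytewise, so the only change from plain memcmp is the remapping of the two lead bytes.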

Grüße, Carsten