[Json] In "praise" of UTF-16

Anders Rundgren <anders.rundgren.net@gmail.com> Sat, 31 August 2019 15:39 UTC

From: Anders Rundgren <anders.rundgren.net@gmail.com>
To: "json@ietf.org" <json@ietf.org>
Message-ID: <cc3dc24d-3e13-e319-e48f-7b52ddd017d0@gmail.com>
Date: Sat, 31 Aug 2019 17:39:21 +0200
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/_Y7recPyZM7UvCkBSz4qXFNGZX4>
Subject: [Json] In "praise" of UTF-16

Hi JSON experts,
Pardon the subject line; I'm by no means a UTF-16 aficionado.

That UTF-16 has been deprecated by the industry at large for EXTERNAL representation of textual data is completely understandable.

However, an I-D dealing with canonical JSON serialization, currently in the IETF ISE queue, has been criticized for using UTF-16 encoding INTERNALLY for sorting properties/keys.
I don't see why, since the only purpose of the sorting is to create a defined order. It is true that sorting on UTF-8 or UTF-32 would give a different result, but for the stated purpose that is of no importance.
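
To make the ordering difference concrete, here is a minimal Python sketch. It is only an illustration and is not taken from the I-D; the helper name utf16_key and the sample keys are made up for this example, and it assumes well-formed strings (no lone surrogates).

```python
def utf16_key(s):
    """Return the UTF-16 code units of s as a list of integers (hypothetical helper)."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Sample property names: U+1F600 (outside the BMP), U+FB33 (inside the BMP), and "a".
keys = ["\U0001F600", "\uFB33", "a"]

# Sorting on UTF-16 code units: U+1F600 becomes D83D DE00, which sorts before FB33.
by_utf16 = sorted(keys, key=utf16_key)

# Python compares str values by code point, which matches UTF-32 (and UTF-8 byte) order.
by_code_point = sorted(keys)

print([ascii(k) for k in by_utf16])
print([ascii(k) for k in by_code_point])
```

Both orders are fully defined and repeatable. They differ only in where characters above U+FFFF land relative to U+E000..U+FFFF.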

In addition, JSON itself also depends on UTF-16 encoding for \uhhhh escapes, and AFAIK nobody has complained about that.
Example: A smiley Emoji has the Unicode value U+1F600 but would in a JSON escape sequence be represented as \ud83d\ude00
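
The arithmetic behind that escape sequence can be sketched in Python as follows. The function name json_escape is hypothetical and chosen only for this illustration; it assumes the input is a valid Unicode scalar value.

```python
import json

def json_escape(code_point):
    """Return the JSON \\uhhhh escape for a code point (hypothetical helper)."""
    if code_point < 0x10000:
        # Characters in the BMP need a single escape.
        return "\\u%04x" % code_point
    # Characters above U+FFFF are split into a UTF-16 surrogate pair.
    v = code_point - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return "\\u%04x\\u%04x" % (high, low)

escaped = json_escape(0x1F600)
print(escaped)

# A JSON parser reassembles the pair into the single character U+1F600.
decoded = json.loads('"' + escaped + '"')
```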

The reason for preferring UTF-16 in this particular case is simply that JavaScript, Windows and Java use UTF-16 as their internal representation. That's obviously a slight platform bias, but my Go and Python implementations show that the UTF-16 requirement is in practice a non-issue.

According to the Unicode standard, UTF-16 belongs to the set of supported, fully interchangeable encodings.

WDYT?

thanx,
Anders
https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-06