Re: [Json] Canonicalization

"Manger, James H" <James.H.Manger@team.telstra.com> Wed, 20 February 2013 00:57 UTC


We should avoid the need to canonicalize JSON whenever possible, but there are enough efforts to define a canonical form that it would be worth standardizing one.

Strings:
Escaping is mandatory for control chars, double quote, and backslash (%x00-1F / %x22 / %x5C).
The simplest string canonicalization rule would be to escape those 34 chars and no others. A rule that might be slightly nicer for people (and hence worth the few extra lines of code) would be to always use the 7 two-character escapes (\" \\ \b \f \n \r \t) for the 7 chars that have one, always use \uxxxx for the rest of %x00-1F, and never use \uxxxx for any other chars. I would be tempted to drop \/, the eighth two-character escape that JSON allows, from the list (only in the canonicalization rules), because / is so common in URIs etc.
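
A minimal Python sketch of the "nicer" string rule above. The function name canonical_string is hypothetical, and the lowercase hex digits in \uxxxx are an assumption, since nothing above settles the case.

```python
# Hypothetical helper illustrating the string rule sketched above.
# Assumption: \uxxxx uses lowercase hex digits.
SHORT_ESCAPES = {
    '"': '\\"',
    '\\': '\\\\',
    '\b': '\\b',
    '\f': '\\f',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
}


def canonical_string(s: str) -> str:
    """Return s as a JSON string literal using only the mandatory escapes."""
    out = ['"']
    for ch in s:
        if ch in SHORT_ESCAPES:
            # One of the 7 chars that has a two-character escape.
            out.append(SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining control chars get \uxxxx.
            out.append('\\u%04x' % ord(ch))
        else:
            # Everything else, including "/", is emitted literally.
            out.append(ch)
    out.append('"')
    return ''.join(out)


print(canonical_string('a"b\n'))
print(canonical_string('http://example.com/'))
```

Note that "/" passes through unescaped, which is the effect of leaving \/ out of the canonical rules.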

Objects:
Sorting string names to canonicalize an object needs a few more words.
Presumably sorting occurs on the logical strings, not the (canonically) encoded versions. So {"a\"b":1,"a#b":2} is in canonical form ("(U+22) < #(U+23) < \(U+5C)).
Presumably sorting uses Unicode code points, not UTF-16 code units or UTF-8 bytes. So {"\uFFE0\uFFE1":3,"\uD834\uDD1E":4} is in the right order (0xFFE0 < 0x01D11E), though the canonical form would use UTF-8, not \uxxxx, for the 3 characters.
[The 21-byte canonical form would be (in hex):
7B 22 EFBFA0 EFBFA1 22 3A 33 2C 22 F09D849E 22 3A 34 7D]
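
The ordering and the byte count can be checked with a short Python sketch. The names canonical_object and utf16_order are hypothetical. The sketch handles only a flat object with integer values and borrows json.dumps for string quoting, which is an assumption rather than part of any proposed rule. Python compares str values by code point, so plain sorted() gives the intended order; comparing UTF-8 bytes would give the same order, while UTF-16 code units would not.

```python
import json


def canonical_object(obj: dict) -> bytes:
    """Serialize a flat object with names sorted by Unicode code point."""
    # sorted() on str compares code point by code point.
    names = sorted(obj)
    members = [
        # ensure_ascii=False keeps non-ASCII chars as themselves (UTF-8 later).
        json.dumps(name, ensure_ascii=False) + ":" + json.dumps(obj[name])
        for name in names
    ]
    return ("{" + ",".join(members) + "}").encode("utf-8")


def utf16_order(names):
    """For contrast: order the names by UTF-16 code units instead."""
    # Big-endian UTF-16 bytes compare the same way as the 16-bit code units.
    return sorted(names, key=lambda n: n.encode("utf-16-be"))


example = {"\U0001D11E": 4, "\uFFE0\uFFE1": 3}
encoded = canonical_object(example)
print(len(encoded), encoded.hex())
# 21 7b22efbfa0efbfa1223a332c22f09d849e223a347d
```

Sorting the same two names with utf16_order puts U+1D11E first, because its leading surrogate 0xD834 is less than 0xFFE0.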

Numbers:
0, 1, 1e3, 2.334e-5, 1.5e6 are the sort of canonical form numbers should have. I don’t like 0.0E0 for zero as per draft-staykov-hu-json-canonical-form-00 -- a person would never write that. That draft also allows any number of trailing 0’s (e.g. 1.200000e-2), which is a bug. A canonical form should drop the exponent when it is zero, and drop the decimal point when there is nothing after it.
A regex for numbers in canonical form:

0|-?[1-9](\.[0-9]*[1-9])?(e-?[1-9][0-9]*)?
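
A quick way to try the regex, as a Python sketch. The name is_canonical_number is hypothetical; the pattern is copied unchanged from above and applied with fullmatch so that the whole token has to match.

```python
import re

# Pattern copied verbatim from the message above.
CANONICAL_NUMBER = re.compile(r'0|-?[1-9](\.[0-9]*[1-9])?(e-?[1-9][0-9]*)?')


def is_canonical_number(text: str) -> bool:
    """True if the whole token matches the canonical-number regex."""
    return CANONICAL_NUMBER.fullmatch(text) is not None


for token in ["0", "1", "1e3", "2.334e-5", "1.5e6",
              "0.0E0", "1.200000e-2", "1e0", "1.", "10", "-0"]:
    print(token, is_canonical_number(token))
```

The first five tokens are accepted and the rest are rejected. As written, the regex allows exactly one digit before the decimal point, so 10 is not canonical and would be written 1e1.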


So draft-staykov-hu-json-canonical-form needs a few changes in my view, but is as good a starting point as any.

--
James Manger