Re: [Json] Encoding detection
ht@inf.ed.ac.uk (Henry S. Thompson) Thu, 14 November 2013 17:17 UTC
Return-Path: <ht@inf.ed.ac.uk>
X-Original-To: json@ietfa.amsl.com
Delivered-To: json@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 5D85121E8098 for <json@ietfa.amsl.com>; Thu, 14 Nov 2013 09:17:50 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.986
X-Spam-Level:
X-Spam-Status: No, score=-6.986 tagged_above=-999 required=5 tests=[AWL=-0.387, BAYES_00=-2.599, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id P8pZpvuwZXGS for <json@ietfa.amsl.com>; Thu, 14 Nov 2013 09:17:46 -0800 (PST)
Received: from nougat.ucs.ed.ac.uk (nougat.ucs.ed.ac.uk [129.215.13.205]) by ietfa.amsl.com (Postfix) with ESMTP id 3F73C21E8056 for <json@ietf.org>; Thu, 14 Nov 2013 09:17:42 -0800 (PST)
Received: from crunchie.inf.ed.ac.uk (crunchie.inf.ed.ac.uk [129.215.33.180]) by nougat.ucs.ed.ac.uk (8.13.8/8.13.4) with ESMTP id rAEHHPLx000570; Thu, 14 Nov 2013 17:17:25 GMT
Received: from troutbeck.inf.ed.ac.uk (troutbeck.inf.ed.ac.uk [129.215.25.32]) by crunchie.inf.ed.ac.uk (8.14.4/8.14.4) with ESMTP id rAEHHNJc016176; Thu, 14 Nov 2013 17:17:23 GMT
Received: from troutbeck.inf.ed.ac.uk (localhost [127.0.0.1]) by troutbeck.inf.ed.ac.uk (8.14.4/8.14.4) with ESMTP id rAEHHOLk011035; Thu, 14 Nov 2013 17:17:24 GMT
Received: (from ht@localhost) by troutbeck.inf.ed.ac.uk (8.14.4/8.14.4/Submit) id rAEHHNbu011031; Thu, 14 Nov 2013 17:17:23 GMT
X-Authentication-Warning: troutbeck.inf.ed.ac.uk: ht set sender to ht@inf.ed.ac.uk using -f
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
References: <CEAA3067.2D132%jhildebr@cisco.com>
From: ht@inf.ed.ac.uk
Date: Thu, 14 Nov 2013 17:17:23 +0000
In-Reply-To: <CEAA3067.2D132%jhildebr@cisco.com> (Joe Hildebrand's message of "Thu\, 14 Nov 2013 14\:59\:57 +0000")
Message-ID: <f5bbo1mzyvw.fsf@troutbeck.inf.ed.ac.uk>
User-Agent: Gnus/5.101 (Gnus v5.10.10) XEmacs/21.5-b33 (linux)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Edinburgh-Scanned: at nougat.ucs.ed.ac.uk with MIMEDefang 2.60, Sophie, Sophos Anti-Virus, Clam AntiVirus
X-Scanned-By: MIMEDefang 2.60 on 129.215.13.205
Cc: "www-tag@w3.org" <www-tag@w3.org>, Paul Hoffman <paul.hoffman@vpnc.org>, Pete Cordell <petejson@codalogic.com>, JSON WG <json@ietf.org>
Subject: Re: [Json] Encoding detection
X-BeenThere: json@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: "JavaScript Object Notation \(JSON\) WG mailing list" <json.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/json>, <mailto:json-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/json>
List-Post: <mailto:json@ietf.org>
List-Help: <mailto:json-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/json>, <mailto:json-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Nov 2013 17:17:50 -0000
Joe Hildebrand (jhildebr) writes: > On 11/14/13 5:04 AM, "Pete Cordell" <petejson@codalogic.com> wrote: > >> In http://www.ietf.org/mail-archive/web/json/current/msg00565.html I >>mentioned that we also need to allow for characters such as U+2c00 to be >>the >>first character in a quoted string. > > Ah, yes. Sorry, I quoted from the wrong part of the conversation. I > completely agree. > >>That can be reduced a bit if we use "--" to indicate "not-tested": >> >> 00 00 -- -- UTF-32BE >> 00 xx -- -- UTF-16BE >> xx 00 00 00 UTF-32LE >> xx 00 00 xx UTF-16LE >> xx 00 xx -- UTF-16LE >> xx xx -- -- UTF-8 > > +1 to this table. It's clear, correct, and implementable. Doesn't work if you want to allow a pre-document BOM. But I think the following enlarged grid does work, where "--" now means "not any value(s) explicitly tested for with an 'earlier' shared prefix": 00 00 FE FF UTF-32 (BE) w. BOM [00 00 FF FE UCS-4 w. BOM, unusual octet order (2143)] 00 00 -- -- UTF-32BE 00 xx -- -- UTF-16BE FF FE 00 00 UTF-32 (LE) w. BOM [FE FF 00 00 UCS-4 w. BOM, unusual octet order (3412)] FE FF -- -- UTF-16 (BE) w. BOM FF FE -- -- UTF-16 (LE) w. BOM xx 00 00 00 UTF-32LE xx 00 00 xx UTF-16LE [Note that the following algorithm disagrees here] xx 00 xx -- UTF-16LE EF BB BF -- UTF-8 w. BOM xx xx -- -- UTF-8 A possibly simpler algorithm, which has the same outcome I think, minus the unusual octet cases and the exception noted above, is used by the Python requests package [1] for JSON charset detection. Schematically, this works as follows: FF FE 00 00 UTF-32 (LE) w. BOM 00 00 FE FF UTF-32 (BE) w. BOM EF BB BF UTF-8 w. BOM FE FF UTF-16 (BE) w. BOM FF FE UTF-16 (LE) w. BOM Now count 00 bytes in first 4: 0 UTF-8 2 00 -- 00 -- UTF-16 (BE) -- 00 -- 00 UTF-16 (LE) 3 00 00 00 -- UTF-32 (BE) -- 00 00 00 UTF-32 (LE) error I _think_ requests is only correct because it assumes "JSON always starts with two ASCII characters", depending on 4627, i.e. continuing to rule out e.g. "Ѐкземпияр" To accommodate this case, we would need to add 1 -- 00 -- -- UTF-16 (LE) 2 -- 00 00 -- UTF-16 (LE) to the requests algorithm. (There are, it has to be said, few Unicode characters whose UTF-16-L form is 00xx, i.e. U+xx00, the first code point on a code page -- I had to hunt pretty hard to find the above specimen, which is in fact a slight cheat :-) Many code pages have a gap at the 00 point. Not sure about the status of U+4E00, one variant of the ideograph for the numeral 1). ht [1] http://docs.python-requests.org/en/latest/ -- Henry S. Thompson, School of Informatics, University of Edinburgh 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ [mail from me _always_ has a .sig like this mail without it is forged spam]
- Re: [Json] JSON: remove gap between Ecma-404 and … Paul Hoffman
- Re: [Json] JSON: remove gap between Ecma-404 and … Joe Hildebrand (jhildebr)
- Re: [Json] JSON: remove gap between Ecma-404 and … Carsten Bormann
- Re: [Json] JSON: remove gap between Ecma-404 and … Joe Hildebrand (jhildebr)
- Re: [Json] JSON: remove gap between Ecma-404 and … Tony Hansen
- Re: [Json] JSON: remove gap between Ecma-404 and … Paul Hoffman
- Re: [Json] JSON: remove gap between Ecma-404 and … Allen Wirfs-Brock
- Re: [Json] JSON: remove gap between Ecma-404 and … Carsten Bormann
- Re: [Json] JSON: remove gap between Ecma-404 and … Joe Hildebrand
- Re: [Json] JSON: remove gap between Ecma-404 and … Joe Hildebrand (jhildebr)
- Re: [Json] JSON: remove gap between Ecma-404 and … Paul Hoffman
- Re: [Json] JSON: remove gap between Ecma-404 and … John Cowan
- Re: [Json] JSON: remove gap between Ecma-404 and … Joe Hildebrand (jhildebr)
- Re: [Json] JSON: remove gap between Ecma-404 and … Mark Davis ☕
- Re: [Json] JSON: remove gap between Ecma-404 and … John Cowan
- Re: [Json] JSON: remove gap between Ecma-404 and … Allen Wirfs-Brock
- Re: [Json] JSON: remove gap between Ecma-404 and … Paul Hoffman
- Re: [Json] Encoding detection John Cowan
- Re: [Json] JSON: remove gap between Ecma-404 and … Martin J. Dürst
- [Json] Encoding detection (Was: Re: JSON: remove … Pete Cordell
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Joe Hildebrand (jhildebr)
- Re: [Json] JSON: remove gap between Ecma-404 and … John Cowan
- Re: [Json] JSON: remove gap between Ecma-404 and … Henry S. Thompson
- Re: [Json] JSON: remove gap between Ecma-404 and … Allen Wirfs-Brock
- Re: [Json] JSON: remove gap between Ecma-404 and … Allen Wirfs-Brock
- Re: [Json] Encoding detection Henry S. Thompson
- Re: [Json] Encoding detection John Cowan
- Re: [Json] Encoding detection David Booth
- Re: [Json] Encoding detection Martin J. Dürst
- Re: [Json] Encoding detection Henry S. Thompson
- Re: [Json] Encoding detection Pete Cordell
- Re: [Json] Encoding detection Paul Hoffman
- Re: [Json] Encoding detection Bjoern Hoehrmann
- Re: [Json] Encoding detection Paul Hoffman
- [Json] BOMs (Was: Re: JSON: remove gap between Ec… Pete Cordell
- Re: [Json] Encoding detection Pete Cordell
- Re: [Json] BOMs Henry S. Thompson
- Re: [Json] BOMs Martin J. Dürst
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] BOMs Henry S. Thompson
- Re: [Json] BOMs Pete Cordell
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… Tim Bray
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… Pete Cordell
- Re: [Json] BOMs John Cowan
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… Julian Reschke
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… John Cowan
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… Pete Cordell
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… Pete Cordell
- Re: [Json] BOMs (Was: Re: JSON: remove gap betwee… John Cowan
- Re: [Json] BOMs Tatu Saloranta
- Re: [Json] BOMs Phillip Hallam-Baker
- Re: [Json] BOMs Martin J. Dürst
- Re: [Json] BOMs Martin J. Dürst
- Re: [Json] BOMs Pete Cordell
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] BOMs t.p.
- Re: [Json] BOMs Allen Wirfs-Brock
- Re: [Json] BOMs Tatu Saloranta
- Re: [Json] BOMs Henry S. Thompson
- Re: [Json] BOMs Joe Hildebrand (jhildebr)
- Re: [Json] BOMs Chris Lilley
- Re: [Json] BOMs Pete Cordell
- Re: [Json] BOMs Henry S. Thompson
- Re: [Json] BOMs Martin J. Dürst
- Re: [Json] BOMs Henry S. Thompson
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] BOMs Allen Wirfs-Brock
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Anne van Kesteren
- Re: [Json] BOMs Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Joe Hildebrand (jhildebr)
- Re: [Json] BOMs Allen Wirfs-Brock
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] BOMs Mathias Bynens
- Re: [Json] BOMs John Cowan
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] BOMs Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Pete Cordell
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Julian Reschke
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Robin Berjon
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Tim Bray
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Allen Wirfs-Brock
- [Json] Consensus on JSON-text (WAS: JSON: remove … Matt Miller (mamille2)
- Re: [Json] BOMs Matt Miller (mamille2)
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Matt Miller (mamille2)
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Julian Reschke
- Re: [Json] BOMs Bjoern Hoehrmann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Bjoern Hoehrmann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Matt Miller (mamille2)
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Pete Cordell
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Pete Cordell
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Martin J. Dürst
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Allen Wirfs-Brock
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Bjoern Hoehrmann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Jacob Davies
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Julian Reschke
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Carsten Bormann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Carsten Bormann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Bjoern Hoehrmann
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Nico Williams
- Re: [Json] Encoding detection (Was: Re: JSON: rem… Henri Sivonen
- Re: [Json] Encoding detection (Was: Re: JSON: rem… John Cowan
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Tim Bray
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Alex Russell
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Paul Hoffman
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… John Cowan
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Carsten Bormann
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Tim Bray
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Phillip Hallam-Baker
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Joe Hildebrand (jhildebr)
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Carsten Bormann
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Tim Berners-Lee
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Carsten Bormann
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Allen Wirfs-Brock
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Noah Mendelsohn
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Phillip Hallam-Baker
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Ashok Malhotra
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Nico Williams
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Phillip Hallam-Baker
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Phillip Hallam-Baker
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… John Cowan
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Alex Russell
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… Nico Williams
- Re: [Json] Consensus on JSON-text (WAS: JSON: rem… John Cowan