Re: [Json] Encoding detection

John Cowan <cowan@mercury.ccil.org> Thu, 14 November 2013 18:30 UTC

Date: Thu, 14 Nov 2013 13:29:52 -0500
From: John Cowan <cowan@mercury.ccil.org>
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: Pete Cordell <petejson@codalogic.com>, JSON WG <json@ietf.org>, "www-tag@w3.org" <www-tag@w3.org>, "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>, Paul Hoffman <paul.hoffman@vpnc.org>
Subject: Re: [Json] Encoding detection

Henry S. Thompson scripsit:

> (There are, it has to be said, few Unicode characters whose UTF-16-L
> form is 00xx, i.e. U+xx00, the first code point on a code page --
> I had to hunt pretty hard to find the above specimen, which is in
> fact a slight cheat :-) Many code pages have a gap at the 00 point.

There are 68 of them on the Basic Multilingual Plane.  But many
characters in other planes are encoded with surrogate code units of
the same xx00 form.  For example, all of U+10000 to U+103FF are encoded
as D800 DC00 through D800 DFFF, so each begins with the code unit D800
(bytes 00 D8 in UTF-16LE).
Currently there are 622 characters in this range alone, and the number
will probably grow.
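
To make the arithmetic concrete, here is a minimal Python sketch of the
surrogate-pair calculation.  The function names are mine, not from any
spec.  The counting helper is only a rough check: it assumes "assigned"
means any general category other than Cn, Cs or Co, and its result depends
on the Unicode database shipped with the Python in use, so it need not
reproduce the figure of 68 above.

```python
import unicodedata


def utf16_units(cp):
    """Return the UTF-16 code units for code point cp as a list of ints."""
    if cp < 0x10000:
        return [cp]                      # BMP: one code unit
    v = cp - 0x10000                     # 20-bit offset into the supplementary planes
    return [0xD800 + (v >> 10),          # high (lead) surrogate
            0xDC00 + (v & 0x3FF)]        # low (trail) surrogate


def utf16le_bytes(cp):
    """Return the UTF-16LE byte sequence for cp, using the built-in codec."""
    return chr(cp).encode("utf-16-le")


def assigned_bmp_xx00():
    """List BMP code points U+xx00 (xx != 00) that look assigned.

    Assumption: unassigned (Cn), surrogate (Cs) and private-use (Co)
    code points are excluded; the original count may use other criteria.
    """
    return [cp for cp in range(0x0100, 0x10000, 0x100)
            if unicodedata.category(chr(cp)) not in ("Cn", "Cs", "Co")]


if __name__ == "__main__":
    for cp in (0x10000, 0x103FF):
        units = " ".join("%04X" % u for u in utf16_units(cp))
        print("U+%05X -> %s (LE bytes %s)" % (cp, units, utf16le_bytes(cp).hex()))
    print(len(assigned_bmp_xx00()), "assigned U+xx00 code points in this Python's UCD")
```

Every code point in U+10000..U+103FF has a zero in the top ten bits of
its offset, which is why the lead surrogate is always D800 and the
UTF-16LE byte stream starts with a 00 byte.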

> Not sure about the status of U+4E00, one variant of the ideograph for
> the numeral 1).

Google reports over 3 gigahits for this character.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit