Re: [Json] Proposal for strings/Unicode text

Bjoern Hoehrmann <derhoermi@gmx.net> Thu, 13 June 2013 10:15 UTC

From: Bjoern Hoehrmann <derhoermi@gmx.net>
To: John Cowan <cowan@mercury.ccil.org>
Date: Thu, 13 Jun 2013 12:15:18 +0200
Cc: "json@ietf.org" <json@ietf.org>

* John Cowan wrote:
>Officially, yes.  But surrogate code points cannot be inserted directly
>if the representation is UTF-8 (otherwise it becomes CESU-8 instead)
>or UTF-16 (otherwise it is broken UTF-16) or random non-Unicode encodings.
>So UTF-32 is the only encoding into which a surrogate code point can be
>inserted directly -- and nobody uses it.

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

  D31 UTF-32 encoding form:

  The Unicode encoding form which assigns each Unicode scalar value to a
  single unsigned 32-bit code unit with the same numeric value as the
  Unicode scalar value.

    • In UTF-32, the code point sequence <004D, 0430, 4E8C, 10302> is
      represented as <0000004D 00000430 00004E8C 00010302>.

    • Because surrogate code points are not included in the set of 
      Unicode scalar values, UTF-32 code units in the range
      0000D800_16 .. 0000DFFF_16 are ill-formed.

    • Any UTF-32 code unit greater than 0010FFFF_16 is ill-formed.

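Note the second bullet point: surrogate code points are ill-formed in
UTF-32 as well, so they cannot be inserted directly there either.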
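
The constraints in D31 can be written as a small check. What follows is
only a minimal sketch in Python; the function names are illustrative
assumptions and do not come from the Unicode Standard or any library.

```python
from typing import Iterable


def is_unicode_scalar_value(code_unit: int) -> bool:
    """Return True if the integer is a Unicode scalar value.

    Scalar values are the code points 0..10FFFF excluding the
    surrogate range D800..DFFF (hypothetical helper for illustration).
    """
    in_code_space = 0 <= code_unit <= 0x10FFFF
    is_surrogate = 0xD800 <= code_unit <= 0xDFFF
    return in_code_space and not is_surrogate


def is_well_formed_utf32(code_units: Iterable[int]) -> bool:
    """Return True if every 32-bit code unit is a Unicode scalar value,
    which is the D31 condition for well-formed UTF-32."""
    return all(is_unicode_scalar_value(u) for u in code_units)
```

With this sketch, the sequence from the first bullet,
[0x4D, 0x430, 0x4E8C, 0x10302], passes the check, while [0xD800] and
[0x110000] are rejected, matching the second and third bullets.
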
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/