Re: [Json] Proposal for strings/Unicode text

John Cowan <> Thu, 13 June 2013 18:20 UTC

Date: Thu, 13 Jun 2013 14:19:55 -0400
From: John Cowan <>
To: Tim Bray <>
Cc: Bjoern Hoehrmann <>, "" <>

Tim Bray scripsit:

> Why?  UTF-8 is perfectly capable of representing those integers.  Yes, the
> spec says that You Shouldn’t Do That, but it says the same thing about
> unpaired surrogates in UTF-16.  

If this is going to be just another informational RFC, then of course
we can do what we want.  But if it's a standards-track RFC, it has to
play nicely with other standards-track RFCs.  And RFC 3629 aka STD 63 is
not just standards-track, it's an Internet Standard, it's ten years old,
and it says uncompromisingly:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.  CESU-8 operates similarly to UTF-8 but encodes
   the UTF-16 code values (16-bit quantities) instead of the character
   number (code point).  This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8.

The Unicode definition today is the same, though in the past it's been
more wishy-washy. (CESU-8, BTW, is the official name for Oracle's and
MySQL's "UTF-8" encoding for database strings: the real thing is called
"AL32UTF8" and "utf8mb4" in Oracle and MySQL respectively.)
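
To make the difference the RFC text describes concrete, here is a minimal
Python sketch. The helper name cesu8_encode is hypothetical (Python ships
no CESU-8 codec), and the sample character U+1F600 is only an illustration.
It encodes each UTF-16 code unit separately, which is what CESU-8 does,
and compares the result with real UTF-8:

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit as its own
    1-3 byte sequence, the way CESU-8 does. Assumes well-formed input."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Surrogate code units (0xD800-0xDFFF) land here and get
            # three bytes each, which RFC 3629 forbids in UTF-8.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


ch = "\U0001F600"               # a character above U+FFFF
utf8 = ch.encode("utf-8")       # four bytes: f0 9f 98 80
cesu8 = cesu8_encode(ch)        # six bytes:  ed a0 bd ed b8 80
print(utf8.hex(" "), "|", cesu8.hex(" "))
```

For characters at or below U+FFFF the two outputs are identical; they
diverge only for the supplementary planes, as the quoted paragraph says.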

> This will break lots of things, not just UTF-8 decoders (most of which,
> I bet, will never actually notice).  -T

Modern ones that pay attention to spoofing most definitely will.
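
As one illustration (a sketch using Python 3's built-in codec, not a claim
about any particular JSON library), a strict UTF-8 decoder refuses the
three-byte pattern for U+D800 unless it is explicitly told to let
surrogates through. The helper name strict_utf8_accepts is hypothetical:

```python
lone_surrogate_bytes = b"\xed\xa0\x80"   # 3-byte pattern for U+D800


def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if Python's default (strict) UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


rejected = not strict_utf8_accepts(lone_surrogate_bytes)

# The "surrogatepass" error handler is an explicit opt-in to leniency.
lenient = lone_surrogate_bytes.decode("utf-8", errors="surrogatepass")
print(rejected, repr(lenient))
```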

John Cowan                         
        "You need a change: try Canada"  "You need a change: try China"
                --fortune cookies opened by a couple that I know