Re: [Json] Proposal for strings/Unicode text

John Cowan <cowan@mercury.ccil.org> Thu, 13 June 2013 18:20 UTC

Date: Thu, 13 Jun 2013 14:19:55 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Tim Bray <tbray@textuality.com>
Message-ID: <20130613181955.GH29284@mercury.ccil.org>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com> <ED62F638-C0C4-411D-BA5B-EB9BA71EDB75@lindenbergsoftware.com> <20130613003213.GA26989@mercury.ccil.org> <jr5jr85h6pig2cr9id5hf1eh586g0u09i7@hive.bjoern.hoehrmann.de> <20130613121620.GB11739@mercury.ccil.org> <CAHBU6ismp6HZqUQOgDnjBRYtC5jFCzhTB3RFG8Ms7qohz+w1eg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAHBU6ismp6HZqUQOgDnjBRYtC5jFCzhTB3RFG8Ms7qohz+w1eg@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: John Cowan <cowan@ccil.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text
Precedence: list

Tim Bray scripsit:

> Why?  UTF-8 is perfectly capable of representing those integers.  Yes, the
> spec says that You Shouldn’t Do That, but it says the same thing about
> unpaired surrogates in UTF-16.  

If this is going to be just another informational RFC, then of course
we can do what we want.  But if it's a standards-track RFC, it has to
play nicely with other standards-track RFCs.  And RFC 3629 aka STD 63 is
not just standards-track, it's an Internet Standard, it's ten years old,
and it says uncompromisingly:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.  CESU-8 operates similarly to UTF-8 but encodes
   the UTF-16 code values (16-bit quantities) instead of the character
   number (code point).  This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8.

The Unicode definition today is the same, though in the past it's been
more wishy-washy. (CESU-8, BTW, is the official name for what Oracle's
"UTF8" database character set actually implements; MySQL's "utf8" is
likewise not full UTF-8, since it stops at three bytes per character.
The real thing is called "AL32UTF8" and "utf8mb4" in Oracle and MySQL
respectively.)

> This will break lots of things, not just UTF-8 decoders (most of which,
> I bet, will never actually notice).  -T

Modern ones that pay attention to spoofing most definitely will.
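
As a rough illustration, Python 3's built-in codec is one such strict
decoder. The sketch below uses a hypothetical helper, is_strict_utf8,
and two hand-picked byte strings; it is an example of the behaviour,
not a survey of decoders:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_surrogate = b"\xed\xa0\x80"   # three-byte form of U+D800
overlong_slash = b"\xc0\xaf"       # overlong two-byte form of "/"

print(is_strict_utf8(lone_surrogate))    # False
print(is_strict_utf8(overlong_slash))    # False
print(is_strict_utf8("é".encode()))      # True

# A lenient decoder has to be asked for explicitly:
leaked = lone_surrogate.decode("utf-8", "surrogatepass")
```

The overlong case is the classic spoofing vector; a decoder careful
enough to reject it will normally reject encoded surrogates as well.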

-- 
John Cowan                                   cowan@ccil.org
        "You need a change: try Canada"  "You need a change: try China"
                --fortune cookies opened by a couple that I know
