Re: [Json] Proposal for strings/Unicode text

John Cowan <cowan@mercury.ccil.org> Thu, 13 June 2013 18:20 UTC

Date: Thu, 13 Jun 2013 14:19:55 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Tim Bray <tbray@textuality.com>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text

Tim Bray scripsit:

> Why?  UTF-8 is perfectly capable of representing those integers.  Yes, the
> spec says that You Shouldn’t Do That, but it says the same thing about
> unpaired surrogates in UTF-16.  

If this is going to be just another informational RFC, then of course
we can do what we want.  But if it's a standards-track RFC, it has to
play nicely with other standards-track RFCs.  And RFC 3629 aka STD 63 is
not just standards-track, it's an Internet Standard, it's ten years old,
and it says uncompromisingly:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   use on the Internet.  CESU-8 operates similarly to UTF-8 but encodes
   the UTF-16 code values (16-bit quantities) instead of the character
   number (code point).  This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8.

The Unicode definition today is the same, though in the past it's been
more wishy-washy. (CESU-8, BTW, is the official name for Oracle's and
MySQL's "UTF-8" encoding for database strings: the real thing is called
"AL32UTF8" and "utf8mb4" in Oracle and MySQL respectively.)

> This will break lots of things, not just UTF-8 decoders (most of which,
> I bet, will never actually notice).  -T

Modern ones that pay attention to spoofing most definitely will.
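
As a rough illustration of the RFC 3629 passage quoted above, here is a
minimal Python sketch. The helper names cesu8_encode and is_strict_utf8
are made up for this example and are not part of any library; the only
assumption is CPython 3's built-in "utf-8" codec, which rejects encoded
surrogates unless the "surrogatepass" error handler is requested.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text the CESU-8 way.

    Code points above 0xFFFF are first split into a UTF-16 surrogate
    pair, and each surrogate is then written as a 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the bytes that
                # RFC 3629 forbids, which is exactly what CESU-8 does.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def is_strict_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00010400"                 # a character number above 0xFFFF
real_utf8 = ch.encode("utf-8")    # one 4-byte sequence
cesu8 = cesu8_encode(ch)          # two 3-byte surrogate sequences

print(real_utf8.hex(), is_strict_utf8(real_utf8))
print(cesu8.hex(), is_strict_utf8(cesu8))
# An unpaired surrogate (U+D800) encoded on its own is rejected too.
print(is_strict_utf8(b"\xed\xa0\x80"))
```

In this sketch the strict decoder accepts the 4-byte form and rejects
both the CESU-8 form and the lone surrogate, which is the behaviour a
decoder that follows RFC 3629 would be expected to show.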

-- 
John Cowan                                   cowan@ccil.org
        "You need a change: try Canada"  "You need a change: try China"
                --fortune cookies opened by a couple that I know