Re: [Json] Unpaired surrogates in JSON strings

"Joe Hildebrand (jhildebr)" <jhildebr@cisco.com> Thu, 06 June 2013 17:49 UTC

From: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
To: Douglas Crockford <douglas@crockford.com>
Date: Thu, 06 Jun 2013 17:49:04 +0000
Message-ID: <A723FC6ECC552A4D8C8249D9E07425A70FC30833@xmb-rcd-x10.cisco.com>
In-Reply-To: <51B06F38.8050707@crockford.com>
Cc: Tim Bray <tbray@textuality.com>, Paul Hoffman <paul.hoffman@vpnc.org>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Unpaired surrogates in JSON strings

On 6/6/13 5:15 AM, "Douglas Crockford" <douglas@crockford.com> wrote:

>> That's not a code point.  That's half a surrogate pair for a code point
>> encoded in UTF16.  It's only the same in the BMP.
>>
>What then is the standard name for a 16-bit element of text? When
>JavaScript was created, that word was character. What is the word now?

Let's say rather that we understand Unicode better now than we did then.
When we only cared about the Basic Multilingual Plane (BMP), it was easy to
just say "wchar_t" and hope the compiler writer knew more than we did.
Technically, wchar_t was supposed to be big enough to hold any supported
code point, but in practice it was usually 16 bits.  This forced people
to start storing UTF-16 in wchar_t's when they needed to reach code points
outside the BMP.  This was a hack, but the Java, JavaScript, and .NET
folks went with it in order to make their implementations a little easier.
Some newer languages use UTF-8 as their internal String representation,
which works OK.  Note that counting code points correctly in either
encoding requires a full scan of the string, because both are
variable-width.  UTF-32 doesn't suffer from this, but it always requires
4 bytes per code point.
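
To make the counting point concrete, here is a rough sketch in Python.
The function names are made up for illustration and are not from any
library or spec.  The UTF-16 counter assumes its input is a list of
16-bit code units, and it counts an unpaired surrogate as one item of
its own, which is a choice I made for the sketch and not a rule.

```python
def count_code_points_utf16(units):
    """Count code points in a sequence of 16-bit UTF-16 code units."""
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # A high surrogate followed by a low surrogate is one code point.
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2
        else:
            # A BMP code unit, or an unpaired surrogate counted by itself.
            i += 1
        count += 1
    return count


def count_code_points_utf8(data):
    """Count code points in well-formed UTF-8 bytes."""
    # Every byte that is not a continuation byte (10xxxxxx) starts a
    # new code point, so we still have to look at every byte.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def to_utf16_units(text):
    """Hypothetical helper: split a Python str into UTF-16 code units."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


text = "a\u00e9\U0001F600"  # 'a', e-acute, and U+1F600 (outside the BMP)
units = to_utf16_units(text)
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")

print(len(units), count_code_points_utf16(units))  # code units vs. code points
print(len(utf8), count_code_points_utf8(utf8))     # bytes vs. code points
print(len(utf32) // 4)                             # UTF-32: just divide by 4
```

Both counters walk the whole input, while the UTF-32 count is a single
division.  That is the trade-off: constant-time counting in exchange for
a fixed 4 bytes per code point.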

-- 
Joe Hildebrand