Re: [Json] A possible summary of the discussion so far on code points and characters

R S <sayrer@gmail.com> Sun, 09 June 2013 00:24 UTC

In-Reply-To: <CA+mHimPdoN0vf8c3AzYrZ8HXgPbUJPkvViwU4iWrcZBBKJRmNg@mail.gmail.com>
References: <AF793CAF-B30B-44A7-B864-82CEF79EA34D@vpnc.org> <CAChr6SwLDCUk0DC9pGTKqUu_V5vJHvs7Sgv4EneTJMryn1iKSA@mail.gmail.com> <CA+mHimPdoN0vf8c3AzYrZ8HXgPbUJPkvViwU4iWrcZBBKJRmNg@mail.gmail.com>
Date: Sat, 8 Jun 2013 17:23:57 -0700
Message-ID: <CAChr6SyM0ERZ6bqEbG4ULDZx-MsKo8sx-9WB5sVLFyONm++kbQ@mail.gmail.com>
From: R S <sayrer@gmail.com>
To: Stephen Dolan <stephen.dolan@cl.cam.ac.uk>
Content-Type: multipart/alternative; boundary=047d7ba9751807527804deadaeb2
Cc: Paul Hoffman <paul.hoffman@vpnc.org>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] A possible summary of the discussion so far on code points and characters

On Sat, Jun 8, 2013 at 2:11 PM, Stephen Dolan <stephen.dolan@cl.cam.ac.uk> wrote:

> On Sat, Jun 8, 2013 at 9:52 PM, R S <sayrer@gmail.com> wrote:
> > A seventh point of view, which I happen to agree with: JSON strings are a
> > sequence of code units.
> >
> > This is similar to the definition of 'source text' in ECMAScript:
> >
> > "ECMAScript source text is assumed to be a sequence of 16-bit code units for
> > the purposes of this specification. Such a source text may include sequences
> > of 16-bit code units that are not valid UTF-16 character encodings."
>
> That's a very out-of-context quote. The linked document states:
>
> "ECMAScript source text is represented as a sequence of characters in
> the Unicode character encoding, version 3.0 or later."
>
> It then gives your quote, and states "If an actual source text is
> encoded in a form other than 16-bit code units it must be processed as
> if it was first convert [sic] to UTF-16". It seems like UTF-16 is a
> convenient way to frame the document, rather than a requirement of the
> specification.


It's a requirement. Here are some additional references:

<http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings&rev=1305822947>
<https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html>

The paragraph following the one I cited:

'Throughout the rest of this document, the phrase “code unit” and the word
“character” will be used to refer to a 16-bit unsigned value used to
represent a single 16-bit unit of text. The phrase “Unicode character” will
be used to refer to the abstract linguistic or typographical unit
represented by a single Unicode scalar value (which may be longer than 16
bits and thus may be represented by more than one code unit). The phrase
“code point” refers to such a Unicode scalar value. “Unicode character”
only refers to entities represented by single Unicode scalar values: the
components of a combining character sequence are still individual “Unicode
characters,” even though a user might think of the whole sequence as a
single character.' <http://es5.github.io/x6.html>
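
To make the code unit / code point distinction concrete, here is a minimal
Python sketch. It is not taken from the ES5 text or from this thread; the
helper name utf16_code_units is hypothetical, and the behaviour shown is that
of CPython's standard json module, which accepts an unpaired \uD83D escape
and yields a string holding that lone surrogate.

```python
import json


def utf16_code_units(s):
    """Return the 16-bit code units that represent s in UTF-16 (hypothetical helper)."""
    # "surrogatepass" lets unpaired surrogates through instead of raising.
    raw = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# A surrogate pair: one code point (U+1F600), two 16-bit code units.
pair = json.loads('"\\ud83d\\ude00"')

# A lone high surrogate: one code unit, but not a valid UTF-16 sequence.
lone = json.loads('"\\ud83d"')

pair_units = utf16_code_units(pair)
lone_units = utf16_code_units(lone)

# Strict UTF-8 encoding rejects the lone surrogate, since it is not a
# Unicode scalar value.
try:
    lone.encode("utf-8")
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False
```

Under the "sequence of code units" reading, both strings are fine: the first
is two code units and the second is one. Under a "sequence of Unicode
characters" reading, only the first is well formed.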

- Rob