Re: [Json] On characters and code points

Stephen Dolan <stephen.dolan@cl.cam.ac.uk> Fri, 07 June 2013 16:09 UTC

To: Paul Hoffman <paul.hoffman@vpnc.org>
Cc: "json@ietf.org" <json@ietf.org>

I think it is useful to distinguish three cases of codepoint:
 (1) Those which are valid characters in a particular Unicode revision
 (2) Those which are unallocated codepoints which may become valid
characters in a later Unicode revision
 (3) The noncharacter codepoints which will never be valid

(3) includes such beasts as U+FFFE (which you can only get by reading
a UTF-16 byte order mark with the wrong byte order). The set (1)
grows with every Unicode revision as it absorbs code points from (2),
but (3) is stable (see
http://unicode.org/policies/stability_policy.html).
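
To make the three cases concrete, here is a rough sketch in Python. The
function names are made up for illustration. It assumes the usual
definition of the noncharacters (U+FDD0..U+FDEF plus the last two code
points of each of the 17 planes) and uses the standard unicodedata
module to tell (1) from (2), so that split depends on whichever Unicode
revision your Python build ships with. Surrogates and private-use code
points are not handled specially here.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Case (3): U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Sort a code point into one of the three cases (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if is_noncharacter(cp):
        return "noncharacter"   # case (3): stable, never a character
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"     # case (2): may be allocated later
    # Case (1) for this Unicode revision. Simplification: surrogates (Cs)
    # and private-use code points (Co) also land here.
    return "assigned"
```

For example, classify(0x41) gives "assigned" and classify(0xFFFE) gives
"noncharacter". Only the first two answers can change between Unicode
revisions.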

I think JSON should allow characters from (1) and (2) to avoid being
dependent on a specific Unicode revision. I do not think (3) should be
allowed - this would cause problems with many existing parsers which
represent JSON strings using another system's native Unicode
representation.
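
As a sketch of what that rule could look like in practice (not a claim
about how any existing parser behaves), the following wraps Python's
json.loads and rejects any string containing a noncharacter, while
letting unassigned code points through. strict_loads and the helper
names are hypothetical.

```python
import json
import re

# U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF for each of the 17 planes.
_PLANE_ENDS = "".join(
    chr((plane << 16) | low)
    for plane in range(0x11)
    for low in (0xFFFE, 0xFFFF)
)
NONCHARACTER = re.compile("[\uFDD0-\uFDEF" + _PLANE_ENDS + "]")


def _check(value):
    """Walk a parsed JSON value and reject noncharacters in any string."""
    if isinstance(value, str):
        match = NONCHARACTER.search(value)
        if match:
            raise ValueError(
                "noncharacter U+%04X in JSON string" % ord(match.group())
            )
    elif isinstance(value, list):
        for item in value:
            _check(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            _check(key)
            _check(item)
    return value


def strict_loads(text):
    """Parse JSON, then apply the 'allow (1) and (2), reject (3)' rule."""
    return _check(json.loads(text))
```

With this, strict_loads('{"End of data marker": "\\uFFFF"}') raises
ValueError, whereas a string holding a currently unassigned code point
is accepted no matter which Unicode revision is in use.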

The argument about test suites does not seem compelling, as any such
test suite testing the behaviour of string functions with bad Unicode
would also include invalidly-encoded Unicode (such as overlong UTF-8
sequences), which cannot be represented at all in JSON, even with
escaping.
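
A small illustration of that last point, using Python's standard json
module as a stand-in for "a JSON parser" (other parsers may differ in
the details). The bytes C0 AF are the classic overlong encoding of '/'.
They are not valid UTF-8, so they cannot appear in a JSON text, and a
\u escape names a code point rather than a byte sequence, so it cannot
reproduce them either.

```python
import json

# Overlong two-byte encoding of U+002F '/'; not well-formed UTF-8.
OVERLONG_SLASH = b"\xc0\xaf"


def is_well_formed_utf8(data: bytes) -> bool:
    """True if the bytes decode under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses_as_json(data: bytes) -> bool:
    """True if json.loads accepts the raw bytes as a JSON text."""
    try:
        json.loads(data)
        return True
    except ValueError:  # includes UnicodeDecodeError and JSONDecodeError
        return False


# An escape can only name the code point; re-encoding it always yields
# the shortest form, never the overlong one.
escaped = json.loads('"\\u002f"')
```

Here is_well_formed_utf8(OVERLONG_SLASH) is False and
parses_as_json(b'"' + OVERLONG_SLASH + b'"') is False, while escaped is
just "/" and encodes to the single byte 2F.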

Stephen

On Fri, Jun 7, 2013 at 4:56 PM, Paul Hoffman <paul.hoffman@vpnc.org> wrote:
> <no hat>
>
> This may be a part of the spec where some people have to hold their noses. The Unicode definition of "character" does not include non-characters, and the code points for some of those non-characters make sense in JSON strings when those strings. Bjoern has pointed out a good one: strings used for test cases of other code. The issue is not just unpaired surrogates. Do we *really* want to prohibit:
>    { "End of data marker": "\uFFFF" }
>
> Proposal:
>
> Remove the word "character" from the spec except in an explanatory paragraph in Section 2.5 that says:
>    All code points, even those that represent non-characters in the Unicode specification [UNICODE], are allowed in JSON strings.
>
> --Paul Hoffman