Re: [Json] On characters and code points

Tim Bray <tbray@textuality.com> Fri, 07 June 2013 16:02 UTC

In-Reply-To: <56A163E9-E7CD-46B3-9984-8F009EBFF500@vpnc.org>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2E7E1@xmb-rcd-x10.cisco.com> <51B06F38.8050707@crockford.com> <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com> <51B1B4E7.8090101@it.aoyama.ac.jp> <9ld3r8pc0tufif18dohb2fmi0ijna1vs4n@hive.bjoern.hoehrmann.de> <56A163E9-E7CD-46B3-9984-8F009EBFF500@vpnc.org>
Date: Fri, 07 Jun 2013 09:01:39 -0700
Message-ID: <CAHBU6ivG=ONc8roT7W=LdpKYNMqRH_d5BobZ=pHnk=mVaKZKaA@mail.gmail.com>
From: Tim Bray <tbray@textuality.com>
To: Paul Hoffman <paul.hoffman@vpnc.org>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] On characters and code points

On Fri, Jun 7, 2013 at 8:56 AM, Paul Hoffman <paul.hoffman@vpnc.org> wrote:

> This may be a part of the spec where some people have to hold their noses.
> The Unicode definition of "character" does not include non-characters, and
> the code points for some of those non-characters make sense in JSON strings
> when those strings. Bjoern has pointed out a good one: strings used for
> test cases of other code. The issue is not just unpaired surrogates. Do we
> *really* want to prohibit:
>    { "End of data marker": "\uFFFF" }
>

Yes, I *really* want to prohibit that. The one corner case it buys you is
outweighed, by a factor of a thousand or so, by the cost of no longer being
able to use general-purpose string processing software to deal with JSON
payloads.
BTW, a huge amount of deployed software out there ALREADY processes JSON
text fields using general-purpose string processing libraries, and will
explode unpredictably and in hard-to-debug ways if this starts happening.
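
To make concrete which code points are at issue, here is a minimal Python
sketch, not anything from the draft. It assumes the Unicode definitions of
surrogates (U+D800..U+DFFF) and the 66 noncharacters (U+FDD0..U+FDEF plus the
last two code points of each plane). The names problem_kind and
find_problem_code_points are hypothetical helpers invented for this
illustration.

```python
import json
from typing import List, Optional, Tuple


def problem_kind(cp: int) -> Optional[str]:
    """Classify a code point as 'surrogate', 'noncharacter', or None."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return None


def find_problem_code_points(s: str) -> List[Tuple[int, int, str]]:
    """Return (index, code point, kind) for each surrogate or noncharacter in s."""
    found = []
    for i, ch in enumerate(s):
        kind = problem_kind(ord(ch))
        if kind is not None:
            found.append((i, ord(ch), kind))
    return found


# Python's json module accepts the example from the quoted message as-is.
doc = json.loads('{"End of data marker": "\\uFFFF"}')
value = doc["End of data marker"]
print(find_problem_code_points(value))  # [(0, 65535, 'noncharacter')]
```

A receiver that wanted to enforce the prohibition could run a check of this
kind on every key and string value after parsing; whether to reject or repair
is a separate policy choice.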

Also, consider the lovely consequences when unpaired surrogates start
showing up in key fields and are fed to hash functions in every
programming language in the world, which expect to receive Unicode
characters.
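
One way this can go wrong, sketched in Python as an illustration only: the
parser accepts a lone surrogate escape in a key, but any hash computed over
the key's UTF-8 bytes fails, because a lone surrogate has no valid UTF-8
encoding. The helper hash_key is hypothetical, and other languages may fail
differently or not at all.

```python
import hashlib
import json

# json.loads accepts an unpaired surrogate escape and returns it in a str.
doc = json.loads('{"\\ud800": 1}')
key = next(iter(doc))


def hash_key(k: str) -> str:
    """Hash a key by its UTF-8 bytes, as a byte-oriented hash function must."""
    return hashlib.sha256(k.encode("utf-8")).hexdigest()


try:
    hash_key(key)
    outcome = "hashed"
except UnicodeEncodeError:
    # A lone surrogate cannot be encoded as UTF-8, so hashing never happens.
    outcome = "UnicodeEncodeError"

print(outcome)  # UnicodeEncodeError
```

The failure surfaces far from the parser, at whatever point the key is first
encoded, which is what makes it hard to trace back to the JSON input.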

 -T



>
> Proposal:
>
> Remove the word "character" from the spec except in an explanatory
> paragraph in Section 2.5 that says:
>    All code points, even those that represent non-characters in the
> Unicode specification [UNICODE], are allowed in JSON strings.
>
> --Paul Hoffman