[Json] Proposal for strings/Unicode text

Tim Bray <tbray@textuality.com> Wed, 12 June 2013 18:47 UTC


Rationale:
- emphasize the important fact that Strings are *intended* for Unicode
characters
- document the important fact that the rules allow horrible Unicode
practices
- say “backslash” instead of “reverse solidus” :)

In section 1, introduction

Before:
   A string is a sequence of zero or more Unicode characters [UNICODE].
After:
   A string is intended to contain sequences of zero or more Unicode
   characters [UNICODE 6.2].

Rewrite section 2.5 as follows:

Strings begin and end with quotation marks.  They are intended to contain
sequences of Unicode characters.  Note, however, that the ABNF in this section
allows the inclusion of 16-bit quantities in ways that can never be useful
for representing characters and are likely to cause breakage in software
designed to process Unicode text.
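
To make the breakage concrete, here is a minimal sketch in Python, assuming
the standard-library json module as the parser (the helper name is_encodable
is made up for illustration).  A lone surrogate escape matches the ABNF, but
the resulting value is not a Unicode character and cannot be written out as
UTF-8.

```python
import json

# A lone high surrogate: permitted by the string ABNF, but it does not
# represent any Unicode character on its own.
lone = json.loads('"\\uD834"')

# The same high surrogate followed by its low surrogate is a real character.
paired = json.loads('"\\uD834\\uDD1E"')


def is_encodable(s: str) -> bool:
    """Return True if s can be serialized as UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(is_encodable(lone))    # the lone surrogate cannot be encoded
print(is_encodable(paired))  # the full pair can
```

Other parsers may reject such input outright or substitute U+FFFD; the point
is only that behaviour differs once strings stop being Unicode text.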

The ABNF allows the use of many Unicode code points that could be used in
future to represent Unicode characters, but have not yet been assigned.
Therefore, this specification should not need revision as the Unicode
character repertoire continues to grow.

16-bit quantities (normally Unicode characters from the Basic Multilingual
Plane, U+0000 through U+FFFF) may be “escaped”, or represented as a
six-character sequence: a backslash (U+005C REVERSE SOLIDUS), followed by
the lowercase letter u, followed by four hexadecimal digits that encode the
character's code point.  The hexadecimal letters A through F can be upper or
lower case.  So, for example, a string containing only a single backslash
may be represented as "\u005C".
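
The six-character form can be sketched as follows.  This is an illustration
only: escape_code_unit is a hypothetical helper, and Python's json module
stands in for "a JSON parser" when checking the result.

```python
import json


def escape_code_unit(unit: int) -> str:
    """Return the six-character escape for one 16-bit quantity.

    Hypothetical helper: backslash, lowercase 'u', four hex digits.
    """
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit quantity: %r" % unit)
    return "\\u{:04X}".format(unit)


escaped = escape_code_unit(0x5C)           # the backslash itself
decoded_upper = json.loads('"\\u005C"')    # upper-case hex digits
decoded_lower = json.loads('"\\u005c"')    # lower-case hex digits

print(escaped, len(escaped))
print(decoded_upper == decoded_lower == "\\")
```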

Alternatively, there are two-character escape sequences for some popular
characters.  So, for example, a string containing only a single backslash
may be represented more compactly as "\\".
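
As a rough check that the short and long forms are interchangeable, the
sketch below lists the two-character escapes defined by RFC 4627 and compares
each with its \uXXXX equivalent.  The names SHORT_ESCAPES and decode are
invented for this example, and Python's json module is assumed as the parser.

```python
import json

# Two-character escapes from RFC 4627, mapped to the code point they denote.
SHORT_ESCAPES = {
    '\\"': 0x22,   # quotation mark
    '\\\\': 0x5C,  # backslash
    '\\/': 0x2F,   # solidus
    '\\b': 0x08,   # backspace
    '\\f': 0x0C,   # form feed
    '\\n': 0x0A,   # line feed
    '\\r': 0x0D,   # carriage return
    '\\t': 0x09,   # tab
}


def decode(escape: str) -> str:
    """Parse a JSON string whose entire content is one escape sequence."""
    return json.loads('"' + escape + '"')


# Every short escape should decode to the same character as the long form.
all_match = all(
    decode(short) == decode("\\u{:04X}".format(cp)) == chr(cp)
    for short, cp in SHORT_ESCAPES.items()
)
print(all_match)
```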

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair.  So, for example, a string containing
only U+1D11E MUSICAL SYMBOL G CLEF may be represented as
   "\uD834\uDD1E".

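The surrogate-pair arithmetic behind the G clef example can be sketched like
this.  escape_supplementary is a hypothetical helper name; the computation is
the standard UTF-16 one (subtract 0x10000, then split into two 10-bit halves).

```python
import json


def escape_supplementary(code_point: int) -> str:
    """Return the twelve-character escape for a non-BMP code point.

    Hypothetical helper illustrating the UTF-16 surrogate-pair encoding.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %r" % code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


g_clef = escape_supplementary(0x1D11E)
round_trip = json.loads('"' + g_clef + '"')

print(g_clef, len(g_clef))
print(round_trip == "\U0001D11E")
```
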
=== insert ABNF here ====