Re: [Json] Proposal for strings/Unicode text

Tim Bray <tbray@textuality.com> Wed, 12 June 2013 22:50 UTC

From: Tim Bray <tbray@textuality.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text

Revised per Carsten's and Norbert’s input. I’m unconvinced about Appendix
M since I think the “16-bit quantities” paragraph says what needs saying,
but he caught a few bugs.

Before:
   A string is a sequence of zero or more Unicode characters [UNICODE].
After:
   A string is intended to contain a sequence of zero or more Unicode
   characters [UNICODE 6.2].

Rewrite section 2.5 as follows:

Strings are delimited with quotation marks (U+0022 QUOTATION MARK).  They
are intended to contain sequences of Unicode characters.  Note however that
the normative ABNF in this section allows the inclusion of 16-bit
quantities, for example unpaired surrogate-block code points, in ways which
can never be useful for representing characters and are likely to cause
breakage in software designed to process Unicode text.
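
(Illustration only, not proposed spec text.)  A minimal Python sketch of
the kind of breakage meant here, assuming CPython's standard json module,
which accepts an unpaired surrogate escape and hands back a string that
cannot be encoded as UTF-8.  The variable names are made up for the
example.

```python
import json

# A lone high surrogate written as a JSON escape; the ABNF permits this.
text = json.loads(r'"\uD834"')

# The parser returns a one-element string holding the bare surrogate.
code_point = ord(text)

# Software that expects real Unicode text fails as soon as it tries to
# serialize the value, for example as UTF-8.
try:
    text.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

Other parsers may reject the input or substitute U+FFFD instead; the
behavior shown is specific to the assumed implementation.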

The ABNF allows the use of many Unicode code points that could be used in
future to represent Unicode characters, but have not yet been assigned.
Therefore, this specification should not need revision as the Unicode
character repertoire continues to grow.

16-bit quantities, normally representing Unicode characters from the Basic
Multilingual Plane (U+0000 through U+FFFF), may be “escaped”, or represented
as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS),
followed by the lowercase letter u, followed by four hexadecimal digits
that encode the character's code point.  The hexadecimal letters A through F
can be upper or lower case.  So, for example, a string containing only a
single backslash may be represented as "\u005C".

Alternatively, some popular characters have two-character escape
representations.  So, for example, a string containing only a
single backslash may be represented more compactly as "\\".
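
(Illustration only, not proposed spec text.)  A short Python sketch,
assuming the standard json module, showing that the six-character escape
in either hex case and the two-character escape all decode to the same
one-character string.  The variable names are hypothetical.

```python
import json

# Raw string literals keep the backslashes intact so that the JSON
# parser, not Python, interprets the escapes.
six_char_upper = json.loads(r'"\u005C"')
six_char_lower = json.loads(r'"\u005c"')
two_char = json.loads(r'"\\"')

# All three representations yield a string holding a single backslash.
all_equal = six_char_upper == six_char_lower == two_char == "\\"
```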

To escape a character that is not in the Basic Multilingual Plane, the
character is represented as a twelve-character sequence, encoding the
UTF-16 surrogate pair.  So, for example, a string containing only U+1D11E G
CLEF may be represented as
   "\uD834\uDD1E".
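
(Illustration only, not proposed spec text.)  The twelve-character form
follows from the standard UTF-16 surrogate arithmetic.  A minimal Python
sketch; the function name escape_non_bmp is made up for the example, and
the round trip assumes the standard json module.

```python
import json

def escape_non_bmp(code_point: int) -> str:
    """Return the twelve-character JSON escape for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000         # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits: low surrogate
    return f"\\u{high:04X}\\u{low:04X}"

escaped = escape_non_bmp(0x1D11E)         # U+1D11E G CLEF

# Parsing the escaped form recombines the pair into one character.
decoded = json.loads('"' + escaped + '"')
```

For U+1D11E the offset is 0xD11E, which splits into 0x34 and 0x11E, giving
the surrogates D834 and DD1E used in the example above.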

=== insert ABNF here ====


On Wed, Jun 12, 2013 at 3:30 PM, Carsten Bormann <cabo@tzi.org> wrote:

> Hmm.
>
> Somehow I think the JSON specification should focus on describing what the
> intended usage is.
>
> I strongly prefer adding an appendix M for things that you can do with the
> ABNF that are almost, but not entirely unlike JSON.
>
> Grüße, Carsten
>
> PS.: I support jettisoning liturgical language in standards, and I applaud
> Douglas for slipping with respect to the liturgical term "octet" only twice
> (both in the same paragraph).  But a document must also speak the language
> of its editor, and if Douglas thinks "reverse solidus" is the best way to
> speak about what some Germans (but never me) would call "Rückschräger",
> that's fine, as long as it is consistent.
>
> PPS.: On the specific wording:
> > Before:
> >    A string is a sequence of zero or more Unicode characters [UNICODE].
> > After:
> >    A string is intended to contain sequences of zero or more Unicode
> >    characters [UNICODE 6.2]
>
> A string is a sequence of characters.  [not sequences of them]
> Add something like: "To reduce the burden on implementations, JSON is less
> selective in what it accepts as a character than Unicode itself is.  See
> also Appendix M."
>
> > Strings begin and end with quotation marks.
>
> The representation does, the string rarely does.
> RFC4627 got this right consistently, but in a tight language where
> extracting a single sentence may lose the necessary context.
>
> > They are intended,to contain sequences of Unicode characters; Note
> > however that the ABNF in this section allows the inclusion of 16-bit
> > quantities in ways which can never be useful for representing characters
> > and is likely to cause breakage in software designed to process Unicode
> > text.
>
> This is where I would simply point to Appendix M.
>
> > The ABNF allows the use of many Unicode code points that could be used
> > in future to represent Unicode characters, but have not yet been assigned.
> > Therefore, this specification should not need revision as the Unicode
> > character repertoire continues to grow.
>
> This is something that even could be said in the introduction.
> Or in a section about stability and protocol evolution (the same section
> that is needed to say that [past and future] changes in JavaScript don't
> change JSON).
>
> > 16-bit quantities
>
> These are Unicode code points.
>
> > (normally Unicode characters from the Basic Multingual Pane(U+0000
> > through U+FFFF) may be “escaped”, or represented as a six-character
> > sequence: a backslash (U+005C REVERSE SOLIDUS), followed  by the lowercase
> > letter u, followed by four hexadecimal digits that encode the character's
> > code point.  The hexadecimal letters A though F can be upper or lower case.
> > So, for example, a string containing only a single backslash may be
> > represented as "\u005C".
> >
> > Alternatively, there are two-character sequence escape representations
> > of some popular characters.  So, for example, a string containing only a
> > single backslash may be represented more compactly as "\\".
> >
> >  To escape an extended character
>
> Non-BMP characters aren't "extended".  They have rights, too!
> You MUST rid your mind of discriminating against them.
> (I know, this was just copied over...)
>
> > that is not in the Basic Multilingual Plane, the character is
> > represented as a twelve-character sequence, encoding the UTF-16 surrogate
> > pair.  So, for example, a string containing only U+1D11E G CLEF may be
> > represented as
> >    "\uD834\uDD1E".
>
> (This could add a small apologetic clause pointing out the UTF-16 roots of
> the weird notation.  Or not.)
> This needs another pointer to appendix M.
>
>