Re: [Json] Proposal for strings/Unicode text

Tim Bray <tbray@textuality.com> Wed, 12 June 2013 22:50 UTC

Date: Wed, 12 Jun 2013 15:49:56 -0700
From: Tim Bray <tbray@textuality.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: "json@ietf.org" <json@ietf.org>

Revised per Carsten’s and Norbert’s input. I’m unconvinced about Appendix
M since I think the “16-bit quantities” para says what needs saying, but he
caught a few bugs.

Before:
   A string is a sequence of zero or more Unicode characters [UNICODE].
After:
   A string is intended to contain a sequence of zero or more Unicode
characters [UNICODE 6.2].

Rewrite section 2.5 as follows:

Strings are delimited with quotation marks (U+0022 QUOTATION MARK).  They
are intended to contain sequences of Unicode characters.  Note, however, that
the normative ABNF in this section allows the inclusion of 16-bit
quantities, for example unpaired surrogate-block code points, in ways which
can never be useful for representing characters and are likely to cause
breakage in software designed to process Unicode text.
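
As an illustration of the breakage (a sketch only, not part of the
proposed text): Python's standard json module accepts an escaped unpaired
surrogate, but the resulting string cannot be encoded as UTF-8.  The names
lone_surrogate_json and is_encodable_as_utf8 are made up for this example.

```python
import json

# JSON text whose string holds one escaped high surrogate with no
# low surrogate following it.  The grammar permits this.
lone_surrogate_json = '"\\uD834"'

# The parser follows the grammar and returns a one-element string.
decoded = json.loads(lone_surrogate_json)


def is_encodable_as_utf8(s: str) -> bool:
    """Return True if s can be serialized as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The parsed value is not Unicode text that downstream software can encode.
print(is_encodable_as_utf8(decoded))
```

Other parsers may reject or replace such input, so the behaviour shown is
specific to this one implementation.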

The ABNF allows the use of many Unicode code points that could be used in
future to represent Unicode characters, but have not yet been assigned.
Therefore, this specification should not need revision as the Unicode
character repertoire continues to grow.

16-bit quantities, normally representing Unicode characters from the Basic
Multilingual Plane (U+0000 through U+FFFF), may be “escaped”, or represented
as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS),
followed by the lowercase letter u, followed by four hexadecimal digits
that encode the character's code point.  The hexadecimal letters A through F
can be upper or lower case.  So, for example, a string containing only a
single backslash may be represented as "\u005C".
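
A minimal sketch of the six-character form, assuming Python and its
standard json module; the helper name u_escape is hypothetical and exists
only for this illustration.

```python
import json


def u_escape(ch: str) -> str:
    """Return the six-character \\uXXXX escape for one BMP character."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("not in the Basic Multilingual Plane")
    # Backslash, lowercase u, then exactly four hexadecimal digits.
    return "\\u{:04X}".format(code_point)


backslash_escape = u_escape("\\")

# Upper- and lower-case hexadecimal letters decode to the same character.
upper_form = json.loads('"\\u005C"')
lower_form = json.loads('"\\u005c"')
print(backslash_escape, upper_form == lower_form)
```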

Alternatively, there are two-character sequence escape representations of
some popular characters.  So, for example, a string containing only a
single backslash may be represented more compactly as "\\".
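
For illustration, the two-character escapes defined by RFC 4627 can be
checked against a parser.  This is a sketch using Python's json module; the
table name TWO_CHAR_ESCAPES is an assumption of the example, not a term from
the specification.

```python
import json

# Two-character escape (as it appears in JSON text) -> character it denotes.
TWO_CHAR_ESCAPES = {
    '\\"': '"',    # quotation mark
    '\\\\': '\\',  # reverse solidus (backslash)
    '\\/': '/',    # solidus
    '\\b': '\b',   # backspace
    '\\f': '\f',   # form feed
    '\\n': '\n',   # line feed
    '\\r': '\r',   # carriage return
    '\\t': '\t',   # tab
}

# Wrap each escape in quotation marks and parse it as a JSON string.
decoded = {esc: json.loads('"' + esc + '"') for esc in TWO_CHAR_ESCAPES}

# The compact and the six-character forms denote the same string.
same_backslash = json.loads('"\\\\"') == json.loads('"\\u005C"')
print(decoded == TWO_CHAR_ESCAPES, same_backslash)
```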

To escape a character that is not in the Basic Multilingual Plane,
the character is represented as a twelve-character sequence, encoding the
UTF-16 surrogate pair.  So, for example, a string containing only U+1D11E G
CLEF may be represented as
   "\uD834\uDD1E".
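
The surrogate-pair arithmetic behind that example can be sketched as
follows, assuming Python; the function name surrogate_escape is
hypothetical, and the computation is the standard UTF-16 one.

```python
import json


def surrogate_escape(code_point: int) -> str:
    """Return the twelve-character escape for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


g_clef_escape = surrogate_escape(0x1D11E)

# Parsing the escaped form recombines the pair into one character.
g_clef = json.loads('"' + g_clef_escape + '"')
print(g_clef_escape, len(g_clef))
```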

=== insert ABNF here ====

On Wed, Jun 12, 2013 at 3:30 PM, Carsten Bormann <cabo@tzi.org> wrote:

> Hmm.
>
> Somehow I think the JSON specification should focus on describing what the
> intended usage is.
>
> I strongly prefer adding an appendix M for things that you can do with the
> ABNF that are almost, but not entirely unlike JSON.
>
> Grüße, Carsten
>
> PS.: I support jettisoning liturgical language in standards, and I applaud
> Douglas for slipping with respect to the liturgical term "octet" only twice
> (both in the same paragraph).  But a document must also speak the language
> of its editor, and if Douglas thinks "reverse solidus" is the best way to
> speak about what some Germans (but never me) would call "Rückschräger",
> that's fine, as long as it is consistent.
>
> PPS.: On the specific wording:
> > Before:
> >    A string is a sequence of zero or more Unicode characters [UNICODE].
> > After:
> >    A string is intended to contain sequences of zero or more Unicode
> > characters [UNICODE 6.2]
>
> A string is a sequence of characters.  [not sequences of them]
> Add something like: "To reduce the burden on implementations, JSON is less
> selective in what it accepts as a character than Unicode itself is.  See
> also Appendix M."
>
> > Strings begin and end with quotation marks.
>
> The representation does, the string rarely does.
> RFC4627 got this right consistently, but in a tight language where
> extracting a single sentence may lose the necessary context.
>
> > They are intended to contain sequences of Unicode characters; Note
> > however that the ABNF in this section allows the inclusion of 16-bit
> > quantities in ways which can never be useful for representing characters
> > and is likely to cause breakage in software designed to process Unicode
> > text.
>
> This is where I would simply point to Appendix M.
>
> > The ABNF allows the use of many Unicode code points that could be used
> > in future to represent Unicode characters, but have not yet been assigned.
> > Therefore, this specification should not need revision as the Unicode
> > character repertoire continues to grow.
>
> This is something that could even be said in the introduction.
> Or in a section about stability and protocol evolution (the same section
> that is needed to say that [past and future] changes in JavaScript don't
> change JSON).
>
> > 16-bit quantities
>
> These are Unicode code points.
>
> > (normally Unicode characters from the Basic Multilingual Plane (U+0000
> > through U+FFFF)) may be “escaped”, or represented as a six-character
> > sequence: a backslash (U+005C REVERSE SOLIDUS), followed by the lowercase
> > letter u, followed by four hexadecimal digits that encode the character's
> > code point.  The hexadecimal letters A through F can be upper or lower case.
> > So, for example, a string containing only a single backslash may be
> > represented as "\u005C".
> >
> > Alternatively, there are two-character sequence escape representations
> > of some popular characters.  So, for example, a string containing only a
> > single backslash may be represented more compactly as "\\".
> >
> >  To escape an extended character
>
> Non-BMP characters aren't "extended".  They have rights, too!
> You MUST rid your mind of discriminating against them.
> (I know, this was just copied over...)
>
> > that is not in the Basic Multilingual Plane, the character is
> > represented as a twelve-character sequence, encoding the UTF-16 surrogate
> > pair.  So, for example, a string containing only U+1D11E G CLEF may be
> > represented as
> >    "\uD834\uDD1E".
>
> (This could add a small apologetic clause pointing out the UTF-16 roots of
> the weird notation.  Or not.)
> This needs another pointer to appendix M.
>
>
