Re: [Json] [Technical Errata Reported] RFC8259 (7603)

Tim Bray <tbray@textuality.com> Mon, 14 August 2023 06:33 UTC

From: Tim Bray <tbray@textuality.com>
Date: Sun, 13 Aug 2023 23:33:20 -0700
Message-ID: <CAHBU6itrQ3B1O=YSLRvZ1iP_nf+JpmdZipqwOhV_+3-VU58v8w@mail.gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: "Murray S. Kucherawy" <superuser@gmail.com>, Francesca Palombini <francesca.palombini@ericsson.com>, linuxwolf+ietf@outer-planes.net, Guillaume Fortin-Debigaré <guillaume.fortin@debigare.com>, json@ietf.org, RFC Errata System <rfc-editor@rfc-editor.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/KO08VJNUMIq-V0mjgRHuwTO3sMM>

I think the report is correct.

The erratum is that the RFC uses the term “Unicode character”, which
doesn’t have a straightforward definition in the Unicode spec that a
developer can look up. There are two useful definitions that might be used
here. A “Unicode code point” (D10 in the spec) is any value in the range of
integers from 0 to 10FFFF (hexadecimal). A “Unicode scalar value” (D76) is
any Unicode code point except high-surrogate and low-surrogate code points.

The first known JSON spec, preserved at json.org, is perfectly clear: “Any
codepoint except " or \ or control characters”. When the spec came over to
the IETF with RFC4627 it took a step backward, referring throughout to
“Unicode characters”. While it would be vastly preferable if JSON had
restricted itself to Unicode scalar values, it didn’t, and as far as I know,
the good implementations over the years have been perfectly happy to accept
strings with unpaired surrogates.

When IETF JSON progressed through 7159 to 8259, the regrettable use of
“Unicode characters” was allowed to persist - as editor, that’d be my fault,
sorry - although the RFC did a reasonably good job of pointing out the
problem and recommending against solo surrogates. I can’t recall if the WG
considered the issue, but in any case it didn’t decide to rewrite that
particular bit of history.
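
To make the two definitions concrete, here is a minimal Python sketch. The
helper names (is_code_point, is_scalar_value, utf8_encodable) are made up
for illustration and are not from any spec. It also shows one parser,
CPython's json module, accepting a lone surrogate escape; other parsers may
behave differently.

```python
import json

def is_code_point(n: int) -> bool:
    # D10: any integer in the range 0..10FFFF (hex).
    return 0 <= n <= 0x10FFFF

def is_scalar_value(n: int) -> bool:
    # D76: any code point except the surrogates D800..DFFF (hex).
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)

def utf8_encodable(s: str) -> bool:
    # Python's strict UTF-8 encoder refuses surrogate code points.
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# A JSON text whose only string character is an unpaired high surrogate.
value = json.loads('"\\ud800"')

print(hex(ord(value)))              # the parsed string holds one code point
print(is_code_point(ord(value)))    # a code point...
print(is_scalar_value(ord(value)))  # ...but not a scalar value
print(utf8_encodable(value))        # so it cannot be written as UTF-8
```

The last check is where the trouble usually shows up: the string parses
fine, but cannot be re-serialized as well-formed UTF-8.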

If you want JSON with only Unicode scalar values, I-JSON (RFC7493) has what
you need: https://www.rfc-editor.org/rfc/rfc7493.html#section-2.1
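
As a rough illustration of that section's string rule (no surrogates, no
noncharacters), a check over already-parsed data might look like the sketch
below. The function names are hypothetical, and this covers only the string
constraint, not I-JSON's other requirements such as those on numbers or
duplicate member names.

```python
import json

def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (those whose low 16 bits are FFFE or FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def string_ok(s: str) -> bool:
    return not any(is_surrogate(ord(c)) or is_noncharacter(ord(c)) for c in s)

def value_ok(v) -> bool:
    # Walk a parsed JSON value, checking member names and string values.
    if isinstance(v, str):
        return string_ok(v)
    if isinstance(v, list):
        return all(value_ok(item) for item in v)
    if isinstance(v, dict):
        return all(string_ok(k) and value_ok(item) for k, item in v.items())
    return True  # numbers, true, false, null are out of scope here

# A properly paired escape decodes to one scalar value and passes;
# an unpaired surrogate does not.
print(value_ok(json.loads('{"a": ["\\ud83d\\ude00"]}')))
print(value_ok(json.loads('["\\ud800"]')))
```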

IETF-specified protocols should without exception require I-JSON.

But in the real world JSON strings contain any old combination of Unicode
code points, as described in the report.

On Aug 13, 2023 at 10:21:18 PM, Carsten Bormann <cabo@tzi.org> wrote:

> When the IETF is moving forward a document from Proposed Standard to
> Internet Standard, it usually considers which parts of the specification
> have generated useful interoperability and which ones possibly didn’t.
>
> Unfortunately, we were not entirely free to do this, as there was a
> political drive to align with ECMA JSON (ECMA 404), and we wanted to be
> able to say:
>
> > there are no
> > inconsistencies in the definition of the term "JSON text" in any of
> > its specifications.
>
>
> This led to this beautiful note:
>
> > Note, however, that ECMA-404 allows several
> > practices that this specification recommends avoiding in the
> > interests of maximal interoperability.
>
>
> JSON is based on Unicode; unlike XML there is no choice of other character
> sets.
> The one obvious thing that was fixed on the way to Internet Standard was
> to nail down that the interchange of that Unicode happens in UTF-8 in JSON
> (Section 8.1).
>
> Section 8.2 dances around the fact that ECMAScript is based on a legacy
> 16-bit Unicode character model, which leads to surrogate characters
> appearing during certain forms of string processing.  Worse, this character
> model has occasionally been exploited to represent arbitrary binary data in
> JSON text strings.  The ABNF in RFC 8259 therefore “allows” certain bit
> combinations that lead to invalid Unicode representations, but also
> explains that their interchange creates behavior that is “unpredictable”.
> This was the politically acceptable way to express the working group view
> that these representations are not part of proper JSON interchange, but are
> still “allowed” by the ABNF grammar supplied.
>
> Original Text
> -------------
>   A string is a sequence of zero or more Unicode characters [UNICODE].
>
> Corrected Text
> --------------
>   A string is a sequence of zero or more Unicode code points [UNICODE].
>
>
> Any attempt to make RFC 8259 more about representing the damage done to it
> by the ECMAScript legacy 16-bit Unicode character model instead of the
> interchange of clean JSON documents would have been viewed very dimly by
> the JSON WG.
>
> This errata report may not intentionally attempt to work around the WG
> consensus that led to RFC 8259, but its acceptance would very much
> effectively do that.
>
> This errata report, and any other changes that would effectively turn back
> the clock on JSON, must be rejected.
>
> Grüße, Carsten
>
>
