Re: [Json] [Technical Errata Reported] RFC8259 (7603)

Tim Bray <tbray@textuality.com> Mon, 14 August 2023 06:33 UTC

References: <20230813200941.250C13E8A7@rfcpa.amsl.com> <2E0F84CF-809D-4325-B60E-16FC2839E027@tzi.org>
In-Reply-To: <2E0F84CF-809D-4325-B60E-16FC2839E027@tzi.org>
From: Tim Bray <tbray@textuality.com>
Date: Sun, 13 Aug 2023 23:33:20 -0700
Message-ID: <CAHBU6itrQ3B1O=YSLRvZ1iP_nf+JpmdZipqwOhV_+3-VU58v8w@mail.gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: "Murray S. Kucherawy" <superuser@gmail.com>, Francesca Palombini <francesca.palombini@ericsson.com>, linuxwolf+ietf@outer-planes.net, Guillaume Fortin-Debigaré <guillaume.fortin@debigare.com>, json@ietf.org, RFC Errata System <rfc-editor@rfc-editor.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/KO08VJNUMIq-V0mjgRHuwTO3sMM>
Subject: Re: [Json] [Technical Errata Reported] RFC8259 (7603)

I think the report is correct.

The erratum is that the RFC uses the term “Unicode character”, which
doesn’t have a straightforward definition in the Unicode spec that a
developer can look up. There are two useful definitions that might be used
here. A “Unicode code point” (D10 in the spec) is any value in the range of
integers from 0 to 10FFFF (hexadecimal). A “Unicode scalar value” (D76) is
any Unicode code point except high-surrogate and low-surrogate code points.
The first known JSON spec, preserved at json.org, is perfectly clear: “Any
codepoint except " or \ or control characters”. When the spec came over to IETF with RFC4627
it took a step backward, referring throughout to “Unicode characters”.
While it would be vastly preferable if JSON had restricted itself to
Unicode scalar values, it didn’t, and as far as I know, the good
implementations over the years have been perfectly happy to accept strings
with unpaired surrogates. When IETF JSON progressed through 7159 to 8259,
the regrettable use of “Unicode characters” was allowed to persist - as
editor, that’d be my fault, sorry - although the RFC did a reasonably good
job of pointing out the problem and recommending against solo surrogates. I
can’t recall if the WG considered the issue, but in any case it didn’t
decide to rewrite that particular bit of history.
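
To make the two definitions concrete, here is a minimal Python sketch. The
helper names (is_code_point, is_surrogate, is_scalar_value) are made up for
illustration and are not taken from the RFC or the Unicode spec. The last few
lines show one lenient parser, CPython's standard json module, accepting an
escaped unpaired surrogate; other parsers may well behave differently.

```python
import json

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode codespace


def is_code_point(n: int) -> bool:
    """D10: any integer in the range 0..10FFFF (hex)."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """High-surrogate (D800..DBFF) or low-surrogate (DC00..DFFF) code point."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """D76: any code point that is not a surrogate code point."""
    return is_code_point(n) and not is_surrogate(n)


# CPython's json module accepts an escaped unpaired surrogate and
# hands back a one-element string holding that surrogate code point.
lone = json.loads('"\\ud83d"')

# A properly paired escape is combined into a single scalar value.
paired = json.loads('"\\ud83d\\ude00"')

print([hex(ord(ch)) for ch in lone], is_scalar_value(ord(lone)))
print([hex(ord(ch)) for ch in paired], is_scalar_value(ord(paired)))
```

So the string in `lone` is a sequence of code points but not a sequence of
scalar values, which is exactly the gap between the two readings of “Unicode
character”.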

If you want JSON with only Unicode scalar values, I-JSON (RFC7493) has what
you need: https://www.rfc-editor.org/rfc/rfc7493.html#section-2.1

IETF-specified protocols should without exception require I-JSON.

But in the real world JSON strings contain any old combination of Unicode
code points, as described in the report.
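
For a receiver that wants the scalar-values-only behaviour, one possible
approach is to parse leniently and then reject any document whose strings
still contain surrogate code points. The sketch below assumes CPython's json
module, which combines valid escaped pairs while parsing, so any surrogate
left in a parsed string is unpaired. The function names are hypothetical, and
this covers only the surrogate restriction, not the other constraints in
RFC7493.

```python
import json


def has_surrogate(s: str) -> bool:
    """True if the string contains any surrogate code point (D800..DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def only_scalar_values(value) -> bool:
    """Recursively check member names and string values of a parsed document."""
    if isinstance(value, str):
        return not has_surrogate(value)
    if isinstance(value, list):
        return all(only_scalar_values(item) for item in value)
    if isinstance(value, dict):
        return all(
            only_scalar_values(name) and only_scalar_values(member)
            for name, member in value.items()
        )
    # Numbers, booleans and null carry no string data.
    return True


clean = json.loads('{"a": ["ok", "\\ud83d\\ude00"]}')
dirty_value = json.loads('{"a": ["\\udead"]}')
dirty_name = json.loads('{"\\ud800": 1}')

print(only_scalar_values(clean))
print(only_scalar_values(dirty_value))
print(only_scalar_values(dirty_name))
```

Checking after parsing keeps the example short; a stricter design would
reject the input inside the tokenizer, before any string is built.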


On Aug 13, 2023 at 10:21:18 PM, Carsten Bormann <cabo@tzi.org> wrote:

> When the IETF is moving forward a document from Proposed Standard to
> Internet Standard, it usually considers which parts of the specification
> have generated useful interoperability and which ones possibly didn’t.
>
> Unfortunately, we were not entirely free to do this, as there was a
> political drive to align with ECMA JSON (ECMA 404), and we wanted to be able
> to say:
>
> > there are no
> > inconsistencies in the definition of the term "JSON text" in any of
> > its specifications.
>
> This led to this beautiful note:
>
> > Note, however, that ECMA-404 allows several
> > practices that this specification recommends avoiding in the
> > interests of maximal interoperability.
>
> JSON is based on Unicode; unlike XML there is no choice of other character
> sets.
> The one obvious thing that was fixed on the way to Internet Standard was
> to nail down that the interchange of that Unicode happens in UTF-8 in JSON
> (Section 8.1).
>
> Section 8.2 dances around the fact that ECMAScript is based on a legacy
> 16-bit Unicode character model, which leads to surrogate characters
> appearing during certain forms of string processing.  Worse, this character
> model has occasionally been exploited to represent arbitrary binary data in
> JSON text strings.  The ABNF in RFC 8259 therefore “allows” certain bit
> combinations that lead to invalid Unicode representations, but also
> explains that their interchange creates behavior that is “unpredictable”.
> This was the politically acceptable way to express the working group view
> that these representations are not part of proper JSON interchange, but are
> still “allowed” by the ABNF grammar supplied.
>
> Original Text
> -------------
>   A string is a sequence of zero or more Unicode characters [UNICODE].
>
> Corrected Text
> --------------
>   A string is a sequence of zero or more Unicode code points [UNICODE].
>
> Any attempt to make RFC 8259 more about representing the damage done to it
> by the ECMAScript legacy 16-bit Unicode character model instead of the
> interchange of clean JSON documents would have been viewed very dimly by
> the JSON WG.
>
> This errata report may not intentionally attempt to work around the WG
> consensus that led to RFC 8259, but its acceptance would very much
> effectively do that.
>
> This errata report, and any other changes that would effectively turn back
> the clock on JSON, must be rejected.
>
> Grüße, Carsten
>
>