Re: [Json] Scope: Wire format or runtime format?

Norbert Lindenberg <ietf@lindenbergsoftware.com> Mon, 17 June 2013 08:21 UTC

Hi Martin,

On Jun 16, 2013, at 21:17 , Martin J. Dürst wrote:

> Hello Norbert,
> 
> On 2013/06/14 7:47, Norbert Lindenberg wrote:
>> In looking over older messages on this list, I found a message that made clear to me why we're having this endless discussion about Unicode surrogates - it's because we're not clear whether we're designing a wire format or a format that is also for use at runtime:
>> http://www.ietf.org/mail-archive/web/json/current/msg00355.html
> 
> This is kind of the first time I have heard the term "format for use at runtime" in this context. Of course there are formats used at runtime (internal representations of numbers and strings, for example), but I think the term "runtime format" only confuses our discussion. In my understanding, JavaScript does not have or use JSON as a runtime format. If that has changed, then please tell us.

Maybe you have a better term for it: What I mean in the ECMAScript context is the format that the functions JSON.parse and JSON.stringify parse and produce [1]. The format is based on the JSON grammar, represented in ECMAScript string values, that is, sequences of UTF-16 code units. ECMAScript doesn't specify wire formats (it doesn't even have any sort of I/O system), and has no control over what applications do with the strings before passing them to parse() or after obtaining them from stringify(). In many cases applications will get them from or send them to some I/O system (such as XMLHttpRequest in the browser or the file system module in Node.js), but they might modify them in between, or do something completely different.
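
A small sketch of that separation, assuming a modern engine such as Node.js where TextEncoder is available as a global (something ES5.1 itself does not provide; the sample text is made up for illustration): parse() and stringify() only ever see string values, and turning those into bytes is a separate step that the application owns.

```javascript
// JSON text held in an ECMAScript string: a sequence of UTF-16 code units.
const jsonText = '{"name":"D\u00FCrst"}';

// parse() consumes a string value, never bytes.
const value = JSON.parse(jsonText);

// stringify() likewise produces a string value.
const roundTripped = JSON.stringify(value);

// Choosing a wire encoding (UTF-8 here) is up to the application.
const wireBytes = new TextEncoder().encode(roundTripped);

console.log(roundTripped.length); // number of UTF-16 code units
console.log(wireBytes.length);    // number of UTF-8 bytes
```

The two lengths differ because the string is counted in UTF-16 code units while the encoded form is counted in bytes.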

>> Some people are coming from the runtime point of view, especially ECMAScript, where it's accepted practice to use ill-formed UTF-16 or even non-text in strings. At least the ill-formed UTF-16 is legitimized by section 2.7 of the Unicode standard.
> 
> "Accepted practice" is probably going a bit too far, and giving the wrong impression. My understanding is that the Unicode standard accepts this for efficiency reasons, not because it's in any way inherently useful. For ECMAScript, we have to add history to efficiency, but still I hope it's considered bad practice to actually use ill-formed UTF-16.

"Grudgingly accepted practice"? Unfortunately, considering it bad practice to use ill-formed UTF-16 or to represent JPEGs as string values doesn't make it go away.

>> Other people are coming from the wire protocol point of view, where clean formats are expected, in particular well-formed Unicode code unit sequences according to section 3.9 of the Unicode standard.
>> 
>> So which one shall it be?
>> 
>> If we adopt the wire protocol point of view
> 
> To me, it's very clear that we are describing (we are not designing; that happened a long time ago, and very implicitly) a wire format, or something close to a wire format (somebody mentioned JSON embedded in an (e.g.) EUC-JP HTML page (*)).
> 

>> and require well-formed code unit sequences,
> 
> For practical reasons, I think we shouldn't go there. But we should make it very clear (in the base spec, not relegated to some "best practices" document) that using lone surrogates is a bad idea, and that senders and receivers MAY reject such data (e.g. because of security concerns).

John Cowan pointed to RFC 3629, which prohibits the use of surrogate code points in UTF-8 [2]. Do you think that even for a wire format we can ignore that prohibition?
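
To make the conflict concrete, here is a rough sketch of what happens when a string with an unpaired surrogate meets a UTF-8 encoder. It assumes an environment with a global TextEncoder (for example a current browser or Node.js); hasLoneSurrogate is a hypothetical helper written for this illustration, not a function from any specification.

```javascript
// Hypothetical helper: returns true if the string contains a surrogate
// code unit that is not part of a high+low pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

const paired = '\uD83D\uDE00'; // U+1F600 as a surrogate pair
const lone = 'a\uD800b';       // unpaired high surrogate

console.log(hasLoneSurrogate(paired));
console.log(hasLoneSurrogate(lone));

// TextEncoder always emits well-formed UTF-8: it substitutes U+FFFD
// (bytes EF BF BD) for an unpaired surrogate instead of encoding it.
const bytes = Array.from(new TextEncoder().encode(lone));
console.log(bytes.map((b) => b.toString(16)).join(' '));
```

In other words, this particular encoder silently changes the data; a sender that wanted to reject such strings instead could run a check like hasLoneSurrogate before encoding.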

>> then ECMAScript will have to define its own extension of JSON
> 
> I very much hope we can avoid that. I very much hope that ECMAScript can tolerate that lone surrogates are often if not always a bad idea, even if they may sometimes happen for historical and efficiency reasons.

The ECMAScript specification generally doesn't discuss good or bad ideas; it just specifies the behavior of implementations. The specified behavior is that unpaired surrogates in strings are passed through by both parse() and stringify(), and parse() unescapes escape sequences for all surrogate values, paired or not.
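
The behavior can be observed with a few lines, offered here as a sketch rather than a conformance test. One caveat: engines implementing later editions of the specification emit an escape sequence from stringify() for an unpaired surrogate instead of passing it through raw, so the sketch only checks the round trip and does not depend on the exact stringify() output.

```javascript
// JSON text containing an escape sequence for an unpaired high surrogate.
const escapedText = '"\\uD800"';
const fromEscape = JSON.parse(escapedText);

// parse() unescapes it into a one-code-unit string value.
console.log(fromEscape.length, fromEscape.charCodeAt(0).toString(16));

// JSON text containing a raw, unescaped unpaired low surrogate.
const rawText = '"' + '\uDC00' + '"';
const fromRaw = JSON.parse(rawText);

// parse() passes the code unit through unchanged.
console.log(fromRaw.length, fromRaw.charCodeAt(0).toString(16));

// stringify() followed by parse() gives back the same string value,
// whether stringify() emits the surrogate raw or as an escape sequence.
const roundTrip = JSON.parse(JSON.stringify(fromEscape));
console.log(roundTrip === fromEscape);
```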

Personally, I agree that unpaired surrogates are in most cases a bad idea.

[1] http://ecma-international.org/ecma-262/5.1/#sec-15.12
[2] http://www.ietf.org/mail-archive/web/json/current/msg00831.html

Regards,
Norbert