Re: [Json] Using a non-whitespace separator (Re: Working Group Last Call on draft-ietf-json-text-sequence)

Tim Bray <tbray@textuality.com> Sun, 01 June 2014 05:09 UTC

In-Reply-To: <CAK3OfOidgk13ShPzpF-cxBHeg34s99CHs=bpY1rW-yBwnpPC-g@mail.gmail.com>
References: <CAK3OfOidgk13ShPzpF-cxBHeg34s99CHs=bpY1rW-yBwnpPC-g@mail.gmail.com>
From: Tim Bray <tbray@textuality.com>
Date: Sat, 31 May 2014 22:08:49 -0700
Message-ID: <CAHBU6itr=ogxP4uoj57goEUSOCpsRx1AXVnW1NQwSTPxbbttkw@mail.gmail.com>
To: Nico Williams <nico@cryptonector.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: http://mailarchive.ietf.org/arch/msg/json/xVdQtDQ1KqC6NcbiQ4qHV-jQavw
Cc: Carsten Bormann <cabo@tzi.org>, IETF JSON WG <json@ietf.org>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Paul Hoffman <paul.hoffman@vpnc.org>
Subject: Re: [Json] Using a non-whitespace separator (Re: Working Group Last Call on draft-ietf-json-text-sequence)

On Sun, May 25, 2014 at 4:05 PM, Nico Williams <nico@cryptonector.com> wrote:

> Currently my thinking is that for backwards compatibility reasons I'd
> want to make this RECOMMENDED though, not REQUIRED, except for
> cases where incomplete writes are a potential problem.

No. There should be only one way to do things.

OK, I propose that the code point U+FFFE be used as the
separator in JSON sequences.  (This is the byte-swapped form of U+FEFF,
ZERO WIDTH NO-BREAK SPACE a.k.a. the Byte Order Mark; seeing it means
that you’re reading UTF-16 with the endianness wrong.)  Since
presumably by the time you see a separator you’ve figured out your
byte order, and especially since de facto everything is UTF-8, U+FFFE
just can’t occur.  Also, the Unicode spec is clear that it must never be
interpreted as an abstract character nor interchanged, and it is thus
suitable for use as a separator.  This makes the resync problem
trivial: if you hit a busted JSON text, you drop into a loop like

while (!eof() && (nextCodepoint() != 0xFFFE)) {
  // skip code points until the separator (or end of input)
}
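
A minimal Python sketch of that resync loop, assuming the input has
already been decoded into code points. The names resync and SEPARATOR
are hypothetical, chosen only for illustration:

```python
from typing import Iterator

# Assumption: U+FFFE is the proposed separator between JSON texts.
SEPARATOR = "\ufffe"


def resync(codepoints: Iterator[str]) -> bool:
    """Discard code points up to and including the next separator.

    Returns True if a separator was consumed, False if the input ended
    first. The iterator is left positioned at the start of the next text.
    """
    for ch in codepoints:
        if ch == SEPARATOR:
            return True
    return False
```

After a failed parse, calling resync on the same iterator leaves it at
the first code point of the next JSON text, if there is one.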

So the top-level production is along the lines of

JSON-sequence = JSON-text *( %xfffe JSON-text )
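
As a rough sketch of how a reader and writer for that production might
look in Python (encode_sequence and decode_sequence are made-up names,
and skipping a broken text silently is just one possible recovery
policy, not something the proposal mandates):

```python
import json

# Assumption: U+FFFE is the proposed separator between JSON texts.
SEPARATOR = "\ufffe"


def encode_sequence(values):
    """Join JSON texts with the separator.

    json.dumps escapes non-ASCII characters by default, so a U+FFFE
    inside a string value is written as the escape \\ufffe and cannot be
    mistaken for a separator.
    """
    return SEPARATOR.join(json.dumps(v) for v in values)


def decode_sequence(text):
    """Parse each separator-delimited chunk, skipping any that fail."""
    values = []
    for chunk in text.split(SEPARATOR):
        try:
            values.append(json.loads(chunk))
        except json.JSONDecodeError:
            # Busted JSON text: drop it and resume at the next separator.
            continue
    return values
```

A truncated text in the middle of the sequence costs only that one
text; everything after the next separator still parses.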



> In jq this
> would be an option to either use or maybe not use this new separator.
>
> Another option is to say that encoders MUST use the new separator, but
> parsers MAY/SHOULD/MUST handle sequences with a missing separator (as
> jq does; see below).  jq would still have an encoding option, but when
> not emitting the new separator the result just wouldn't be a JSON text
> sequence.
>
> FWIW, this is what the jq processor does to handle sequences: it reads
> input bytes, feeds them to its parser (which works incrementally, but
> isn't streaming), and passes each parsed output to the jq VM to use as
> an input to the jq program.  Output values of the jq program are
> encoded as JSON texts, printed, and then a newline is printed.
>
> The jq processor has no special handling of newlines on input.  If
> there are any bytes left over from parsing a previous text, they are
> used in the next parse.  Whitespace is just whitespace.
>
> The only special thing that the jq processor does is to print a
> newline after each text on output.
>
> This means that jq can handle JSON text sequences with any whitespace
> separator, and even no separator when there would be no ambiguity:
>
> % /jq -c .<<EOF
> 1 2 true false null"a string""another"[0,1,2
> ]{"foo":"bar"}
> EOF
> 1
> 2
> true
> false
> null
> "a string"
> "another"
> [0,1,2]
> {"foo":"bar"}
> %
>
> I could teach jq how to parse a non-whitespace control character
> separator; that's easy enough.  The question is: how to handle
> backwards compatibility?  The obvious answer is: add an option.  But
> which way should it default?
>
> Nico
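
The separator-free parsing described in the quoted text above can be
approximated in Python with the standard json module. This is only a
sketch of the idea, not jq's actual implementation, and
parse_concatenated is a hypothetical name:

```python
import json


def parse_concatenated(text):
    """Parse back-to-back JSON texts, with or without whitespace between."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            return values
        # Returns the parsed value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
```

Fed the input from the quoted transcript, this yields the same nine
values. As with jq, adjacent texts are only unambiguous when the
boundary is detectable: "1 2" is two numbers, but "12" is one.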



-- 
- Tim Bray (If you’d like to send me a private message, see
https://keybase.io/timbray)