Re: [Json] Using a non-whitespace separator (Re: Working Group Last Call on draft-ietf-json-text-sequence)

On Sun, May 25, 2014 at 4:05 PM, Nico Williams <nico@cryptonector.com> wrote:

> Currently my thinking is that for backwards compatibility reasons I'd
> want to make this RECOMMENDED though, not REQUIRED, except for
> cases where incomplete writes are a potential problem.

No. There should be only one way to do things.

OK, I propose that the code point U+FFFE be used as the
separator in JSON sequences.  (This is the reversed form of the ZERO
WIDTH NO-BREAK SPACE a.k.a. Byte Order Mark character; it means that
if you’re reading UTF-16 you have the endian-ness wrong).  Since
presumably by the time you see a separator you’ve figured out your
byte order, and especially since de facto everything is UTF-8, U+FFFE
just can’t occur. Also the Unicode spec is clear that it must never be
interpreted as an abstract character nor interchanged; and is thus
suitable for use as a separator.  This makes the resync problem
trivial: If you hit a busted JSON text, you drop into a loop like

while (!eof() && (nextCodepoint() != 0xFFFE)) {
  // do nothing: discard code points until the next separator
}
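
To make that loop concrete, here is a rough Python sketch of the resync
step.  The function name skip_to_separator and the use of an in-memory
text stream are my own assumptions for illustration; nothing here comes
from the draft.

```python
import io

SEPARATOR = "\ufffe"  # the proposed separator code point

def skip_to_separator(stream):
    """Discard characters until a separator or end of input.

    Returns True if a separator was consumed, False if input ran out.
    (Hypothetical helper, not part of any existing library.)
    """
    while True:
        ch = stream.read(1)
        if ch == "":            # end of input
            return False
        if ch == SEPARATOR:     # resynchronized: next char starts a new text
            return True

# A truncated JSON text followed by a separator and a good text.
stream = io.StringIO('{"broken": tru\ufffe{"ok": 1}')
found = skip_to_separator(stream)
rest = stream.read()            # whatever follows the separator
```

After the call the stream is positioned at the start of the next JSON
text, so the parser can simply start over from there.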

So the top-level production is along the lines of

JSON-sequence = JSON-text *( %xfffe JSON-text )
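
A reader for that production could be sketched as below, in Python.
This is only an illustration under assumptions: parse_sequence is a
made-up name, the whole sequence is assumed to fit in memory, and
encoders are assumed to never emit a literal U+FFFE inside a JSON
string (they would write the \ufffe escape instead).

```python
import json

SEPARATOR = "\ufffe"  # the proposed separator code point

def parse_sequence(text):
    """Parse every well-formed JSON text in a U+FFFE-separated sequence.

    Broken texts are skipped, which is the resync behaviour described
    above.  (Hypothetical helper for illustration only.)
    """
    values = []
    for chunk in text.split(SEPARATOR):
        try:
            values.append(json.loads(chunk))
        except json.JSONDecodeError:
            continue            # busted text: move on to the next one
    return values

# The third text is truncated; the ones around it still parse.
seq = '1\ufffe{"foo": "bar"}\ufffe[0, 1, 2\ufffe"tail"'
values = parse_sequence(seq)
```

The point is that one damaged text costs you that text and nothing
else.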

> In jq this
> would be an option to either use or maybe not use this new separator.
>
> Another option is to say that encoders MUST use the new separator, but
> parsers MAY/SHOULD/MUST handle sequences with a missing separator (as
> jq does; see below).  jq would still have an encoding option, but when
> not emitting the new separator the result just wouldn't be a JSON text
> sequence.
>
> FWIW, this is what the jq processor does to handle sequences: it reads
> input bytes, feeds them to its parser (which works incrementally, but
> isn't streaming), and passes each parsed output to the jq VM to use as
> an input to the jq program.  Output values of the jq program are
> encoded as JSON texts, printed, and then a newline is printed.
>
> The jq processor has no special handling of newlines on input.  If
> there are any bytes left over from parsing a previous text, they are
> used in the next parse.  Whitespace is just whitespace.
>
> The only special thing that the jq processor does is to print a
> newline after each text on output.
>
> This means that jq can handle JSON text sequences with any whitespace
> separator, and even no separator when there would be no ambiguity:
>
> % /jq -c .<<EOF
> 1 2 true false null"a string""another"[0,1,2
> ]{"foo":"bar"}
> EOF
> 1
> 2
> true
> false
> null
> "a string"
> "another"
> [0,1,2]
> {"foo":"bar"}
> %
>
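
For anyone who wants to try the separator-free behaviour described
above without jq, something similar can be approximated with Python's
standard json module, whose JSONDecoder.raw_decode returns the parsed
value together with the index where parsing stopped.  This is a sketch
of the idea, not a description of how jq is implemented; the name
iter_concatenated is made up.

```python
import json

def iter_concatenated(text):
    """Yield each JSON value from texts concatenated with optional whitespace.

    (Hypothetical helper; raises json.JSONDecodeError on a broken text,
    because without a separator there is no reliable place to resync.)
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # Whitespace between texts is just whitespace: skip it.
        while pos < end and text[pos] in " \t\r\n":
            pos += 1
        if pos == end:
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value

text = '1 2 true false null"a string""another"[0,1,2\n]{"foo":"bar"}'
values = list(iter_concatenated(text))
```

Note that this only works where there is no ambiguity: adjacent
numbers, for example, still need whitespace between them.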
> I could teach jq how to parse a non-whitespace control character
> separator; that's easy enough.  The question is: how to handle
> backwards compatibility?  The obvious answer is: add an option.  But
> which way should it default?
>
> Nico
> --

-- 
- Tim Bray (If you’d like to send me a private message, see
https://keybase.io/timbray)