Re: [Json] Using a non-whitespace separator (Re: Working Group Last Call on draft-ietf-json-text-sequence)

I suggested FF if RS was not acceptable. I made the RS suggestion
because it has the correct semantics for this role. (If you prefer to
call it INFORMATION SEPARATOR TWO on the basis of this being UTF-8,
fine.) It also seemed less likely than FF to be accidentally emitted
or transmitted. (I was also curious how many people would oppose it
just for being déclassé; ASCII ain't that big even before you write
off swathes of it.)

Could be worse, it could have been BEL.

FF is also barred from JSON texts, including in non-significant
whitespace: "whitespace characters are: character tabulation (U+0009),
line feed (U+000A), carriage return (U+000D), and space (U+0020)".
Pagers like "more" do pause on FF, but that behavior may be desirable
when you're dumping large, pretty-printed structures.
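
This is easy to check against a parser that follows the grammar on
whitespace. A minimal sketch, assuming Python's standard json module
as the parser (it is not one of the parsers discussed in this thread)
and a hypothetical helper named parses():

```python
import json

FF = "\x0c"  # form feed, U+000C
RS = "\x1e"  # record separator, U+001E

def parses(text):
    """Return True if Python's json module accepts text as one JSON text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The four permitted whitespace characters are fine around a value.
ok_padding = parses(" \t\r\n[1, 2] \t\r\n")

# FF or RS before, inside, or after the value makes the text invalid.
ff_results = [parses(FF + "[1, 2]"), parses("[1," + FF + "2]"), parses("[1, 2]" + FF)]
rs_results = [parses(RS + "[1, 2]"), parses("[1," + RS + "2]"), parses("[1, 2]" + RS)]

print(ok_padding, ff_results, rs_results)
```

With this parser, FF is rejected in every position, exactly like RS.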

The accidental inclusion of the separator when cutting & pasting is a
reasonable concern. It applies to FF as much as to the other control
characters, though. As a sample, I looked at the Java code for the
json.org parser, Gson, and a parser I wrote: json.org accepts anything
up to U+0020 as "whitespace"; Gson accepts only the four whitespace
characters specified (i.e. from what I could tell it would reject FF
as well); I used Character.isWhitespace(char), which accepts RS and FF
as whitespace. So parser behavior here is hard to rely on.
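
The three behaviors can be compared side by side. This is only a
sketch: the predicates below are my paraphrase of the behavior
described above, not code from json.org or Gson, and Python's
str.isspace() stands in as a rough analogue of Java's
Character.isWhitespace(char). All function names are hypothetical.

```python
JSON_WS = frozenset("\t\n\r ")  # the four characters the JSON grammar allows

def strict_ws(ch):
    """Only the four whitespace characters from the grammar."""
    return ch in JSON_WS

def below_space_ws(ch):
    """Everything up to and including U+0020 counts as whitespace."""
    return ch <= " "

def library_ws(ch):
    """Whatever the general-purpose library predicate calls whitespace."""
    return ch.isspace()

CHARS = {"TAB": "\t", "LF": "\n", "FF": "\x0c", "CR": "\r", "RS": "\x1e", "SP": " "}

# One row per character: (strict, below_space, library).
table = {
    name: (strict_ws(ch), below_space_ws(ch), library_ws(ch))
    for name, ch in CHARS.items()
}

for name, row in table.items():
    print(f"{name:>3}  strict={row[0]!s:<5}  below_space={row[1]!s:<5}  library={row[2]}")
```

FF and RS land in the same bucket under all three predicates: rejected
by the strict one, skipped by the other two.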

Ideally one would choose a printing resync character. But JSON
reserves all printing characters for use in strings. It also reserves
all non-ASCII Unicode characters. The *only* characters that work for
separating arbitrary JSON texts are the ASCII control characters below
U+0020, excluding the three that are allowed as non-significant
whitespace (tab, LF, CR). All of them potentially suffer from cut &
paste issues, because they are of course all illegal in JSON texts.
FF is the only "traditional" whitespace character in that set (i.e.
typically matched by \s in regular expressions) but, as seen above,
that does not mean that it will be accepted by maximally-strict JSON
parsers.
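
Why a control character works as a separator can be sketched as
follows. The assumptions are mine: Python's json module as encoder and
parser, RS as the prefix with a trailing LF, and hypothetical function
names. It is not the draft's reference code.

```python
import json

RS = "\x1e"  # record separator, U+001E

def encode_sequence(values):
    """Prefix each JSON text with RS and end it with LF."""
    return "".join(RS + json.dumps(v) + "\n" for v in values)

def decode_sequence(text):
    """Split on RS; return (parsed values, chunks that failed to parse)."""
    good, bad = [], []
    for chunk in text.split(RS):
        if not chunk.strip(" \t\r\n"):
            continue  # nothing but whitespace, e.g. before the first RS
        try:
            good.append(json.loads(chunk))
        except json.JSONDecodeError:
            bad.append(chunk)  # damaged record; resync at the next RS
    return good, bad

# An RS inside a string value is escaped by the encoder, so a raw RS
# in the stream can only ever be a separator.
escaped = json.dumps("x\x1ey")

# Simulate a truncated write landing between two complete records.
damaged = (
    encode_sequence([{"a": 1}])
    + RS + '{"b": [1, '
    + encode_sequence(["x\x1ey"])
)
good, bad = decode_sequence(damaged)
print(good, bad)
```

The truncated record is isolated and the record after it is still
recovered, because the reader can resync on the next raw RS.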

LF has many advantages beginning with compatibility with existing
line-oriented tools, at the cost of losing the possibility of
pretty-printing and requiring the removal of insignificant newlines
from input JSON texts (which as Nico says can be done mechanically,
without parsing them, so the difficulty is low). Pretty-printing could
be recovered trivially in a filter or in tools. Also, Nico is the one
with experience & code here, and he says it works well.
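
The mechanical removal works because a raw LF or CR can never appear
inside a JSON string (it has to be escaped), so every literal newline
in a valid JSON text is insignificant whitespace. A minimal sketch,
assuming valid input and a hypothetical function name:

```python
import json

def to_single_line(json_text):
    """Drop literal CR and LF from a valid JSON text without parsing it."""
    return json_text.replace("\r", "").replace("\n", "")

# A pretty-printed text whose string value contains an escaped newline.
pretty = json.dumps({"msg": "line1\nline2", "n": [1, 2]}, indent=2)
line = to_single_line(pretty)

print(line)
```

The escaped \n inside the string survives untouched, since it is the
two characters backslash and n rather than a literal newline.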

However, if you use LF to separate, emitters should be required to
emit it, and barred from emitting any other insignificant whitespace
between or inside values, to maximize the usefulness of existing
line-oriented tools. There should not be multiple versions of this
format with different options. I am guessing this will overwhelmingly
be a log format, and dealing with the possibility of truncation and
bad records is always important for logfiles.
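
A sketch of what that strict form could look like, with a reader that
tolerates bad records. The function names are hypothetical, and the
compact separators are just Python's way of suppressing optional
whitespace, not anything the draft specifies.

```python
import json

def emit_line(value):
    """One compact JSON text per line: no whitespace except the final LF."""
    return json.dumps(value, separators=(",", ":")) + "\n"

def read_log(text):
    """Parse line by line; return (parsed values, lines that failed)."""
    good, bad = [], []
    for line in text.split("\n"):
        if line == "":
            continue  # empty tail after the last LF
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)  # truncated or corrupt record; skip it
    return good, bad

log = emit_line({"event": "start"}) + emit_line({"msg": "two\nlines"})
# Simulate a writer that died partway through its last record.
log += emit_line({"event": "stop", "codes": [1, 2]})[:12]

good, bad = read_log(log)
print(good, bad)
```

Every complete record occupies exactly one line, so line-oriented
tools see one record per line and the reader loses only the damaged
one.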

On Thu, Jun 5, 2014 at 1:59 AM, Nico Williams <nico@cryptonector.com> wrote:
> On Thursday, June 5, 2014, Manger, James <James.H.Manger@team.telstra.com>
> wrote:
>>
>> >> JSON-sequence = *( ws %1e JSON-text )
>>
>> RS as a JSON sequence prefix or separator was a bad idea when discussed a
>> month ago and still is.
>
>
> I'm happy to go with the current I-D's LF-based recovery mechanism or
> variant of it.  If that's not acceptable then something like RS has to be
> it.
>
> A third alternative is to abandon logfiles as a use or accept truncated
> writes leading to fatal parse errors from the corrupted point forwards.  A
> fourth alternative is to not publish on the Standards track, maybe publish
> as Informational or Experimental whatever fewer of us think is best.  Before
> we get to any of those alternatives I'd like to try to get consensus.
>
> Remember, it's rough consensus and running code.  I have running code and
> I'm open to changing it, but if some views are mutually exclusive and
> therefore some have to be on the rough side of consensus, then so be it.
>
>
>>
>> * You cannot (easily) enter an RS in notepad.
>
>
> A reason to make it optional, as I proposed -- only logfile writers should
> have to emit it.
>
>>
>> * You cannot (easily) enter an RS in vi.
>
>
> Meh.  !printf '\x1E'.  Also, same answer as above.
>
>>
>> * You cannot see an RS.
>
>
> But that's harmless.  Especially if you know to expect to find it there.
>
>>
>> * An RS causes Chrome to treat a file as binary data, instead of text.
>
>
> That could get fixed (e.g., if Chrome learns about this new MIME type).
>
>>
>> * Cut-n-paste a JSON value with an invisible RS prefix and the result is
>> NOT JSON, ie it will fail with a JSON parser as RS is not allowed in JSON.
>
>
> But you can be careful to not cut the RS.  Or we could make it RS SP, to
> make it easier to find where to cut.
>
>>
>> * No one uses RS.
>
>
> That's not much of an argument :)
>
>>
>> * RS is now labelled INFORMATION SEPARATOR TWO, not RECORD SEPARATOR.
>
>
> Ditto.  And noted.
>
>
>>
>> * We aren't using INFORMATION SEPARATOR ONE, THREE or FOUR.
>
>
> Again.
>
>>
>> * A newline as a JSON value terminator is sufficient to parse a JSON
>> sequence unambiguously.
>
>
> Except in the presence of incompletely-written entries.
>
>>
>> * RS doesn't work well with APIs that read text by the line.
>
>
> Do you have any examples of such APIs?  Define "well".
>
>
>>
>> * Detecting a newline that separates JSON values is more complex than
>> detecting an RS character, but it is not that complex (eg handful of lines
>> of code).
>
>
> Maybe, and that is my preferred solution.  I'm with you there.
>
>>
>> * An RS prefix detects only slightly more cases of accidentally truncated
>> writes (in the middle of a top-level number, in a top-level string in the
>> middle of an escape sequence) -- not enough to be compelling.
>
>
> There are other cases if we have as a goal to recover starting at the very
> next full text following the truncated one.  Preceding and following texts
> with LF also helps recovery in the cases you mention.  But it's not enough.
> The current I-D covers some alternatives that do remove all ambiguities at
> little cost to either the parser or the encoder, including: removing
> internal newlines from texts (cheap operation; no need to parse and
> re-encode) and/or preceding texts with "null" LF.
>
>>
>> * The awkwardness of RS will mean many implementations will be lenient,
>> but leniency becomes "expected" which leads to interop problems.
>
>
> Parsers shouldn't require it.  Encoders should emit it if there's a chance
> of truncated writes.  How can interop problems arise from this formula?
> What am I missing?
>
>>
>> "A JSON sequence is the concatenation of zero or more JSON values, where
>> each JSON value is terminated with a newline."
>>
>> Simple to understand. Simple to write. Simple enough to parse. Simple
>> enough to resync from the middle of a sequence. Almost identical recovery
>> from accidental corruption is possible in almost all the same instances
>> regardless of whether an RS prefix or newline suffix is used.
>
>
> Yes, and I very much like that, right up until one wants to cater to
> logfiles and the truncated write problem.  See alternatives above.  We must
> choose one.  Two camps are squared off and I'm in the middle.  One will win
> or all may lose.
>
> My preference is to say that logfile writers must remove internal newlines
> from the texts they write.  That is by far the simplest fix for write
> truncation.  Not all sequences will be logfile-like.  No RS if we go with
> that.  I haven't seen any strong arguments against that; in fact, you have
> stronger arguments against RS than I have seen against internal newline
> removal by logfile writers.  Let's sleep on it and revisit tomorrow.
>
> Nico