Re: [Json] draft-ietf-json-text-sequence-01 comments

Nico Williams <nico@cryptonector.com> Fri, 09 May 2014 17:22 UTC

In-Reply-To: <255B9BB34FB7D647A506DC292726F6E11545D24F7B@WSMSG3153V.srv.dir.telstra.com>
References: <255B9BB34FB7D647A506DC292726F6E11545D24F7B@WSMSG3153V.srv.dir.telstra.com>
Date: Fri, 09 May 2014 12:21:57 -0500
Message-ID: <CAK3OfOjUgzF4B=rBqmKx2T9SBpA1hgS4Axkw6OC9GXbuaUjGnw@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: http://mailarchive.ietf.org/arch/msg/json/0vAcNdZjUaZ-svBZCSApAMJgvTQ
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] draft-ietf-json-text-sequence-01 comments

On Fri, May 9, 2014 at 3:03 AM, Manger, James
<James.H.Manger@team.telstra.com> wrote:
> Comments on draft-ietf-json-text-sequence-01
>
> Section 2.
>
> The reason to NOT REQUIRE a newline: even if the spec says newlines are required some implementations may accept other sequences when they are unambiguous (such as {"a":1}{"b":2}); some people will rely on that leniency and omit newlines; leading to interop problems with implementations that require newlines as per the spec.
>
> A reason to REQUIRE a newline: it makes it feasible to find the start of the next JSON value from anywhere within a sequence (ie resync).

I think that's right: sequence encoders should be REQUIRED to emit a
newline after any JSON text (and possibly other whitespace).  (FTR, jq
does always emit a newline after any JSON text.)

> The draft 01 rule of requiring a whitespace, but not necessarily a newline, is the worst solution. It still allows interop problems if people rely on lenient implementations. It doesn’t guarantee that resync is possible.

Agreed.  I'll change it.

> Suggested text:
>
>   JSON-sequence = *whitespace *(JSON-text *whitespace %x0A *whitespace)

Is '*whitespace %x0A *whitespace' valid, considering that %x0A is in
whitespace?  A greedy parser might choke on that!

Just in case:

  JSON-sequence = *whitespace *(JSON-text *whitespace2 %x0A *whitespace)
  whitespace = %x20 / %x09 / %x0A / %x0D
  whitespace2 = %x20 / %x09 / %x0D
  JSON-text = <given by RFC7159>

I don't have a similar workaround for the resync heuristic ABNF.
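
For illustration only, here is a minimal Python sketch of an encoder and
a strict parser for the ABNF above.  The function names are made up for
this example, and it works on an in-memory string rather than a stream.
It relies on json.JSONDecoder.raw_decode() to find where each text ends,
then insists on optional whitespace2 followed by %x0A.

```python
import json

WS = " \t\n\r"   # whitespace  = %x20 / %x09 / %x0A / %x0D
WS2 = " \t\r"    # whitespace2 = %x20 / %x09 / %x0D


def encode_sequence(values):
    """Emit each value as one JSON text followed by a newline (%x0A)."""
    # json.dumps() with default settings never emits a raw newline,
    # so every text ends up on exactly one line.
    return "".join(json.dumps(v) + "\n" for v in values)


def decode_sequence(data):
    """Parse a sequence, requiring a newline after every JSON text."""
    decoder = json.JSONDecoder()
    values = []
    pos, end = 0, len(data)
    while True:
        # *whitespace before the next text (or before end of input)
        while pos < end and data[pos] in WS:
            pos += 1
        if pos == end:
            return values
        # raw_decode() returns the value and the index just past it
        value, pos = decoder.raw_decode(data, pos)
        # *whitespace2, then the mandatory %x0A
        while pos < end and data[pos] in WS2:
            pos += 1
        if pos == end or data[pos] != "\n":
            raise ValueError("JSON text not followed by a newline")
        pos += 1
        values.append(value)
```

With this, '{"a":1}{"b":2}' is rejected even though it is unambiguous,
which is the strictness being argued for.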

> Section 3. Use for logfiles
>
> This section would be better if it focussed on being able to detect the start of the next JSON value from anywhere in a JSON sequence, ie resynchronise.

Good point.  It's nice to be able to seek to arbitrary offsets and
find the next entry.  Traditional syslog-style log files allow that.

> This can be useful for random access to a log file, or for recovering after an entry is corrupted (eg truncated due to a power failure part way through a write).

Yes.

> I would drop the details of trying to recover the 1st entry immediately after a truncated entry. I don't think it is feasible in general (eg when values are numbers); you can't even detect some truncations (eg truncating a whole entry); and the "scan backwards .. looking for a newline followed by a valid JSON text" sounds wrong as the previous newline has been truncated. Scanning back (from a boundary) a char at a time until the whole lot parses as valid JSON is probably the best you can do.

Clearly truncation of whole entries and truncation of trailing
whitespace can't be detected this way, but that's to be expected and I
don't think it's a problem.  Being able to scan backwards would be
nice though.  Scanning backwards for boundaries that are followed by
texts that parse is a fallible heuristic unless texts have no internal
newlines (so that's a very good reason to forbid internal newlines in
_logfiles_, but probably not as a general rule).
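
A rough Python sketch of resync from an arbitrary byte offset, assuming
a logfile where texts have no internal newlines (one entry per line).
The name next_entry() is hypothetical.  It skips forward to the next
line boundary, then returns the first line that parses, so a corrupted
entry is skipped rather than being fatal.

```python
import json


def next_entry(f, offset):
    """Return (start_offset, value) for the first entry that starts at or
    after `offset` in binary file `f`, or None if there is none."""
    if offset > 0:
        # Back up one byte: if that byte is a newline, an entry starts
        # exactly at `offset`; otherwise this discards the partial line.
        f.seek(offset - 1)
        f.readline()
    else:
        f.seek(0)
    while True:
        start = f.tell()
        line = f.readline()
        if not line:
            return None              # end of file
        if not line.endswith(b"\n"):
            return None              # final entry lacks its newline: truncated
        if not line.strip():
            continue                 # blank line between entries
        try:
            return start, json.loads(line)
        except ValueError:
            continue                 # corrupted entry; try the next line
```

Requiring the trailing newline matters for values like numbers, where a
truncated entry can still parse as a (different) valid text.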

> Section 4. Security considerations
>
> The 3rd paragraph no longer applies as whitespace (or hopefully newline) is always required.

I don't agree that it no longer applies.  If you have a json_loads() type
function that wants to be fed a complete text, and if input texts have
internal newlines, then you might have to repeatedly try json_loads(),
first with one line, then two, and so on.  This is clearly not a good
idea, and that's what that paragraph warns about.
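
To make the cost concrete, here is a small Python sketch of that retry
loop, with Python's json.loads() standing in for json_loads() and a
made-up function name.  Every attempt re-parses the whole buffer from
the start, so a text spread over k lines costs k parse attempts over
ever-growing input.

```python
import json


def loads_by_growing_lines(lines):
    """Yield (value, attempts) per text, retrying as lines accumulate."""
    buf = []
    attempts = 0
    for line in lines:
        buf.append(line)
        attempts += 1
        try:
            # Re-parse everything buffered so far, from the beginning.
            value = json.loads("".join(buf))
        except ValueError:
            continue                 # not a complete text yet; add a line
        yield value, attempts
        buf = []
        attempts = 0
```

Feeding it a pretty-printed object that spans seven lines takes seven
attempts to get one value.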

> Section 5. IANA considerations
>
> As we are specifying a media type, why not define a fragment id. For instance,
> a fragment id that is a decimal integer n (matching 1*DIGIT) selects the nth entry in the sequence.
>
> Typo: change NL to %x0A in the ABNF.
> Typo: "form a logical end" to "from a logical end"

Finding the nth entry might require a linear search for it!  Finding
the first entry from an arbitrary byte offset into the sequence does
not.

(I've been thinking for a while that I'd like to have a SQLite3
virtual table plugin for use with logfiles that assigns rows an
integer rowid that corresponds to the byte offset into the logfile
where the corresponding log entry is found.  This allows table scans,
and it allows for indexing by rowid.  Then one could create an
auxiliary index mapping any date to the rowid of the first log entry
for that day...  This would be very fast and handy, and it would
make it possible to use SQL for querying logfiles without having to
build/use keyword indexes.  Whereas assigning log entries logical,
sequential rowids would be much slower!)
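
A very rough Python sketch of the auxiliary-index part of that idea.
This is not a virtual table plugin: it uses an ordinary SQLite table via
Python's sqlite3 module, and it assumes one entry per line with a
hypothetical "ts" member holding an ISO 8601 timestamp.  The table and
function names are made up too.

```python
import json
import sqlite3


def build_day_index(f, db):
    """Map each day to the byte offset of its first entry in logfile `f`."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS day_index "
        "(day TEXT PRIMARY KEY, first_offset INTEGER)"
    )
    f.seek(0)
    while True:
        offset = f.tell()            # byte offset doubles as the row id
        line = f.readline()
        if not line:
            break
        try:
            entry = json.loads(line)
        except ValueError:
            continue                 # blank or corrupted line
        if isinstance(entry, dict) and isinstance(entry.get("ts"), str):
            day = entry["ts"][:10]   # "YYYY-MM-DD" prefix of the timestamp
            # OR IGNORE keeps only the first offset seen for each day
            db.execute(
                "INSERT OR IGNORE INTO day_index VALUES (?, ?)", (day, offset)
            )


def first_entry_of_day(f, db, day):
    """Seek straight to the first entry for `day`; no scan of the log."""
    row = db.execute(
        "SELECT first_offset FROM day_index WHERE day = ?", (day,)
    ).fetchone()
    if row is None:
        return None
    f.seek(row[0])
    return json.loads(f.readline())
```

Building the index is one linear pass; after that a lookup is one index
probe plus one seek.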

Nico
--