Re: [Json] Proposed minimal change for duplicate names in objects

Tatu Saloranta <tsaloranta@gmail.com> Sun, 07 July 2013 17:57 UTC

To: Tim Bray <tbray@textuality.com>
Cc: Nico Williams <nico@cryptonector.com>, Jim Schaad <ietf@augustcellars.com>, "json@ietf.org" <json@ietf.org>

On Sat, Jul 6, 2013 at 9:58 PM, Tim Bray <tbray@textuality.com> wrote:

> I’ll assume you’re right when you say dupe detection has been measured as
> expensive at run time, but I’m thinking that if I were writing a reader I'd
> implement a hash table with a test-and-set method, so I admit I'm surprised
> by the finding.  I think I’d need to see a little more research before I’d
> accept that as a given.  -T
>
>
The point is that no hash table need be created at all. All one has is a
simple tokenizer (lexer). No tree is built; there is just a parent stack
with enough information to match close markers and report error locations,
plus a bit of state to validate well-formedness with respect to separators
(see the sketch below).
I agree that if a per-document symbol table/map were maintained anyway, the
extra cost of duplicate detection would probably be negligible and not
worth worrying too much about.
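To make that concrete, here is roughly all the state such a tokenizer needs
per nesting level (a hypothetical, simplified sketch, not Jackson's actual
code):

  // Minimal streaming-tokenizer state: a stack of open-container markers,
  // enough to match close markers and report nesting errors. Note that no
  // per-object name table exists anywhere.
  import java.util.ArrayDeque;
  import java.util.Deque;

  class MinimalTokenizerContext {
      private final Deque<Character> parents = new ArrayDeque<>();

      void openObject() { parents.push('{'); }
      void openArray()  { parents.push('['); }

      // On '}' or ']', verify the marker matches the innermost open container.
      void close(char marker) {
          char expected = (marker == '}') ? '{' : '[';
          if (parents.isEmpty() || parents.pop() != expected) {
              throw new IllegalStateException("mismatched close marker: " + marker);
          }
      }
  }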

A hash table will typically be used at a higher level, when building an
object representation such as a tree model or native objects: either as the
structure being built (tree), or as a mapping from names to object fields
(name to method/field). At that level catching duplicates is easier and
more efficient, and I would not be worried about mandating such handling.
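For example, when building a tree model, the map under construction doubles
as the duplicate detector, so the check is nearly free (a hypothetical
sketch, not any particular library's code):

  // When building a tree model, the Map being constructed is itself the
  // duplicate detector: Map.put returns the previous value, so a non-null
  // result means the name was already present in this object.
  import java.util.LinkedHashMap;
  import java.util.Map;

  class TreeObjectBuilder {
      private final Map<String, Object> fields = new LinkedHashMap<>();

      void put(String name, Object value) {
          if (fields.put(name, value) != null) {
              throw new IllegalArgumentException("duplicate field name: " + name);
          }
      }
  }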

So why separate layers like this? While writing a naive tokenizer is an
easy task, it takes much more effort to create an efficient one. But there
are not many variations at that level, so once an efficient tokenizer
exists, it is easy to use it as a building block, save time, and have some
confidence that basic encoding/decoding works correctly and efficiently.
Conversely, higher-level components vary a lot in functionality.
This is similar to the division between the low-level SAX and StAX APIs and
the higher-level DOM (tree) and JAXB (data binding) APIs for XML on the
Java platform. I think C#/.NET has a similar separation between pull
parsing and higher-level abstractions, but it seems absent from most
scripting languages.
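With Jackson, for example, the two layers look roughly like this (Jackson
2.x package and method names, given as illustration rather than a
definitive recipe):

  import com.fasterxml.jackson.core.JsonFactory;
  import com.fasterxml.jackson.core.JsonParser;
  import com.fasterxml.jackson.core.JsonToken;
  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class TwoLayers {
      public static void main(String[] args) throws Exception {
          String json = "{\"name\":\"value\"}";

          // Low-level streaming (StAX-style pull) access: one token at a time,
          // no document-wide data structure built.
          JsonParser p = new JsonFactory().createParser(json);
          while (p.nextToken() != null) {
              if (p.getCurrentToken() == JsonToken.FIELD_NAME) {
                  System.out.println("field: " + p.getCurrentName());
              }
          }
          p.close();

          // Higher-level tree model (DOM-style), built on the same streaming parser.
          JsonNode root = new ObjectMapper().readTree(json);
          System.out.println("value: " + root.get("name").asText());
      }
  }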

As to research, I don't know if this --
https://github.com/eishay/jvm-serializers/wiki -- helps, but contrasting
these two entries:

                           create     ser   deser   total   size  +dfl

  json/jackson/manual           62    1137    1519    2656    468   253
  json/jackson/tree             63    2045    2650    4695    485   259

  (times in nanoseconds; sizes in bytes)

gives the difference between manually written data binding (streaming
parser plus hand-written code to match names to fields) and a simple
HashMap-based tree model (built on the same streaming parser) as a 2-to-1
performance difference (2656 nsec vs 4695 nsec total).
Part of this is the overhead of Maps vs. plain Java objects, but not all.

I could prototype actual detection too (in fact, there is an RFE for the
streaming parser to do just that -- and I consider it a legitimate thing to
offer as an optional setting) and measure how much overhead it actually
adds to a highly tuned, minimal-state parser on the Java platform.
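The shape of such a prototype would be roughly as follows (a hypothetical
sketch, not the actual RFE implementation): one small set of seen names per
open object, allocated only when the optional check is enabled.

  // Opt-in duplicate detection in a streaming parser: a HashSet of seen
  // names per open object, pushed and popped alongside the parent stack,
  // allocated only when the feature is turned on.
  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.HashSet;

  class DupCheckingContext {
      private final boolean checkDuplicates; // off by default: opt-in setting
      private final Deque<HashSet<String>> seenNames = new ArrayDeque<>();

      DupCheckingContext(boolean checkDuplicates) {
          this.checkDuplicates = checkDuplicates;
      }

      void openObject()  { if (checkDuplicates) seenNames.push(new HashSet<>()); }
      void closeObject() { if (checkDuplicates) seenNames.pop(); }

      void fieldName(String name) {
          // HashSet.add returns false if the name was already seen in this object.
          if (checkDuplicates && !seenNames.peek().add(name)) {
              throw new IllegalStateException("duplicate field name: " + name);
          }
      }
  }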

-+ Tatu +-


>
> On Sat, Jul 6, 2013 at 9:37 PM, Tatu Saloranta <tsaloranta@gmail.com> wrote:
>
>> On Sat, Jul 6, 2013 at 7:57 PM, Nico Williams <nico@cryptonector.com> wrote:
>>
>>> On Sat, Jul 6, 2013 at 8:44 PM, Tim Bray <tbray@textuality.com> wrote:
>>> > This feels like a no-brainer to me, but that’s probably because (as
>>> > I’ve said before) I’m an API guy, and the only use for JSON objects
>>> > in my world is to transfer a hash table or database record or
>>> > whatever from here to there, and in such a situation dupes can never
>>> > be useful or intended and can only be a symptom of breakage (or, in
>>> > the JOSE case, a symptom of a malicious attack on my crypto).
>>>
>>> I agree.  As a security guy I would prefer that one way or another we
>>> end up with no dup names, but as an "API guy" myself I think of the
>>> streaming parsers (they offer an API, after all).  Just say the magic
>>> words: "to hell with minimal-state streaming parsers", or perhaps
>>> something to the effect that *some* component of a layered application
>>> MUST reject objects with dup names.  It's either/or.  Let's choose.
>>>
>>> I'm happy with "some component of a layered application MUST reject
>>> objects with duplicate names" -- I prefer this to the "no minimal
>>> state streaming parsers" alternative.
>>>
>>> I will assume that in general objects rarely have lots of names, so
>>> that parsers need not keep that much state in order to check for dups.
>>> Requiring parsers to reject objects with dup names is my second
>>> choice.
>>>
>>
>> Just to make sure: I also do not have any use for duplicates, and
>> consider them flaws in processing. I have never used duplicates for
>> anything, nor do I find them an interesting approach.
>> The only real concern is that of mandating (or not) detection and/or
>> prevention in the lowest-level components of commonly used processing
>> stacks (a low-level push/pull parser, plus higher-level library or app
>> code that builds full representations), since at the lowest level this is
>> a significant cost, based on extensive profiling I have done.
>>
>> The case of application code directly using streaming parsers/generators
>> is not nearly as common as that of frameworks using them to produce
>> higher-level abstractions.
>> These higher-level abstractions (JSON tree representations, binding to
>> native objects) either report errors such as duplicates or, at the very
>> least, can detect them and apply consistent handling. They can do this
>> much more efficiently than low-level components, since they have to build
>> representations anyway, and those representations then serve as the data
>> structures for detecting/preventing duplicates.
>>
>> My specific concern is this: if the specification mandates detection
>> and/or prevention for parsers and generators without any mention that
>> 'parser' and 'generator' are logical concepts (the thing that reads JSON,
>> the thing that writes JSON), I will have lots of users who promptly
>> demand that the low-level components enforce the checks.
>> And then I get to spend lots of time explaining why such checks can (and
>> IMO should) still be pushed to a higher level of processing. It is
>> amazing how much FUD can be generated from a cursory reading of a
>> specification.
>>
>> This concern extends to the suggested "Internet JSON messages" specification.
>>
>> I would like either a simple statement that some component of the
>> processing chain should/must detect and report duplicates, and prevent
>> producing them; or a lengthier explanation of what "parser" and
>> "generator" mean ("parser" is such a horrible misnomer -- there is very,
>> very little parsing involved; it's just a lexer plus an optional object
>> builder -- but I digress).
>>
>> -+ Tatu +-
>>
>>
>