Re: [apps-discuss] Concise Binary Object Representation (CBOR)

Phillip Hallam-Baker <hallam@gmail.com> Sun, 26 May 2013 00:40 UTC

In-Reply-To: <CAMm+LwjBWNLPoU+ity+uwY-fNztLspOtfk3HY22OUsXmH+EjJw@mail.gmail.com>
References: <61CB1D18-BABC-4C77-93E6-A9E8CDA8326B@vpnc.org> <CAK3OfOiwE0W=AYCtXh7W1RtrvMC4a1KhNDut=tD1ma+ipRrvHw@mail.gmail.com> <CAMm+LwjBWNLPoU+ity+uwY-fNztLspOtfk3HY22OUsXmH+EjJw@mail.gmail.com>
Date: Sat, 25 May 2013 20:40:31 -0400
Message-ID: <CAMm+LwgavLoTsvAeBXq8jznHLOopqMFwAbmpZ5er0T3P=KygiA@mail.gmail.com>
From: Phillip Hallam-Baker <hallam@gmail.com>
To: Nico Williams <nico@cryptonector.com>
Content-Type: multipart/alternative; boundary="f46d043bd77682078e04dd94477d"
Cc: Paul Hoffman <paul.hoffman@vpnc.org>, General discussion of application-layer protocols <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Concise Binary Object Representation (CBOR)

Just a thought.

JSON-C would meet most, if not all, of the requirements for HTTP/2.0 encoding.

It is easy to parse and emit, and it is easy to adapt an existing JSON stack
to emit JSON-C. It is efficient in coding and in space, and a Web service can
use the same stack for HTTP framing and for the payload.


On Sat, May 25, 2013 at 12:13 PM, Phillip Hallam-Baker <hallam@gmail.com> wrote:

> On Fri, May 24, 2013 at 11:28 PM, Nico Williams <nico@cryptonector.com> wrote:
>
>> Thinking about what *I* want in a binary JSON encoding, and taking
>> into account PHB's points about online encoding and decoding, all I
>> really want, the one thing I badly want, is counted-bytes and chunked
>> Unicode and octet strings.  That's it.  So picture JSON, complete with
>> square and curly brackets, but no commas nor colons, just strings
>> (byte-counted or chunked), some encoding for numbers, booleans, and
>> null.  It's the handling of scalars that sucks about JSON: string
>> escaping, number printing and parsing.
>>
>> Something like this:
>>
>> {<Unicode string of length 3>foo<Unicode string of length 3>bar<join
>> to preceding string, length 3>baz<Unicode string of length
>> 3>num<integer value 5>}
>>
>> as an encoding of { "foo": "barbaz", "num": 5 }, with "barbaz" chunked.
>>
>> One of the nice things about such an encoding is that it should be
>> possible to implement as a fairly small variation on existing code:
>> it's almost only a different way of encoding scalar types -- the only
>> other difference being that commas and colons are not needed.
>>
>> With a variable-length encoding of integers and IEEE 754 64-bit
>> doubles for reals... that's compact enough.  Not nearly as compact as
>> we could get with schemas and PER-like encodings, but good enough for
>> a schema-less encoding.
>>
>
>
> I like this proposal. It is more a sort of JSON-B, as in JSON Encoding B.
>
> The only con I can think of is that it will still be backwards
> incompatible, so why not be more compact? And people might worry about
> JSON-B getting confused with JSON.
>
>
> For my requirements, I think I would still like to be able to binary
> encode floating point numbers. The issues there are the precision loss from
> round tripping through decimal text and the ability to encode NaN and
> +/- infinity.
>
> But we do have the entire range of byte values from 128 up (everything
> outside 7-bit ASCII) to play with (actually we have more, but 128 and up
> is plenty).
>
>
> The result would be slightly less efficient than a true binary JSON but
> not by very much. The tags will still be there.
>
> JSON tagged items have an overhead of 4 bytes per entry:  "<tag>":<data>,
>
> (You can add in spaces but they aren't necessary.)
>
> It would not be difficult to reduce those 4 bytes to one. But it isn't a
> big win either. The win comes from not having to Base64 the binary chunks.
>
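To put a rough number on the Base64 point, here is a minimal Python sketch. It assumes the 8 bit length form (one code byte plus one length byte) proposed later in this message, and the function names are made up for illustration.

```python
import base64


def json_base64_size(raw: bytes) -> int:
    """Bytes needed to carry `raw` as a Base64 JSON string, quotes included."""
    return len(base64.b64encode(raw)) + 2


def counted_binary_size(raw: bytes) -> int:
    """Bytes needed with a one-byte type code and a one-byte length prefix."""
    if len(raw) > 0xFF:
        raise ValueError("this sketch only covers the 8 bit length form")
    return 2 + len(raw)


if __name__ == "__main__":
    payload = bytes(200)
    # Base64 expands every 3 input bytes to 4 output characters.
    print(json_base64_size(payload), counted_binary_size(payload))
```

For a 200 byte chunk that is 270 bytes as a Base64 JSON string against 202 bytes length-counted, so the saving grows with the size of the binary data rather than with the number of tags.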
>
> For purposes of planning tags and making sure there is enough space, it is
> probably best to start off expansively and consider all the possible
> types we might recognize:
>
> x80    Terminal String 8 bit length
> x81    Terminal String 16 bit length
> x82    Terminal String 32 bit length
> x83    Terminal String 64 bit length
>
> x84    Non Terminal String Chunk 8 bit length
> x85    Non Terminal String Chunk 16 bit length
> x86    Non Terminal String Chunk 32 bit length
> x87    Non Terminal String Chunk 64 bit length
>
> x88    Terminal Binary 8 bit length
> x89    Terminal Binary 16 bit length
> x8A    Terminal Binary 32 bit length
> x8B    Terminal Binary 64 bit length
>
> x8C    Non Terminal Binary Chunk 8 bit length
> x8D    Non Terminal Binary Chunk 16 bit length
> x8E    Non Terminal Binary Chunk 32 bit length
> x8F    Non Terminal Binary Chunk 64 bit length
>
> x90    IEEE 754 Floating Point binary16  (1)
> x91    IEEE 754 Floating Point binary32
> x92    IEEE 754 Floating Point binary64
> x94    IEEE 754 Floating Point binary128  (1)
>
> x96    IEEE 754 Floating Point decimal32 (1)
> x97    IEEE 754 Floating Point decimal64 (1)
> x98    IEEE 754 Floating Point decimal128  (1)
>
> xA0    Unsigned Integer 8 (1)
> xA1    Unsigned Integer 16 (1)
> xA2    Unsigned Integer 32 (1)
> xA3    Unsigned Integer 64 (1)
> xA4    Unsigned Integer 128 (1)
>
> xA5    Signed Integer 8 (1)
> xA6    Signed Integer 16 (1)
> xA7    Signed Integer 32 (1)
> xA8    Signed Integer 64 (1)
> xA9    Signed Integer 128 (1)
>
> xAA    True
> xAB    False
> xAC    Null
>
>
> (1) The need to implement these codes is debatable. But the tag assignment
> scheme should not foreclose the possibility.
>
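A rough Python sketch of how an encoder might use the table above. The code values come from the table, reading the Terminal Binary codes as the run x88 to x8B. The big-endian byte order for lengths and floats, the choice of the shortest length form that fits, and all function names are assumptions, since none of that is specified here.

```python
import struct

# (code, struct format for the length field); big-endian is an assumption.
STRING_CODES = [(0x80, ">B"), (0x81, ">H"), (0x82, ">I"), (0x83, ">Q")]
BINARY_CODES = [(0x88, ">B"), (0x89, ">H"), (0x8A, ">I"), (0x8B, ">Q")]


def _counted(codes, payload: bytes) -> bytes:
    """Emit code + length + payload using the shortest length field that fits."""
    for code, fmt in codes:
        limit = 1 << (8 * struct.calcsize(fmt))
        if len(payload) < limit:
            return bytes([code]) + struct.pack(fmt, len(payload)) + payload
    raise ValueError("payload too long for a 64 bit length")


def encode_string(text: str) -> bytes:
    """Terminal String: the length counts UTF-8 bytes, not characters."""
    return _counted(STRING_CODES, text.encode("utf-8"))


def encode_binary(data: bytes) -> bytes:
    """Terminal Binary: raw octets, no Base64 step."""
    return _counted(BINARY_CODES, data)


def encode_double(value: float) -> bytes:
    """x92: IEEE 754 binary64, so NaN and infinities survive unchanged."""
    return b"\x92" + struct.pack(">d", value)


def encode_literal(value) -> bytes:
    """xAA / xAB / xAC for True / False / Null."""
    if value is True:
        return b"\xAA"
    if value is False:
        return b"\xAB"
    if value is None:
        return b"\xAC"
    raise TypeError("not a JSON literal")
```

With these assumptions, encode_string("data") gives x80 x04 x64 x61 x74 x61, the same bytes used for "data" in the JSON-C example further down.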
> That still leaves a block of 72 codes unused. So if people wanted to go
> back and do binary tagging at a later date (JSON-C) that would be possible.
>
> I did try to work out some sort of clever bit mask trick so that the lower
> bits of the code would specify the number of additional bytes to follow, but
> that does not work so well because there are 128 bit values to consider. And
> at the end of the day there are only going to be 60-odd codes at most, so a
> lookup array works fine.
>
>
> I considered Nico's proposal of a 'continuation block', but that would
> require a reader to read in the start of the next block before it can know
> what to do with the previous data. When the writer starts to write out a
> chunk, it should know whether more chunks might follow. If it turns out
> that no more data follows, the writer just puts out x80 x00 or x88 x00 to
> close the stream.
>
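The chunking rule can be sketched in Python as below. This is only an illustration: it uses the 8 bit length codes alone (x84 for a non-terminal string chunk, x80 for the terminal one), always closes with the empty terminal string x80 x00, and the function names are invented.

```python
import io

NON_TERMINAL_STRING_8 = 0x84
TERMINAL_STRING_8 = 0x80


def write_chunked_string(pieces) -> bytes:
    """Emit each piece as non-terminal chunks, then x80 x00 to close."""
    out = bytearray()
    for piece in pieces:
        data = piece.encode("utf-8")
        # Split anything longer than an 8 bit length can describe.
        for start in range(0, len(data), 255):
            part = data[start:start + 255]
            out += bytes([NON_TERMINAL_STRING_8, len(part)]) + part
    out += bytes([TERMINAL_STRING_8, 0])
    return bytes(out)


def read_chunked_string(stream) -> str:
    """Collect chunks until a terminal one; each code says whether more follow."""
    buf = bytearray()
    while True:
        header = stream.read(2)
        if len(header) != 2:
            raise ValueError("truncated chunk header")
        code, length = header
        if code not in (NON_TERMINAL_STRING_8, TERMINAL_STRING_8):
            raise ValueError("not an 8 bit string chunk")
        part = stream.read(length)
        if len(part) != length:
            raise ValueError("truncated chunk body")
        buf += part
        if code == TERMINAL_STRING_8:
            # Decode only once complete: a chunk may split a UTF-8 sequence.
            return buf.decode("utf-8")


if __name__ == "__main__":
    wire = write_chunked_string(["bar", "baz"])
    print(wire.hex(" "), read_chunked_string(io.BytesIO(wire)))
```

The reader never has to look ahead: the code on each chunk already says whether another chunk follows.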
>
>
> [To follow the rest it might help to look at the diagrams in
> http://www.json.org/]
>
> There are a few tricks that could be used here to further reduce space.
> Consider the production "<tag>":<data>,
>
> The initial " is not really needed since the only valid productions
> following an object open brace are the close brace or an element entry. So
> " is only needed if you desperately want to have the possibility of a tag
> '}drop tables'. OK now I get it, leave the initial " in for Randall Munroe
> http://xkcd.com/327/
>
> If the reader sees a code above 7F it can only be binary data, so the ":
> separator between the tag and the data is superfluous and could be elided.
>
> The binary data descriptions all have a defined length, so the terminal ,
> is not needed.
>
>
> So if people wanted to, we could adjust the FSR to allow these
> abbreviations in JSON-B and save the four bytes per object entry overhead.
> But that would still leave the tags there and those can't be eliminated
> without some sort of dictionary, either being passed on the wire (which is
> what compressors are really doing) or out of band (which is what schema
> aware is really about).
>
>
> So for JSON-C we would need ways to define a binding of a tag to a code
> and ways of specifying those codes in object productions.
>
> Keeping the initial " helps us here because it means that the only
> currently valid characters after the initial { are ", } and whitespace.
>
>
> We might just conceivably need more than 64K codes, so the code value is
> potentially a 32 bit space.
>
>
> So the codes I would define are:
>
> C0   8 bit tag code follows
> C1   16 bit tag code follows
> C2   32 bit tag code follows
>
> C4    8 bit definition follows
> C5    16 bit definition follows
> C6    32 bit definition follows
>
> C7    8 bit tag with definition follows
> C8    16 bit tag with definition follows
> C9    32 bit tag with definition follows
>
>
> The codes C4-C9 would be followed by a string definition. So the first
> occurrence of { "foo" : "data" } would become:
>
> x7B               {
> xC7 x01 x80 x03 x66 x6f x6f    "foo":     [Code 1]
> x80 x04 x64 x61 x74 x61          "data"
> x7D               }
>
>
> On the second occurrence the definition is already there:
>
> x7B               {
> xC0 x01  [Code 1]
> x80 x04 x64 x61 x74 x61          "data"
> x7D               }
>
> An implementation could simply dump the dictionary out at the start of the
> message using the C4-C6 codes.
>
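As a sanity check on the two byte sequences above, here is a small Python sketch of a writer that keeps the tag dictionary. It only handles the 8 bit forms (xC7 to define a tag and use it, xC0 to reuse it) and string values under 256 bytes. The class name and the policy of numbering codes from 1 in order of first use are assumptions.

```python
class TagDictionaryWriter:
    """Hypothetical JSON-C writer: define each tag once, then send its code."""

    def __init__(self):
        self.codes = {}  # tag string -> 8 bit code

    def encode_tag(self, tag: str) -> bytes:
        code = self.codes.get(tag)
        if code is not None:
            return bytes([0xC0, code])  # 8 bit tag code follows
        code = len(self.codes) + 1
        name = tag.encode("utf-8")
        if code > 0xFF or len(name) > 0xFF:
            raise ValueError("this sketch only covers the 8 bit forms")
        self.codes[tag] = code
        # xC7: 8 bit tag with definition; the definition is a terminal string.
        return bytes([0xC7, code, 0x80, len(name)]) + name

    def encode_object(self, obj: dict) -> bytes:
        out = bytearray(b"{")  # x7B
        for tag, value in obj.items():
            data = value.encode("utf-8")
            if len(data) > 0xFF:
                raise ValueError("this sketch only covers 8 bit lengths")
            out += self.encode_tag(tag)
            out += bytes([0x80, len(data)]) + data
        out += b"}"  # x7D
        return bytes(out)


if __name__ == "__main__":
    writer = TagDictionaryWriter()
    print(writer.encode_object({"foo": "data"}).hex(" "))
    print(writer.encode_object({"foo": "data"}).hex(" "))
```

The first call yields the 15 byte form with the definition and the second the 10 byte form, against 14 bytes for {"foo":"data"} as plain JSON, so the dictionary only pays off once a tag repeats.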
>
> --
> Website: http://hallambaker.com/
>



-- 
Website: http://hallambaker.com/