Re: [apps-discuss] Concise Binary Object Representation (CBOR)

Phillip Hallam-Baker <hallam@gmail.com> Sat, 25 May 2013 16:13 UTC

Return-Path: <hallam@gmail.com>
X-Original-To: apps-discuss@ietfa.amsl.com
Delivered-To: apps-discuss@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6EE7A21F8B60 for <apps-discuss@ietfa.amsl.com>; Sat, 25 May 2013 09:13:17 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.016
X-Spam-Level:
X-Spam-Status: No, score=-1.016 tagged_above=-999 required=5 tests=[AWL=-1.583, BAYES_00=-2.599, FF_IHOPE_YOU_SINK=2.166, HTML_MESSAGE=0.001, J_BACKHAIR_33=1, NO_RELAYS=-0.001]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 2DSHxuY+idvq for <apps-discuss@ietfa.amsl.com>; Sat, 25 May 2013 09:13:16 -0700 (PDT)
Received: from mail-we0-x236.google.com (mail-we0-x236.google.com [IPv6:2a00:1450:400c:c03::236]) by ietfa.amsl.com (Postfix) with ESMTP id 6300A21F8A6B for <apps-discuss@ietf.org>; Sat, 25 May 2013 09:13:15 -0700 (PDT)
Received: by mail-we0-f182.google.com with SMTP id q57so3123796wes.13 for <apps-discuss@ietf.org>; Sat, 25 May 2013 09:13:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=5DtW2hArUQDqHB7rMmTxqYnFcjJyGA9DFbhRtT1l08c=; b=Mx4O1RGgk2+c1myI9j5w0UxXOsxmwHRjWO0OupyRlnJOLWhtNSPGSzoJkgxoeXSjq3 9aIQsN66o6MdKcijPZyD9eLwqPhAIVmL2a4xp22nF2qXRM/uklxV9Dfdx8xtomZ9QWKV un29feAyoIFjD0ett17gVj58kx0CZLQlgUqL6u4kJ+KzaujnCO6hlFLS/mz3zvUY9xPx YNvZxiKwtpqzC9OdGJK8W2mb20Pi9edL73byUSMPqCmUhzq1bQz4cGu2XRn4HiydId4A rMb79mWkZ2GhqUhK+XxZtB4g69WFkf7h/3G3zmmXL3PuGRe1Tj6656nOn/RTmtLm1/l0 U1dQ==
MIME-Version: 1.0
X-Received: by 10.180.205.200 with SMTP id li8mr2745939wic.15.1369498394453; Sat, 25 May 2013 09:13:14 -0700 (PDT)
Received: by 10.194.123.225 with HTTP; Sat, 25 May 2013 09:13:14 -0700 (PDT)
In-Reply-To: <CAK3OfOiwE0W=AYCtXh7W1RtrvMC4a1KhNDut=tD1ma+ipRrvHw@mail.gmail.com>
References: <61CB1D18-BABC-4C77-93E6-A9E8CDA8326B@vpnc.org> <CAK3OfOiwE0W=AYCtXh7W1RtrvMC4a1KhNDut=tD1ma+ipRrvHw@mail.gmail.com>
Date: Sat, 25 May 2013 12:13:14 -0400
Message-ID: <CAMm+LwjBWNLPoU+ity+uwY-fNztLspOtfk3HY22OUsXmH+EjJw@mail.gmail.com>
From: Phillip Hallam-Baker <hallam@gmail.com>
To: Nico Williams <nico@cryptonector.com>
Content-Type: multipart/alternative; boundary="001a11c382c6526d9904dd8d31cc"
Cc: Paul Hoffman <paul.hoffman@vpnc.org>, General discussion of application-layer protocols <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] Concise Binary Object Representation (CBOR)
X-BeenThere: apps-discuss@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: General discussion of application-layer protocols <apps-discuss.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/apps-discuss>, <mailto:apps-discuss-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/apps-discuss>
List-Post: <mailto:apps-discuss@ietf.org>
List-Help: <mailto:apps-discuss-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/apps-discuss>, <mailto:apps-discuss-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sat, 25 May 2013 16:13:17 -0000

On Fri, May 24, 2013 at 11:28 PM, Nico Williams <nico@cryptonector.com>wrote:

> Thinking about what *I* want in a binary JSON encoding, and taking
> into account PHB's points about online encoding and decoding, all I
> really want, the one thing I badly want, is counted-bytes and chunked
> Unicode and octet strings.  That's it.  So picture JSON, complete with
> square and curly brackets, but no commas nor colons, just strings
> (byte-counted or chunked), some encoding for numbers, booleans, and
> null.  It's the handing of scalars that sucks about JSON: string
> escaping, number printing and parsing.
>
> Something like this:
>
> {<Unicode string of length 3>foo<Unicode string of length 3>bar<join
> to preceding string, length 3>baz<Unicode string of length
> 3>num<integer value 5>}
>
> as an encoding of { "foo": "barbaz", "num": 5 }, with "barbaz" chunked.
>
> One of the nice things about such an encoding is that it should be
> possible to implement as a fairly small variation on existing code:
> it's almost only a different way of encoding scalar types -- the only
> other difference being that commas and colons are not needed.
>
> With a variable-length encoding of integers and IEEE 754 64-bit
> doubles for reals... that's compact enough.  Not nearly as compact as
> we could get with schemas and PER-like encodings, but good enough for
> a schema-less encoding.
>


I like this proposal, more a sort of JSON-B as in JSON Encoding B

The only Con I can think of is that it will still be backwards incompatible
so why not be more compact? And people might worry about JSON-B getting
confused with JSON.


For my requirements, I think I would still like to be able to binary encode
floating point numbers. The issue there is the precision loss from round
tripping and the ability to encode NaN and +/- infinity

But we do have the entire ASCII code set above 128 to play with (actually
we have more but above 128 is plenty)


The result would be slightly less efficient than a true binary JSON but not
by very much. The tags will still be there.

JSON taged items have an overhead of 4 bytes per entry:  "<tag>":<data>,

(You can add in spaces but they aren't necessary.)

It would not be difficult to reduce those 4 bytes to one. But it isn't a
big win either. The win comes from not having to Base64 the binary chunks.


For purposes of planning tags and making sure there is enough space it is
probably best to start off expansively and considering all the possible
types we might recognize

x80    Terminal String 8 bit length
x81    Terminal String 16 bit length
x82    Terminal String 32 bit length
x83    Terminal String 64 bit length

x84    Non Terminal String Chunk 8 bit length
x85    Non Terminal String Chunk 16 bit length
x86    Non Terminal String Chunk 32 bit length
x87    Non Terminal String Chunk 64 bit length

x88    Terminal Binary 8 bit length
x87    Terminal Binary 16 bit length
x8A    Terminal Binary 32 bit length
x8B    Terminal Binary 64 bit length

x8C    Non Terminal Binary Chunk 8 bit length
x8D    Non Terminal Binary Chunk 16 bit length
x8E    Non Terminal Binary Chunk 32 bit length
x8F    Non Terminal Binary Chunk 64 bit length

x90    IEEE 754 Floating Point binary16  (1)
x91    IEEE 754 Floating Point binary32
x92    IEEE 754 Floating Point binary64
x94    IEEE 754 Floating Point binary128  (1)

x96    IEEE 754 Floating Point decimal32 (1)
x97    IEEE 754 Floating Point decimal64 (1)
x98    IEEE 754 Floating Point decimal128  (1)

xA0    Unsigned Integer 8 (1)
xA1    Unsigned Integer 16 (1)
xA2    Unsigned Integer 32 (1)
xA3    Unsigned Integer 64 (1)
xA4    Unsigned Integer 128 (1)

xA5    Signed Integer 8 (1)
xA6    Signed Integer 16 (1)
xA7    Signed Integer 32 (1)
xA8    Signed Integer 64 (1)
xA9    Signed Integer 128 (1)

xAA    True
xAB    False
xAC    Null


(1) The need to implement these codes is debatable. But the tag asignment
scheme should not foreclose the possibility.

That still leaves a block of 72 codes unused. So if people wanted to go
back and do binary tagging at a later date (JSON-C) that would be possible.

I did try to work out some sort of clever bit mask trick so that the lower
bits of the code would specify the number of additional bytes to follow but
that does not work so well as there are 128 bit values to consider. And at
the end of the day there are only going to be 60 odd code maximum so an
array works fine.


I considered Nico's proposal of a 'continuation block' but that would
require a reader to read in the start of the next block before it can know
what to do with the previous data. The writer should know when it is
starting to write out a chunk that more chunks might follow or not. If no
more data follows the writer just puts out x80 x00 or x88 x00 to close the
stream.



{To follow the rest it might help to look at the diagrams in
http://www.json.org/]

There are a few tricks that could be used here to further reduce space.
consider the production "<tag>":<data>,

The initial " is not really needed since the only valid productions
following an object open brace are the close brace or an element entry. So
" is only needed if you desperately want to have the possibility of a tag
'}drop tables'. OK now I get it, leave the initial " in for Randall Munroe
http://xkcd.com/327/

If the reader sees a code above 7F it can only be binary data so the ":
separator between the tag and the data are superfluous and could be elided

The binary data descriptions are all defined length and so the terminal ,
is not needed.


So if people wanted to, we could adjust the FSR to allow these
abbreviations in JSON-B and save the four bytes per object entry overhead.
But that would still leave the tags there and those can't be eliminated
without some sort of dictionary, either being passed on the wire (which is
what compressors are really doing) or out of band (which is what schema
aware is really about).


So for JSON-C we would need ways to define a binding of a tag to a code and
ways of specifying those codes in object productions.

Keeping the initial " helps us here because it means that the only
currently valid characters after the initial { are ", } and whitespace.


We might just conceivably need more than 64K codes, so the code value is
potentially a 32 bit space.


So the codes I would define are:

C0   8 bit tag code follows
C1   16 bit tag code follows
C2   32 bit tag code follows

C4    8 bit definition follows
C5    16 bit definition follows
C6    32 bit definition follows

C7    8 bit tag with definition follows
C8    16 bit tag with definition follows
C9    32 bit tag with definition follows


the codes c4-c9 would be followed with a string definition. So the first
occurrence of { "foo" : "data" } would become:

x7B               {
xC7 x01 x80 x03 x66 x6f x6f    "foo":     [Code 1]
x80 x04 x64 x61 x74 x61          "data"
x7D               }


On the second occurrence the definition is already there:

x7B               {
xC0 x01  [Code 1]
x80 x04 x64 x61 x74 x61          "data"
x7D               }

An implementation could simply dump the dictionary out at the start of the
message using the C4-C6 codes.


-- 
Website: http://hallambaker.com/