Re: [Cbor] Review draft-bormann-cbor-packed-00

Carsten Bormann <cabo@tzi.org> Sun, 26 July 2020 01:00 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 $3608.120.23.2.1$)
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <000e01d65b79$ad6f6ee0$084e4ca0$@augustcellars.com>
Date: Sun, 26 Jul 2020 03:00:42 +0200
Cc: cbor@ietf.org, draft-bormann-cbor-packed@ietf.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <6FD968BE-B5C0-46B5-AE03-A69A24E850C1@tzi.org>
References: <00f701d65adb$1c49e1f0$54dda5d0$@augustcellars.com> <04545D98-A936-4DD0-8446-3634E6C12667@tzi.org> <000e01d65b79$ad6f6ee0$084e4ca0$@augustcellars.com>
To: Jim Schaad <ietf@augustcellars.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/UINHSQS7lx6P0qTdrd8crlvFJUk>
Subject: Re: [Cbor] Review draft-bormann-cbor-packed-00
Precedence: list

On 2020-07-16, at 16:01, Jim Schaad <ietf@augustcellars.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: CBOR <cbor-bounces@ietf.org> On Behalf Of Carsten Bormann
>> Sent: Wednesday, July 15, 2020 12:43 PM
>> To: Jim Schaad <ietf@augustcellars.com>
>> Cc: cbor@ietf.org; draft-bormann-cbor-packed@ietf.org
>> Subject: Re: [Cbor] Review draft-bormann-cbor-packed-00
>> 
>> Hi Jim,
>> 
>> Thanks for this lightning feedback.  Let me try to answer as quickly...
>> 
>>> On 2020-07-15, at 21:06, Jim Schaad <ietf@augustcellars.com> wrote:
>>> 
>>> Carsten,
>>> 
>>> I am finding this document really hard to follow.
>> 
>> Yes.  It contains the specification, but not yet any tutorial material.
>> Way too few examples.
>> 
>>> I am sure that we can get
>>> it to read much better in the future.
>> 
>> Absolutely.
>> 
>>> Some of the issues that I am having
>>> are:
>>> 
>>> 1.  How the Tag 6 is used is dependent on the type of the embedded
>>> content and this is not explained at all.  It took me a while to
>>> understand why
>>> #6.6(0) was not a potential duplicate value between sections 2.1 and 2.2.
>> 
>> Yes.
>> 
>> Section 3, second paragraph says this, but that may be a little late…
>> 
>> (I always read the discussion before the normative part, so I just didn’t notice.)
> 
> [JLS] I tend to read documents from the front to the back.  😊

OK, moved it to Section 2.


> 
>> 
>>> 2.  I am trying to get a handle on where you think this would be used
>>> and how much compression is going to give me.
>> 
>> Use where?  Where simple data models exhibit too much sharing, or where the
>> strings that go into those have too long common prefixes (which happens a lot
>> in RDF-derived environments).
>> And the point is that there is no need for decompression at the consumer.
>> 
>> How much compression?  One data point:
>> http://www.w3.org/2017/07/wot-f2f/slides/w3c-serialization.pdf — slide 6.
>> 1/3 off by sharing, ~ 1/5 more off by prefix compression. 47 % left.
>> (Deflate gives 29 %, so is 57 % of cbor-packed.) A static dictionary would help
>> some more.
> 
> Yes, but are you willing to settle for ~30% rather than ~33% when looking at sharing.

These are approximate numbers, so I’m not sure what you are aiming at here.

> 
>> 
>>> Part of this is going to be a
>>> question of how much compression is obtained and the difference
>>> between optimal and reasonable.
>> 
>> I you need optimal compression, go for LZ77 over the whole thing.
>> Cbor-packed is reasonable :-)
>> 
>>> For example, Tags 224-255 would end up with 2 bytes for the tag, plus
>>> the encoding of the suffixed data.  Moving to #6.6([#, suffix]) ends
>>> up with 4 bytes plus the encoding of the suffix.
>>> This means that you would want to save a minimum of 8 bytes in the
>>> prefix rather than 4 bytes.  You are also going to have two bytes
>>> consumed in the suffix table for strings.
>> 
>> Right.  That’s why the draft is a bit aggressive about obtaining 1-byte and 2-
>> byte references.
> 
> [JLS] And I think that good-enough could be less aggressive about this.

The main reason to be a bit aggressive is that we should be avoiding artificial limitations.  On the other end of the spectrum, going for a 1+0 tag and a good number of 1+0 simple values is really worth it.

> 
>> 
>>> 4. I am having a hard time figuring out how you are going to get a
>>> reference loop because I was assuming that any shared items that would
>>> need to be unpacked would have it's own dictionary defined.
>> 
>> No, only the top level has shared/prefix items.
>> Shared/prefix item 5 could reference Shared/prefix item 5 (insert larger loop
>> as needed).
>> 
>> E.g., shared item 0 could be “sold out”, and shared item 1 could be {“price”:
>> 9.99, “stock”: simple(0)}.
>> 
>>> This seems to imply that
>>> shared items are using the same dictionary.   This needs an example.
>> 
>> Right.  Half an example above.
> 
> [JLS] Yes this is what I expected after coming across the recursion problem. But this is now also becoming more complicated than just looking for common strings when building the compression tables.

You get some good compression from just doing sharing of strings.  It gets better by sharing some structures and some prefixes.

(I haven’t made a proposal how to share a “prefix” (subset) of a map.  That should help some more.)

> 
>> 
>>> 5. Is there a reason not to switch to using '^' for exponentiation?
>>> This is the more mathematical one symbol.
>> 
>> Yes. ^ is XOR in C.
>> In math, we do that differently.
>> Like 2⁶⁴ :-)  But the RFC editor won’t let me do that, I’m sure...

Went for <sup> and the inevitable text fallback.

>>> 6.  As we compose documents nested documents will become more
>>> frequent.  The question is should the multiple levels be merged into a
>>> single compression table.
>> 
>> Well, you could try it, and if it gets longer (because the references to a larger
>> dictionary get larger), you don’t.
>> 
>>> 7.  I am not happy about the complete abandonment of providing
>>> guidance for a packing algorithm.  At a minimum things like doing
>>> advance computing on compression, using pre-defined vocabularies (if
>>> not dictionaries) is most likely usable.
>> 
>> Just putting the JSON object labels into a dictionary, sorting by number of use,
>> and removing those that are not referenced often enough or where the
>> references get larger than the actual strings, is often 80 % of the gain.
>> My little toy packer (250 LOC) simply tries to share *all* duplicate sub-items.
>> The fun comes in if a duplicate sub-item has a sub-item shared with another
>> sub-item, which may make referencing the duplicate sub-item less optimal
>> than just copying the sub-item.  I’m not sure my algorithm is very sane, but I
>> like the output, and for the details that’s what we have BYGS (*) for.
> 
> [JLS] Well - you might have BYGS but I don't have such a thing.  I suppose that some of it is also going to drop down to local knowledge of the space.  "We know these things tend to occur a lot - use them."

OK, give me some time to polish my compressor so I don’t terminally turn myself into a clown in the data compression community :-)

But the basic idea is to look for sharing and prefix opportunities, and then strike those that don’t actually improve the representation size.  Sort by frequency of use, build the tables, and then build the rump from the tables.  
< 250 LOC if done at the level of aptitude I have for compression algorithms :-).

>>> 8.  Should we provide an option to use an application defined
>>> dictionary as that would pack even more at the penalty that the
>>> document could not be unpacked without that dictionary.
>> 
>> Yes.  The question is how that is obtained (from context?), and how the
>> dictionary items are merged with those shared items listed in the item (the
>> dictionary may or may not have the most frequently used shared items in the
>> specific data item to be packed).  I wouldn’t want to COSE-sign a packed CBOR
>> data item where it is hackable from where the dictionary came...
> 
> That raises an interesting question.  Should one only sign the unpacked rather than packed version?  At this point in time we do not have a lossy compression so that is probably not a required thing.  I had briefly be thinking about detached signatures where one might merge compressions.

Whatever is being signed needs to include the referenced dictionaries.
I think the building of these signing inputs needs to be part of the cbor-packed specification if we add referenced dictionaries to it.
Added a short note to the security considerations.

The small updates discussed are now in

Html: https://www.ietf.org/id/draft-bormann-cbor-packed-01.html
Diff: https://www.ietf.org/rfcdiff?url2=draft-bormann-cbor-packed-01

Grüße, Carsten

[Cbor] Review draft-bormann-cbor-packed-00 Jim Schaad
Re: [Cbor] Review draft-bormann-cbor-packed-00 Carsten Bormann
Re: [Cbor] Review draft-bormann-cbor-packed-00 Jim Schaad
Re: [Cbor] Review draft-bormann-cbor-packed-00 Carsten Bormann
Re: [Cbor] Review draft-bormann-cbor-packed-00 Jim Schaad
Re: [Cbor] Review draft-bormann-cbor-packed-00 Carsten Bormann