Re: [Cbor] Interactions of packed CBOR and tags

Jim Schaad <ietf@augustcellars.com> Thu, 03 September 2020 16:03 UTC

Return-Path: <ietf@augustcellars.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6CC723A10B5 for <cbor@ietfa.amsl.com>; Thu, 3 Sep 2020 09:03:37 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.899
X-Spam-Level:
X-Spam-Status: No, score=-1.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id EdAd4T6Q9obq for <cbor@ietfa.amsl.com>; Thu, 3 Sep 2020 09:03:34 -0700 (PDT)
Received: from mail2.augustcellars.com (augustcellars.com [50.45.239.150]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id B1A073A10A9 for <cbor@ietf.org>; Thu, 3 Sep 2020 09:03:23 -0700 (PDT)
Received: from Jude (73.180.8.170) by mail2.augustcellars.com (192.168.0.56) with Microsoft SMTP Server (TLS) id 15.0.1395.4; Thu, 3 Sep 2020 09:03:16 -0700
From: Jim Schaad <ietf@augustcellars.com>
To: 'Brendan Moran' <Brendan.Moran@arm.com>
CC: 'Carsten Bormann' <cabo@tzi.org>, cbor@ietf.org
References: <00c101d67cb5$2588b790$709a26b0$@augustcellars.com> <E30F54B6-1A63-48AC-89AE-61983654B5A9@tzi.org> <00cc01d67cc9$766c7b60$63457220$@augustcellars.com> <4AE9B2FA-EEB3-4B45-96E4-9DC85118567D@arm.com>
In-Reply-To: <4AE9B2FA-EEB3-4B45-96E4-9DC85118567D@arm.com>
Date: Thu, 03 Sep 2020 09:03:14 -0700
Message-ID: <016f01d6820b$bc7d7cc0$35787640$@augustcellars.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AQHF41LwYmnJHHEpyMn9bwuLrqikDwI3uBqqAWrESqACK9UjWalJ0K+w
Content-Language: en-us
X-Originating-IP: [73.180.8.170]
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/RMzZjHDmvHvgn1c1HAZT-YtPcZA>
Subject: Re: [Cbor] Interactions of packed CBOR and tags
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Sep 2020 16:03:44 -0000


-----Original Message-----
From: Brendan Moran <Brendan.Moran@arm.com> 
Sent: Thursday, September 3, 2020 4:57 AM
To: Jim Schaad <ietf@augustcellars.com>
Cc: Carsten Bormann <cabo@tzi.org>; cbor@ietf.org
Subject: Re: [Cbor] Interactions of packed CBOR and tags

I’m not certain whether this overlaps or not, but I think it does. There’s a limitation in the current specification of packed cbor: Only text prefixes and singular CBOR elements can be packed. This may not seem like a big limitation at first, however I think that there are some missed opportunities here.

I’ll use CIRIs (https://tools.ietf.org/html/draft-hartke-t2trg-ciri-01) as an example. I realise they may not be “representative” but they illustrate two observations I’ve made about packed CBOR:

1. Some things need postfix sharing rather than prefix sharing. Domain names are the primary example of this. Because subdomains are prepended, rather than appended like the rest of the URI path, they need postfix sharing, assuming that the TLD has lowest entropy and the subdomains have the highest entropy in a given structure.

2. Packing sub-sequences of elements within containers is something that we should consider.


For example, suppose that I need to many CIRIs that represent, for example, the following URIs:

https://www.ietf.org
http://www.ietf.org
https://datatracker.ietf.org/group/cbor/about/
https://datatracker.ietf.org/doc/draft-bormann-cbor-packed/

Encoded as CIRIs and arranged into an array, these URIs take up 161 bytes (N.B. it would only be 151 bytes for the raw URIs) [
  [1,"https", 2, "www.ietf.org"],
  [1,"http", 2, "www.ietf.org"],
  [1,"https", 2, "datatracker.ietf.org", 5, 0, 6, "group", 6, "cbor", 6, "about"],
  [1,"https", 2, "datatracker.ietf.org", 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"] ]

[JLS] I really wish the updated CRI document would get published.  This issue was discussed last sprint from somewhere around IETF 107 and into May that produced a proposal that uses a well known URI pattern along with one built in dictionary.  Using that encoding you would end up with

[  [2, "www.ietf.org"],
  [1, "www.ietf.org"],
  [2, "datatracker.ietf.org", "group", "cbor", "about"],
  [2, "datatracker.ietf.org", "doc", "draft-bormann-cbor-packed"] ]

This means that we are looking at an encoding size of 125 bytes rather than 161 bytes.   So when that finally gets written up with will have a better starting point.

[/JLS]

If I wanted to pack this structure, I think the result would look something like this:

Naive (127 bytes, 78% compression ratio):
6([
  [
    [1, 6("s"), 2, simple(0)],
    [1, 6(""), 2, simple(0)],
    [1, 6("s"), 2, simple(1), 5, 0, 6, "group", 6, "cbor", 6, "about"],
    [1, 6("s"), 2, simple(1), 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"]
  ],
  [
    "http"
  ],
  "www.ietf.org",
  "datatracker.ietf.org"
])

Better (124 bytes, 77% compression ratio):
6([
  [
    [1, simple(1), 2, simple(2)],
    [1, simple(0), 2, simple(2)],
    [1, simple(1), 2, simple(3), 5, 0, 6, "group", 6, "cbor", 6, "about"],
    [1, simple(1), 2, simple(3), 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"]
  ],
  [
    simple(0)
  ],
  "http",
  6("s"),
  "www.ietf.org",
  "datatracker.ietf.org"
])

I think we can do a bit better, but it’s convoluted… (122 bytes, 76% compression ratio):
6([
  [
    [1, simple(1), 2, simple(3)],
    [1, simple(0), 2, simple(3)],
    [1, simple(1), 2, simple(4), 5, 0, 6, "group", 6, "cbor", 6, "about"],
    [1, simple(1), 2, simple(4), 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"]
  ],
  [
    simple(0),
    "www",
    "datatracker"
  ],
  "http",
  6("s"),
  ".ietf.org",
  224(simple(2)),
  225(simple(2))
])

[JLS]  I don't find this to be very convoluted at all and it is what my compression algorithm generates automatically.  I look at the piece I split off from the prefix and see if I have "enough" duplicates to compress them down.  I am still trying to decide how to do some extraction from "the middle" of things.  Consider looking at 
["www.merch.ietf.org", "www.datatracker.ietf.org"]

It would be nice to think about compressing that as 

6([ "www.", "merch", "datatracker"], [".ietf.org""], 224(225(simple(2))), 224(226(simple(2)))] 

Where we have pulled both prefix and postfix strings extracted and maximize the amount of commonality.  

[/JLS]


Back to the observations I made above.

1. Compressing the domain names doesn’t work very well. Maybe this is unique to domain names, but I think we need more data on that.
2. There’s a lot of regularity left here, but it’s all in the form of sequences of array elements. For example, the sequence [1, simple(1), 2] shows up regularly, as does [simple(4), 5, 0 , 6].

[JLS] This would be taken care of by the fact that we have discussed doing the same prefix processing on arrays as well.  This was brought up specifically for the CRI case where we saw that this was going to be an issue.


Here’s where I think the discussion overlaps. Enabling packing of sequences of elements would require one of three changes:

1. Only indefinite-length containers can have packed items.
2. The container count refers to unpacked count, which makes the packed CBOR invalid.
3. The container count refers to packed count, which means the decoder must adjust the total whenever a reference is encountered. This has additional implications for maps.

[JLS] I think that we can all agree that option 2 would be a completely unacceptable.  For the purpose of the compressed output, it needs to be definite length as that is going to be the shortest.  I think in the above however, you are potentially confusing the result of compression vs the result of decompression.  The compressed form should always be definite length encoded, the result of decompression could be either definite length or indefinite length according to how it operates and not what the compressed form looks like and therefore does not need to be specified in the document.

If it turns out that postfix sharing of data really is unique to domain names and that the problem isn’t generic, maybe we could solve it another way, for example special handling of anything within Tag 32. Sequence packing, however, is a space where I think we need a solution.

[JLS] I expect that for sequence and map compression we are going to find that we are going to end up looking at compression at the start, middle and end of these.  In the case of map compression then the question of how this is represented and made into a deterministic encoding might become very interesting.  It is much easier to just compressed the keys and values and ignore the map structure.

Jim
[/JLS]

Brendan

> On 28 Aug 2020, at 00:26, Jim Schaad <ietf@augustcellars.com> wrote:
>
>
>
> -----Original Message-----
> From: Carsten Bormann <cabo@tzi.org>
> Sent: Thursday, August 27, 2020 2:32 PM
> To: Jim Schaad <ietf@augustcellars.com>
> Cc: draft-bormann-cbor-packed@ietf.org; cbor@ietf.org
> Subject: Re: [Cbor] Interactions of packed CBOR and tags
>
>
>
>> On 2020-08-27, at 23:00, Jim Schaad <ietf@augustcellars.com> wrote:
>>
>> While building a test library of strings for evaluating my algorithm, 
>> I ended up with a question of how tags interact with the idea of CBOR packing.
>> Specifically, if I use a standard date/time string with tag 0, should 
>> that text string be considered as a candidate for packing?
>>
>> 0("1970-01-01T00:00Z") could potentially be compressed to 
>> 0(simple(3))
>>
>> The problem is that this is no longer a valid CBOR encoding so it 
>> would not seem to be a legal thing to do.
>>
>> Question:  Must packed CBOR be valid CBOR or does that requirement 
>> only apply to unpacked CBOR?
>
> Great question.
>
> The cop-out could be: either.
> Since CBOR-valid packed CBOR is a subset of (just well-formed) packed CBOR, it could be a parameter given to the compressor whether that is allowed to use compression opportunities like the above or not.
>
> What are the benefits/drawbacks:
>
> * (just well-formed) packed CBOR may lead to trouble with a generic decoder that cannot handle (present to the application) invalid constructs like 0(simple(3)).
> The application can decide whether it wants to live with this limitation or not.
>
> * the structural coherence of the packed structure (that this draft is about) will be expressed as a validity constraint.  It is a bit weird to then relax validity of what goes in there, but not entirely without precedent (e.g., tag 24, even though there is a more explicit firewall here).
>
> * not using those compression opportunities can be wasteful, not just for the example given above (tag 0), but also for tags like 32.
>
> I think I would emphasize the (just well-formed) packed CBOR, but still introduce CBOR-valid packed CBOR as a selectable additional constraint for applications that need to work with pre-packed (designed before packed was invented) generic decoders.
>
> [JLS] Yes I agree that only requiring that the result be well-formed makes the most sense.  It probably makes sense to discuss the implications in the document.  A more interesting case might be tag 26 which could have duplicate or prefix lines of text coming up frequently.
>
> I think it might make sense to reference tag 25 and say that this does the same thing only much better.
> [/JLS]
>
> Do we need to encode this selection?  (E.g., via different top-level tags.)  Probably not.
>
> [JLS] I don't think that there needs to be different top-level tags.
>
> Jim
>
>
> Grüße, Carsten
>
>
> _______________________________________________
> CBOR mailing list
> CBOR@ietf.org
> https://www.ietf.org/mailman/listinfo/cbor

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.