Re: [Cbor] Interactions of packed CBOR and tags

Jim Schaad <> Thu, 03 September 2020 18:14 UTC

From: Jim Schaad <>
To: 'Brendan Moran' <>
CC: 'Carsten Bormann' <>, <>
Date: Thu, 3 Sep 2020 11:13:50 -0700

-----Original Message-----
From: Brendan Moran <> 
Sent: Thursday, September 3, 2020 9:58 AM
To: Jim Schaad <>
Cc: Carsten Bormann <>;
Subject: Re: [Cbor] Interactions of packed CBOR and tags

> On 3 Sep 2020, at 17:03, Jim Schaad <> wrote:
> -----Original Message-----
> From: Brendan Moran <>
> Sent: Thursday, September 3, 2020 4:57 AM
> To: Jim Schaad <>
> Cc: Carsten Bormann <>;
> Subject: Re: [Cbor] Interactions of packed CBOR and tags
> I’m not certain whether this overlaps or not, but I think it does. There’s a limitation in the current specification of packed cbor: Only text prefixes and singular CBOR elements can be packed. This may not seem like a big limitation at first, however I think that there are some missed opportunities here.
> I’ll use CIRIs ( as an example. I realise they may not be “representative” but they illustrate two observations I’ve made about packed CBOR:
> 1. Some things need postfix sharing rather than prefix sharing. Domain names are the primary example of this. Because subdomains are prepended, rather than appended like the rest of the URI path, they need postfix sharing, assuming that the TLD has lowest entropy and the subdomains have the highest entropy in a given structure.
> 2. Packing sub-sequences of elements within containers is something that we should consider.
> Suppose that I need to encode many CIRIs that represent, for example, the following URIs:
> Encoded as CIRIs and arranged into an array, these URIs take up 161 bytes (N.B. it would only be 151 bytes for the raw URIs):
> [
>  [1,"https", 2, ""],
>  [1,"http", 2, ""],
>  [1,"https", 2, "", 5, 0, 6, "group", 6, "cbor", 6, "about"],
>  [1,"https", 2, "", 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"]
> ]
> [JLS] I really wish the updated CRI document would get published.
> This issue was discussed last spring, from somewhere around IETF 107
> into May, and that discussion produced a proposal that uses a well-known
> URI pattern along with one built-in dictionary.  Using that encoding you
> would end up with
> [
>  [2, ""],
>  [1, ""],
>  [2, "", "group", "cbor", "about"],
>  [2, "", "doc", "draft-bormann-cbor-packed"]
> ]
> This means that we are looking at an encoding size of 125 bytes rather than 161 bytes.   So when that finally gets written up, we will have a better starting point.
> [/JLS]

[BJM] That’s really interesting. I like where it’s going. However, it is still going to be bigger than CBOR-encoded plain-text URIs if you have more than a few path segments longer than 23 bytes, which is a bit annoying. For example, https:// is 8 bytes. The scheme above compresses that to 1 byte, but introduces an overhead of 1 byte for the array, so we’ve saved 6 bytes. Each path segment has a 1-byte separator in plain text. In the scheme above, it has a 1-byte text string header instead. This is the same total, except that the text encoding can omit the last delimiter. This means that we’ve saved only 5 bytes. If there are more than 23 path segments, then we lose another byte. If 5 or more path segments are more than 23 bytes long, we are at parity.
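
The byte accounting above can be checked with a small sketch. It assumes the standard CBOR head encoding (lengths up to 23 fit in the initial byte, 24 to 255 need one extra byte, and so on). The helper names head_len, text_len and array_of_text_len are made up for this illustration, and only the path part of a URI is compared.

```python
def head_len(n):
    """Bytes taken by a CBOR initial byte plus the argument n."""
    if n < 24:
        return 1          # argument fits in the initial byte
    if n < 2 ** 8:
        return 2          # one extra length byte
    if n < 2 ** 16:
        return 3
    if n < 2 ** 32:
        return 5
    return 9


def text_len(s):
    """Encoded size of a definite-length CBOR text string."""
    body = s.encode("utf-8")
    return head_len(len(body)) + len(body)


def array_of_text_len(items):
    """Encoded size of a definite-length array holding only text strings."""
    return head_len(len(items)) + sum(text_len(i) for i in items)


segments = ["doc", "draft-bormann-cbor-packed"]

# One array head plus one text head per segment.
as_array = array_of_text_len(segments)

# One text head for the whole path, "/" between segments.
as_path = text_len("/".join(segments))

print(as_array, as_path)  # prints "32 31"
```

The 25-byte segment needs a 2-byte head in both forms, but the array form also pays for the array head and a head per segment, which is where the one-byte difference in this example comes from.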

It depends on what the goals of the scheme are. CRI is not always smaller. If size were the only goal, a hybrid approach that retains path separators would yield smaller results:

75 bytes:
>  [2, 
> "
> ed"]
76 bytes:
>  [2, "", "a-24-character-path-1234", 
> "draft-bormann-cbor-packed"]


[JLS]  The original goal was to free devices from the work of having to parse URIs and allow them to just do a simpler copy into CoAP options when they get back answers from a directory server, among others.  Then we did see that the proposals were inflating the sizes too much.  I believe that contains the latest proposal although there may have been offline changes I don't know about.

> I think we can do a bit better, but it’s convoluted… (122 bytes, 76% compression ratio):
> 6([
>  [
>    [1, simple(1), 2, simple(3)],
>    [1, simple(0), 2, simple(3)],
>    [1, simple(1), 2, simple(4), 5, 0, 6, "group", 6, "cbor", 6, "about"],
>    [1, simple(1), 2, simple(4), 5, 0, 6, "doc", 6, "draft-bormann-cbor-packed"]
>  ],
>  [
>    simple(0),
>    "www",
>    "datatracker"
>  ],
>  "http",
>  6("s"),
>  "",
>  224(simple(2)),
>  225(simple(2))
> ])
> [JLS]  I don't find this to be very convoluted at all and it is what 
> my compression algorithm generates automatically.  I look at the piece 
> I split off from the prefix and see if I have "enough" duplicates to 
> compress them down.  I am still trying to decide how to do some 
> extraction from "the middle" of things.  Consider looking at 
> ["", ""]
> It would be nice to think about compressing that as
> 6([ "www.", "merch", "datatracker"], ["""], 
> 224(225(simple(2))), 224(226(simple(2)))]
> where we have extracted both prefix and postfix strings and maximized the amount of commonality.
> [/JLS]
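
As a rough illustration of pulling both a shared prefix and a shared postfix out of a set of strings, here is a minimal sketch. It is not the compression algorithm discussed above and does not produce packed CBOR; it only shows the splitting step. The host names are invented for the example (the ones in the thread are not reproduced here), and split_shared and common_suffix are hypothetical helper names.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    # Reverse each string, take the common prefix, reverse it back.
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def split_shared(strings):
    """Split strings into (shared prefix, per-string middles, shared suffix)."""
    prefix = os.path.commonprefix(strings)
    rest = [s[len(prefix):] for s in strings]
    suffix = common_suffix(rest)
    middles = [r[:len(r) - len(suffix)] for r in rest]
    return prefix, middles, suffix


# Invented host names, used only for illustration.
names = ["www.docs.example", "www.wiki.example"]
prefix, middles, suffix = split_shared(names)

print(prefix, middles, suffix)  # prints "www. ['docs', 'wiki'] .example"
```

Each original string is prefix + middle + suffix, so a packer could store the prefix and suffix once in a table and keep only the middles inline.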

Under the existing -01 draft, I’m not sure you can do much better than:


[JLS] After correcting both of our notations, in this case it seems that both notations are the same size.  From your response, I am not sure whether you believe that the notation I gave is legal under the current draft.

> Back to the observations I made above.
> 1. Compressing the domain names doesn’t work very well. Maybe this is unique to domain names, but I think we need more data on that.
> 2. There’s a lot of regularity left here, but it’s all in the form of sequences of array elements. For example, the sequence [1, simple(1), 2] shows up regularly, as does [simple(4), 5, 0 , 6].
> [JLS] This would be taken care of by the fact that we have discussed doing the same prefix processing on arrays as well.  This was brought up specifically for the CRI case where we saw that this was going to be an issue.

I’m glad to hear that. Did I miss something on the mailing list? I assumed that this was still TBD since I didn’t see it there. Does that mean that the use of Tag 6 as the first prefix is being dropped so that array references are explicit? Or are array prefixes prohibited in the first slot? Or something else?

[JLS] is the message where Carsten made a proposal which has not gotten into the draft yet.

> Here’s where I think the discussion overlaps. Enabling packing of sequences of elements would require one of three changes:
> 1. Only indefinite-length containers can have packed items.
> 2. The container count refers to unpacked count, which makes the packed CBOR invalid.
> 3. The container count refers to packed count, which means the decoder must adjust the total whenever a reference is encountered. This has additional implications for maps.
> [JLS] I think that we can all agree that option 2 would be completely unacceptable.  For the purpose of the compressed output, it needs to be definite length as that is going to be the shortest.  I think in the above, however, you are potentially confusing the result of compression with the result of decompression.  The compressed form should always be definite-length encoded.  The result of decompression could be either definite length or indefinite length, according to how the decompressor operates and not what the compressed form looks like, and therefore does not need to be specified in the document.

[BJM] I think I may have been confusing things, yes. When considering how a parser should handle a string prefix, it’s clear that the size is the packed size. This will cause some interesting problems for pull parsers, particularly those that use lazy evaluation. The SUIT demo parser saves pointers into the manifest CBOR when setting variables, rather than storing the variables themselves. It occurs to me that this will also need a list of current reference tables for lazy evaluation. In many cases, that may be larger than the value being set, which will make things interesting. This will require some careful handling and probably some usage notes for implementors.
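
To make option 3 from the list above concrete (the container count is the packed count, and the decoder grows the array when it meets a reference), here is a minimal sketch over already-decoded Python values. Sequence references are not defined in the draft discussed here, so the SeqRef class and unpack_array function are purely hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeqRef:
    """Hypothetical reference to a shared sub-sequence in a table."""
    index: int


def unpack_array(packed, table):
    """Expand sequence references; the result may be longer than the input."""
    out = []
    for item in packed:
        if isinstance(item, SeqRef):
            out.extend(table[item.index])      # splice the shared elements in
        elif isinstance(item, list):
            out.append(unpack_array(item, table))
        else:
            out.append(item)
    return out


table = [[1, "https", 2]]                      # one shared sub-sequence
packed = [SeqRef(0), "host-a", 5, 0]           # packed count is 4
unpacked = unpack_array(packed, table)

print(len(packed), len(unpacked))  # prints "4 6"
```

The packed array stays a valid definite-length array of 4 items; only the unpacked result has 6, which is why the decoder cannot preallocate from the packed count alone.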

> If it turns out that postfix sharing of data really is unique to domain names and that the problem isn’t generic, maybe we could solve it another way, for example special handling of anything within Tag 32. Sequence packing, however, is a space where I think we need a solution.
> [JLS] I expect that for sequence and map compression we are going to end up looking at compression at the start, middle and end of these.  In the case of map compression, the question of how this is represented and made into a deterministic encoding might become very interesting.  It is much easier to just compress the keys and values and ignore the map structure.
> Jim
> [/JLS]

[BJM] I agree. It becomes more interesting if you want to pack odd numbers of elements, for example 2 keys and one value.
[JLS] I don't think that would ever be legal since it would not produce valid CBOR in the packed case.  The best you could do would be to convert to an array and pack that.


>> On 28 Aug 2020, at 00:26, Jim Schaad <> wrote:
>> -----Original Message-----
>> From: Carsten Bormann <>
>> Sent: Thursday, August 27, 2020 2:32 PM
>> To: Jim Schaad <>
>> Cc:;
>> Subject: Re: [Cbor] Interactions of packed CBOR and tags
>>> On 2020-08-27, at 23:00, Jim Schaad <> wrote:
>>> While building a test library of strings for evaluating my 
>>> algorithm, I ended up with a question of how tags interact with the idea of CBOR packing.
>>> Specifically, if I use a standard date/time string with tag 0, 
>>> should that text string be considered as a candidate for packing?
>>> 0("1970-01-01T00:00Z") could potentially be compressed to
>>> 0(simple(3))
>>> The problem is that this is no longer a valid CBOR encoding so it 
>>> would not seem to be a legal thing to do.
>>> Question:  Must packed CBOR be valid CBOR or does that requirement 
>>> only apply to unpacked CBOR?
>> Great question.
>> The cop-out could be: either.
>> Since CBOR-valid packed CBOR is a subset of (just well-formed) packed CBOR, it could be a parameter given to the compressor whether that is allowed to use compression opportunities like the above or not.
>> What are the benefits/drawbacks:
>> * (just well-formed) packed CBOR may lead to trouble with a generic decoder that cannot handle (present to the application) invalid constructs like 0(simple(3)).
>> The application can decide whether it wants to live with this limitation or not.
>> * the structural coherence of the packed structure (that this draft is about) will be expressed as a validity constraint.  It is a bit weird to then relax validity of what goes in there, but not entirely without precedent (e.g., tag 24, even though there is a more explicit firewall here).
>> * not using those compression opportunities can be wasteful, not just for the example given above (tag 0), but also for tags like 32.
>> I think I would emphasize the (just well-formed) packed CBOR, but still introduce CBOR-valid packed CBOR as a selectable additional constraint for applications that need to work with pre-packed (designed before packed was invented) generic decoders.
>> [JLS] Yes I agree that only requiring that the result be well-formed makes the most sense.  It probably makes sense to discuss the implications in the document.  A more interesting case might be tag 26 which could have duplicate or prefix lines of text coming up frequently.
>> I think it might make sense to reference tag 25 and say that this does the same thing only much better.
>> [/JLS]
>> Do we need to encode this selection?  (E.g., via different top-level tags.)  Probably not.
>> [JLS] I don't think that there needs to be different top-level tags.
>> Jim
>> Grüße, Carsten
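
The difference between well-formed and valid in the tag 0 example can be seen at the byte level. The sketch below hand-encodes both forms and applies one validity rule, that the content of tag 0 is a text string. It is not a general CBOR decoder, and the function names are made up for this illustration.

```python
def major_type(initial_byte):
    """Major type is the top three bits of a CBOR initial byte."""
    return initial_byte >> 5


def tag0_content_is_text(encoded):
    """Check that the item enclosed by tag 0 is a text string (major type 3)."""
    if encoded[0] != 0xC0:                     # 0xC0 is tag number 0
        raise ValueError("item does not start with tag 0")
    return major_type(encoded[1]) == 3


# 0("1970-01-01T00:00Z"): tag 0, then a 17-byte text string (head 0x71).
unpacked = bytes([0xC0, 0x71]) + b"1970-01-01T00:00Z"

# 0(simple(3)): tag 0, then simple value 3 (0xE3). Well-formed, but the
# tag content is not a text string, so a validity-checking decoder may reject it.
packed = bytes([0xC0, 0xE3])

print(len(unpacked), tag0_content_is_text(unpacked))  # prints "19 True"
print(len(packed), tag0_content_is_text(packed))      # prints "2 False"
```

The packed form is 17 bytes shorter in this case, which is the compression opportunity being weighed against generic decoders that enforce tag validity.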
>> _______________________________________________
>> CBOR mailing list
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
