Re: [Cbor] Challenges in supporting bignum tag

Carsten Bormann <cabo@tzi.org> Thu, 02 December 2021 01:12 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAM70yxDvL1HKZSz8FMwFb64Tvse6TrqUwNgTmeY7kbMYGazDjg@mail.gmail.com>
Date: Thu, 02 Dec 2021 02:11:49 +0100
Cc: cbor@ietf.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <63CF958A-1153-4932-BE03-3F172E84D60F@tzi.org>
References: <CAM70yxDvL1HKZSz8FMwFb64Tvse6TrqUwNgTmeY7kbMYGazDjg@mail.gmail.com>
To: Emile Cormier <emile.cormier.jr@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/T0Mq9rKGIOF5NWRKibm68DInFAE>
Subject: Re: [Cbor] Challenges in supporting bignum tag
Precedence: list

Hi Emile,

On 2021-12-02, at 00:38, Emile Cormier <emile.cormier.jr@gmail.com> wrote:
> 
> Hi Everyone,
> 
> I'm in the process of supporting CBOR "bignums" (which I think should have been named "bigint”)

Indeed, “bignum” is a historical name at best.
But it has stuck for half a century (since about 1970 [1]).

> in my C++ serialization library, and have encountered the following challenges:
> 
> 1. When transcoding to other binary formats (such as MessagePack), the sign of the bignum is lost because it is deduced by the tag value and is not part of the message payload. I have to resort to converting it to [sign, byte_array], which makes it no longer round-trippable to its original CBOR format.

I don’t know how other binary interchange formats represent large integers (I probably have actively forgotten because I have seen too many horrible attempts).

The assumption that led to our design is that most large integers will be unsigned, so doing some additional work for negatives is not much of a problem.
If you need to transform a negative integer in sign-magnitude into what the target encoding uses to represent large numbers, you will need some transformation; e.g., if a 2’s complement representation is used, you will need to flip the bits and add or subtract 1 — no such addition/subtraction is needed for bignint (just flip the bits ((-1-N) == ~N)).

> 2. In CBOR decoders that ignore tags, the sign of the bignum is lost when passed up to the application. This goes against the following passage in section 3.4 of the spec:
> 
> "Decoders do not need to understand tags of every tag number, and tags may be of little value in applications where the implementation creating a particular CBOR data item and the implementation decoding that stream know the semantic meaning of each item in the data flow."

Well, they do need to understand tags 2 and 3 if they have a need for negative numbers in addition to unsigned ones.

The comment you cite is more about tags like 1 or 32 — if the context already makes clear that this is a posix time / a URI, respectively, the tag may be redundant (but still useful to push handling of platform data types into a generic decoder).

> So even if the application knows the semantic meaning of the received byte array (it's expecting a bignum), it cannot reconstitute the sign of the received magnitude without knowing the tag.

Correct.
Again, most applications don’t use negative bignums, so if the application knows that it expects a biguint, the tag has no additional information.  But if you need both kinds of bignums, you need to read the tag.

> Note that the same sort of information loss occurs with typed arrays, although it is more likely that an application would know the endianness and element size of a received typed array.

Right — same situation — if only one tag can possibly be used, you can ignore it; if there are several to choose from, you need the tag number.

> 3. Most arbitrary precision libraries represent big integers in signed magnitude format. CBOR wants negative values to be stored as (-1 - N). I therefore need to subtract one from the magnitude stored in a negative multiprecision integer. Because the multiprecision subtraction algorithm needs to start from the least significant byte (to propagate carries), I have to perform subtraction on the entire bigint before I can serialize it to CBOR. This is because CBOR wants the bytes to be in big endian order. It's therefore not possible to process and serialize each source byte "on the fly" starting from the most significant byte. If I'm serializing a huge bigint, then there's a space doubling penalty in the temporary bigint needed for the subtraction.

Yes, that is an interesting observation.  
Maybe it’s worth to spend a tag to represent a negative bigint (bignint) in magnitude format.
But then, tag 3 is not that hard for 1’s complement or magnitude:

For creating normal CBOR bignints, an algorithm for subtracting one that does not need to create another copy would be to first find the place where the carry chain ends from the LSB up (usually the last byte, but could be up to the MSB) and then copy out the number starting from MSB up to that place, and subtract 1 from each byte from there on (well, spewing out 0xFF for all but the highest).  (I hope you don’t have to deal with negative zero, but that is an easy special case if you do.  The other special case is ingesting a CBOR bignint with 0xFF times N, because you have to add a byte for the magnitude.)

> 4. Because of #3, my CBOR codec implementation now depends on multiprecision integer arithmetic.

Subtracting one maybe does not cause a need to pull in a big library (see example algorithm above).

> I'm now forced to do one of the following:
> 
> a) Add a dependency to a multiprecision arithmetic library, which may not be the same one that's used by the application. For example, my library depends on Boost.Multiprecision, but the application wants to serialize a GMP bignum. The application now ends up with dependencies on two multiprecision libraries - yuck!
> 
> b) Write the multiprecision subtraction by hand to avoid introducing a dependency on a multiprecision arithmetic library. While the algorithm isn't too complicated, what I write is certain to be slower than a multiprecision library.
> 
> c) Require that the intermediary byte format passed to the encoder be the same as what CBOR expects (which may not be the same format as what other binary serialization formats want). For example, the header in my library which adds GMP serialization support could use GMP itself to perform the -1 subtraction and pass those bytes to the encoder. I should note here that the encoder interface in my library is agnostic of the serialization format. That is, the JSON encoder interface is the same as the CBOR encoder interface. This makes it easy for the application to serialize to different formats (JSON, CBOR, etc) without having to change anything in the application's serialization code.

I think the algorithm I gave above is simple enough that this should obviate these considerations.

> In short, my life would be much easier if CBOR bignums were different in the following way:
> 
> - The sign is part of the message payload, and not "deduced" by the tag value,

If you mean going to a 2’s complement representation, or stealing the msb of the MSB for a sign bit, that would be a big problem as biguints often are designed fill all of the bytes, so adding a sign inside the representation adds a byte in half of the cases.  Remember that negative values are rarely used, and CBOR is optimized for the biguints we find in many applications.

> - The bytes represent the magnitude of the bignum: no subtraction/addition by 1 for negative values.

That could be added (see discussion of adding a tag for this, above).

> There are other tags that currently depend on the existing bignum tag (bigdecimal, bigfloat, extended time), so I don't expect there to be much interest in a new "sane bignum" tag that addresses the "flaws" I brought up here. I'm just documenting the challenges I'm facing with the existing bignum tag so that they're considered if there's ever a fork of this standard, or if some other binary protocol wants to implement bignums.
> 
> Overall I feel there's too much emphasis in CBOR on cramming everything into as few bytes as possible, and not enough on decoding simplicity and speed.

Simplicity and speed depends a lot on what your platform has.
CBOR is very efficient for biguints.
If you need to handle negative numbers, unfortunately there is a lot of variety, and I’m sorry that what we chose doesn’t map one-to-one with what you find in your platform.

> I also think some tags have been "abused" for storing information that should otherwise be part of the message payload (all in the name of saving a few bytes). This makes tags go beyond their purpose of providing semantic meaning.

I do not agree at all that using a tag to transfer information is abuse.
Tags are a significant innovation that make a lot of things much easier than they are in other formats.
(They also can be overused, which is why CDDL has “unwrap” for tags, and why the comment mentioned above was made.)

> Sorry if this comes across as a rant. It's not my intention to belittle anyone. I just wanted to share my challenges in implementing CBOR for my C++ serialization library. Perhaps someone here will have advice on how I can overcome these challenges.

See my algorithm proposal for converting negative bignums into magnitude and back, above.

Grüße, Carsten

[1]: https://www.dreamsongs.com/Files/HOPL2-Uncut.pdf

[Cbor] Challenges in supporting bignum tag Emile Cormier
Re: [Cbor] Challenges in supporting bignum tag Carsten Bormann
Re: [Cbor] Challenges in supporting bignum tag Laurence Lundblade
Re: [Cbor] Challenges in supporting bignum tag Emile Cormier
Re: [Cbor] Challenges in supporting bignum tag Emile Cormier