Re: [Cbor] Challenges in supporting bignum tag

Emile Cormier <emile.cormier.jr@gmail.com> Thu, 02 December 2021 02:10 UTC

Return-Path: <emile.cormier.jr@gmail.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 753743A079B for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 18:10:24 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.097
X-Spam-Level:
X-Spam-Status: No, score=-2.097 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id MEoZE-bet0GA for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 18:10:19 -0800 (PST)
Received: from mail-yb1-xb33.google.com (mail-yb1-xb33.google.com [IPv6:2607:f8b0:4864:20::b33]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id B155C3A079D for <cbor@ietf.org>; Wed, 1 Dec 2021 18:10:19 -0800 (PST)
Received: by mail-yb1-xb33.google.com with SMTP id g17so68963957ybe.13 for <cbor@ietf.org>; Wed, 01 Dec 2021 18:10:19 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Gb1ed+RePomII0EibCCzcy3kLXL+mUhRBUZITs2rVfU=; b=TB9+FKLWsGAaCJ2uCUI0p3UNhqlvQL5xnl7mZ1dz5wt9Nu8kbaQM3a/4rTlsJs/VI6 nqlz11vt5lJ4eSiSesKbuV9iDIu3kMVQL01wY4P1SW7E+nyYGgblTo6kW0BGdAb6Bbpv lJh0lDSw3s1siy5+KKh1poxNwfcbLfCHK4/Q70dII27ZYJW3VCLPHKuiwTiBGpojIg3x hHkC6ANXvIDQd3CIIyR8Ij5Bh0R1/7rw5bywybZ/Lt70rtlbSMx6ODDVTY+xw+X0r6jT iGeF3z1IOt54/04qU+weR3cXAhpPYmDKKGDURY9gjRdGZxSEmXoW5j/3dTRw5hirSmcu 4AhQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Gb1ed+RePomII0EibCCzcy3kLXL+mUhRBUZITs2rVfU=; b=CjmBvEOKYI8zVnITPfq9JCnjtGFkDLR6Fq77pc0PzahSiio/URL5UxU9hllhf5jbAO wnyIicUsG7jTKV7G+pQ7g814TAu5VLJB9c7jlLbP7fY1bbbjIGMu+wVNGeJQss8ZBrf+ uvB1SiKMgffPZLbF4QaYEopnhcs5tm5yW7mrfYQXDQdB/xwcv1f3wDWvZfSoBVWzo6gF x4zrDVeDXjPPQOJmpl3RE1ZLA2tujnhVI1AnKr3u7s+qUKvw12prF6NrRlXoMvkc3A0Z FUjV3Co9OUksWAxBQUJZOIe/EwoJBcs1lVavnHeBLJqCMmnv6QOvus584dCoWrjD3DS5 eTHg==
X-Gm-Message-State: AOAM533pjUeubzUWoqgMfuAKdUjz8onil7de2NZQyOAE1VbyCjg9Vodn 8uBhEl6yQqjWcVZVqj0PuVmQPc7HOKO9/hlBz3jxXbSHa5E=
X-Google-Smtp-Source: ABdhPJw2h+zLfiFmy1/vYQ+pPW3T8gyXYhcSVellhY11QSbB9fvBHSn0PpsPhz/t0gh1fdyNmOTRV2JY5OOjdW9hKlA=
X-Received: by 2002:a25:2346:: with SMTP id j67mr11863490ybj.467.1638411018399; Wed, 01 Dec 2021 18:10:18 -0800 (PST)
MIME-Version: 1.0
References: <CAM70yxDvL1HKZSz8FMwFb64Tvse6TrqUwNgTmeY7kbMYGazDjg@mail.gmail.com> <63CF958A-1153-4932-BE03-3F172E84D60F@tzi.org>
In-Reply-To: <63CF958A-1153-4932-BE03-3F172E84D60F@tzi.org>
From: Emile Cormier <emile.cormier.jr@gmail.com>
Date: Wed, 1 Dec 2021 22:10:07 -0400
Message-ID: <CAM70yxAc7m027ZUnPRRymRuOxDK=1G8NBPohgN7J3V9u3Nuhyg@mail.gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: cbor@ietf.org
Content-Type: multipart/alternative; boundary="000000000000bf74fe05d2204a58"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/LdJEJ0TVSn62k_QtwJ_MYbVgrsQ>
Subject: Re: [Cbor] Challenges in supporting bignum tag
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Dec 2021 02:10:25 -0000

On Wed, Dec 1, 2021 at 9:11 PM Carsten Bormann <cabo@tzi.org> wrote:

I don’t know how other binary interchange formats represent large integers
> (I probably have actively forgotten because I have seen too many horrible
> attempts).
>

I've checked MessagePack, UBJSON, Smile, and BSON. Only Smile has support
for what they call "BigInteger":

*"Encoded as token indicator followed by 7-bit escaped binary (with
Unsigned VInt (no-zigzag encoding) as length indicator) that represent
magnitude value (byte array)"*

So the BigInteger is unsigned-only with Smile; no negative values.

CBOR is the only binary format I'm aware of which supports negative
bignums. I was thinking of the future when (yet another) binary format
might come along and decide to support negative bignums.


> The assumption that led to our design is that most large integers will be
> unsigned, so doing some additional work for negatives is not much of a
> problem.
> If you need to transform a negative integer in sign-magnitude into what
> the target encoding uses to represent large numbers, you will need some
> transformation; e.g., if a 2’s complement representation is used, you will
> need to flip the bits and add or subtract 1 — no such addition/subtraction
> is needed for bignint (just flip the bits ((-1-N) == ~N)).
>

If bignums are used as a mantissa inside bigdecimal and bigfloat (which
don't have a separate sign field), then it's no longer true that negative
bigfloats are rarely used. When transcoding a CBOR bigdecimal or bigfloat
to JSON, the sign of the value will effectively be lost, unless an extra
sign field is added in the array structure (which would make it impossible
to round-trip the resulting structure back to CBOR's original format).


> > 2. In CBOR decoders that ignore tags, the sign of the bignum is lost
> when passed up to the application. This goes against the following passage
> in section 3.4 of the spec:
> >
> > "Decoders do not need to understand tags of every tag number, and tags
> may be of little value in applications where the implementation creating a
> particular CBOR data item and the implementation decoding that stream know
> the semantic meaning of each item in the data flow."
>
> Well, they do need to understand tags 2 and 3 if they have a need for
> negative numbers in addition to unsigned ones.
>

That's my point. Applications are now forced to detect the presence of tags
2 and 3, even if they know to expect a signed bignum. Those applications
can no longer be agnostic of the serialization format (CBOR vs MessagePack).


> The comment you cite is more about tags like 1 or 32 — if the context
> already makes clear that this is a posix time / a URI, respectively, the
> tag may be redundant (but still useful to push handling of platform data
> types into a generic decoder).
>

Tags 1 or 32 are examples of what I consider "nice" semantic tags. Even if
they're absent, an application can still interpret the values correctly by
agreement in the application protocol.


> > Note that the same sort of information loss occurs with typed arrays,
> although it is more likely that an application would know the endianness
> and element size of a received typed array.
>
> Right — same situation — if only one tag can possibly be used, you can
> ignore it; if there are several to choose from, you need the tag number.
>

In the case of typed arrays, the application protocol might have already
established that a specific typed array stride/endianness be used. I think
the loss of the tag is probably less catastrophic in that case.


> > 3. Most arbitrary precision libraries represent big integers in signed
> magnitude format. CBOR wants negative values to be stored as (-1 - N). I
> therefore need to subtract one from the magnitude stored in a negative
> multiprecision integer. Because the multiprecision subtraction algorithm
> needs to start from the least significant byte (to propagate carries), I
> have to perform subtraction on the entire bigint before I can serialize it
> to CBOR. This is because CBOR wants the bytes to be in big endian order.
> It's therefore not possible to process and serialize each source byte "on
> the fly" starting from the most significant byte. If I'm serializing a huge
> bigint, then there's a space doubling penalty in the temporary bigint
> needed for the subtraction.
>
> Yes, that is an interesting observation.
> Maybe it’s worth to spend a tag to represent a negative bigint (bignint)
> in magnitude format.
> But then, tag 3 is not that hard for 1’s complement or magnitude:
>

I'm not sure I want to introduce yet another tag if there's no general
interest other than mine. In my own applications, I can simply establish a
schema of [sign, magnitude_byte_array].

I just realized that the (-1 - N) format is an optimization for the 128-bit
integer case. Those bits can be memcpy'ed into a native uint128 without
worrying about trap representations.


> For creating normal CBOR bignints, an algorithm for subtracting one that
> does not need to create another copy would be to first find the place where
> the carry chain ends from the LSB up (usually the last byte, but could be
> up to the MSB) and then copy out the number starting from MSB up to that
> place, and subtract 1 from each byte from there on (well, spewing out 0xFF
> for all but the highest).  (I hope you don’t have to deal with negative
> zero, but that is an easy special case if you do.  The other special case
> is ingesting a CBOR bignint with 0xFF times N, because you have to add a
> byte for the magnitude.)
>
> > 4. Because of #3, my CBOR codec implementation now depends on
> multiprecision integer arithmetic.
>
> Subtracting one maybe does not cause a need to pull in a big library (see
> example algorithm above).
>

That's some great advice, thanks! I'll have to work it out on paper to
understand it better.


> > In short, my life would be much easier if CBOR bignums were different in
> the following way:
> >
> > - The sign is part of the message payload, and not "deduced" by the tag
> value,
>
> If you mean going to a 2’s complement representation, or stealing the msb
> of the MSB for a sign bit, that would be a big problem as biguints often
> are designed fill all of the bytes, so adding a sign inside the
> representation adds a byte in half of the cases.  Remember that negative
> values are rarely used, and CBOR is optimized for the biguints we find in
> many applications.
>

I meant having the format like this:

<tag> [sign, magnitude_byte_array]

where the sign and magnitude_byte_array are enclosed in a two-element
array. So even if the tag is stripped, an application can still fully
reconstitute the signed bignum.


>
> > - The bytes represent the magnitude of the bignum: no
> subtraction/addition by 1 for negative values.
>
> That could be added (see discussion of adding a tag for this, above).
>

Again, there's no point if I'm the only one interested, but I appreciate
you contemplating it. :-)


> Simplicity and speed depends a lot on what your platform has.
> CBOR is very efficient for biguints.
> If you need to handle negative numbers, unfortunately there is a lot of
> variety, and I’m sorry that what we chose doesn’t map one-to-one with what
> you find in your platform.
>

I know you can't please everyone. :-)


> > I also think some tags have been "abused" for storing information that
> should otherwise be part of the message payload (all in the name of saving
> a few bytes). This makes tags go beyond their purpose of providing semantic
> meaning.
>
> I do not agree at all that using a tag to transfer information is abuse.
> Tags are a significant innovation that make a lot of things much easier
> than they are in other formats.
> (They also can be overused, which is why CDDL has “unwrap” for tags, and
> why the comment mentioned above was made.)
>

Applications that depend on those tags that convey information (other than
semantic meaning) are effectively locked into using CBOR. That's what I
don't like about them.

In web browser land, a simple CBOR-to-JSON converter is no longer
sufficient to preserve the necessary information when tags are stripped out.


> > Sorry if this comes across as a rant. It's not my intention to belittle
> anyone. I just wanted to share my challenges in implementing CBOR for my
> C++ serialization library. Perhaps someone here will have advice on how I
> can overcome these challenges.
>
> See my algorithm proposal for converting negative bignums into magnitude
> and back, above.
>

Yes, that might just be the ticket. Thanks again for that suggestion. :-)

Cheers,
Emile