Re: [Cbor] Challenges in supporting bignum tag

Emile Cormier <emile.cormier.jr@gmail.com> Thu, 02 December 2021 02:41 UTC

Return-Path: <emile.cormier.jr@gmail.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 5FEEC3A0808 for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 18:41:55 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.087
X-Spam-Level:
X-Spam-Status: No, score=-2.087 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, T_SPF_HELO_TEMPERROR=0.01, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id s28foKOvBxTf for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 18:41:43 -0800 (PST)
Received: from mail-yb1-xb36.google.com (mail-yb1-xb36.google.com [IPv6:2607:f8b0:4864:20::b36]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 3CB1C3A0806 for <cbor@ietf.org>; Wed, 1 Dec 2021 18:41:43 -0800 (PST)
Received: by mail-yb1-xb36.google.com with SMTP id x32so69175071ybi.12 for <cbor@ietf.org>; Wed, 01 Dec 2021 18:41:43 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=j9panAfQuf/0TKXov7CYp2OdUc88J1bTgAwBY7T8i4E=; b=bQo8RmfRliNgYHZ75yfe3gadRCSsX1dl+ALBGH9jQhtGmfk2/I39nlQi2kuTEKt/HP /4L+QGgwRcWUTC9TNRWd4BlW7AqJIV+S/KvtrOSdSV+eEbs+we0h9ZF/rmEvc86hzwG0 0+cHbcDo6sGkaTS52DWB1y1gPSqS67LmteHCV3BwAUljtU5JcP8tQ6s7dJCBXp0r7YuG aJuMPS5eX/Dc+LhJe0Pyf/7R68S3R+WdhayiEDyJu9V0KCu3zl0mQbOBou9yH3EZxsje b+8v4GUcnJ0SSp7qUqPC5B4l0IHImgw/qJOfMazbWPtVS45KEEm0IBw4RI0+ryHVfw4+ tkYA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=j9panAfQuf/0TKXov7CYp2OdUc88J1bTgAwBY7T8i4E=; b=6DlVoqJDr+g80z8nGgesB1xQ16RYXUk4H00NSPP879rUdOlJ13ZJckImBNqqMvJyl6 mWwPhQYUa0Qd/qYdFJO+DRCAiqd1cF6JefeU3qq6VPC1wXQx8OAxAIQw4Ff/x+uNopQk zLEn84bZwNGXQGBindJ9gYOBf26W663b/jel39z97nzN60cjz3pF/0JDSVchAJZh8b1j x/6BpK7P9Ll7XqG9JhS/pw6a7Hjp2UMmRkb6KjJ/1bD3zTeQgHvbxZ83/TGoOBY5WTXB zYKQ4WnvvYQQKIdorD7tW2j6ZwFoNYM/CiK08b2Hl7e3g1xGuCuQls/BJR8LuKaiPSLt gW6A==
X-Gm-Message-State: AOAM530fSsPic4jM8YSo7dapjfPxTU9cjeQipEtOO0YlmMNOT8kW6DVH ZmPyTEXE/VFQT+3vvfNZ9NR8eu79QC7TU236aug=
X-Google-Smtp-Source: ABdhPJzMRjeSyui9fo3yOpkE9GNVfm367JIkSEBi1Ikf/Zp7BkVTVQuIQo/VEumiJa0fjWEEQu1BKFdqASLWPjbiPKA=
X-Received: by 2002:a25:d002:: with SMTP id h2mr12756253ybg.581.1638412901146; Wed, 01 Dec 2021 18:41:41 -0800 (PST)
MIME-Version: 1.0
References: <CAM70yxDvL1HKZSz8FMwFb64Tvse6TrqUwNgTmeY7kbMYGazDjg@mail.gmail.com> <63CF958A-1153-4932-BE03-3F172E84D60F@tzi.org> <3FEC84FA-BAAC-47AD-A95A-AE1C74A89E14@island-resort.com>
In-Reply-To: <3FEC84FA-BAAC-47AD-A95A-AE1C74A89E14@island-resort.com>
From: Emile Cormier <emile.cormier.jr@gmail.com>
Date: Wed, 1 Dec 2021 22:41:30 -0400
Message-ID: <CAM70yxA0STNFxQhdqMQV-qcbdva6+BUyr3N4hvxdu9hUSRGb2g@mail.gmail.com>
To: Laurence Lundblade <lgl@island-resort.com>
Cc: Carsten Bormann <cabo@tzi.org>, cbor@ietf.org
Content-Type: multipart/alternative; boundary="000000000000f7e3b405d220ba9c"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/XXmYp07OdhZiEPW037rKhP_g5dQ>
Subject: Re: [Cbor] Challenges in supporting bignum tag
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Dec 2021 02:41:56 -0000

On Wed, Dec 1, 2021 at 9:48 PM Laurence Lundblade <lgl@island-resort.com>
wrote:

> I  continue to spend a lot of time figuring out how tags should be treated
> in QCBOR. I definitely got it wrong my first time some years ago. Part of
> the trouble was that they were called “Optional Tags” in RFC 7049.  I
> really don’t think of them as optional any more.
>

After implementing time support in my C++ serialization library, I'm
starting to agree with you about tags not being optional anymore. CBOR tags
1001 and 1002 for extended time may contain bignums (via bigfloats or
bigdecimals), where their sign will be stripped off if you just dump them
in a generic parse tree without the tags.

In addition to the classic null, bool, number, array, object/map parse
events from JSON, I'm now having to introduce "special" events like
time_length (tag 1002), scaled_time (tag 1001), time_interval (tag 1003),
big_int, big_float, and big_decimal. These special events contain an
"intermediary" representation of the data exchanged between codecs (CBOR,
MessagePack, JSON) and the C++/Boost data types (std::chrono::time_point,
Boost.DateTime, Boost.Multiprecision, etc). It doesn't help that various
codecs have different opinions on what the wire format should be like
(ISO8601 time strings in JSON vs extended time maps in CBOR vs extended
array time type in MessagePack).


> I’ve also gone to some trouble to deal with the cases were a tag is
> “implicit” rather than explicit. Implicit here means that data item in the
> protocol is implied to have the format of a tag’s content, but there’s no
> tag number. This is “unwrapped” in CDDL. It took some extra work for big
> nums because the sign is in the tag.
>

I might have to check out your QCBOR code for inspiration in dealing with
bignums.


> I don’t think it would be a good idea to change this or define a new tag
> at this point. There’s too many number format tags CBOR already.
>

I agree, and I wasn't expecting there to be a new tag for signed bignums.
My "rant" was more to get some helpful advice (which I did), and just in
case the inventor of the next binary-JSON-like protocol happens to read it
(so that they can avoid the same pain points).

On Wed, Dec 1, 2021 at 9:48 PM Laurence Lundblade <lgl@island-resort.com>
wrote:

> I’ve implemented CBOR big numbers in QCBOR including conversion to double.
>
> Agree that having the sign in the tag made it more work, but there would
> have been work if the sign was elsewhere. Also agree that it is a bit of an
> exception compared to the way other standard tags work. The QCBOR API for
> big nums is a little different than for other tags whose content is a
> string. However in the end, the compromise seems reasonable.
>
> I  continue to spend a lot of time figuring out how tags should be treated
> in QCBOR. I definitely got it wrong my first time some years ago. Part of
> the trouble was that they were called “Optional Tags” in RFC 7049.  I
> really don’t think of them as optional any more. A decoder really can’t
> ignore them in any general way. I don’t think of a tag with content that is
> a byte string as a byte string. It is some new type that you need to
> explicitly know about to interpret correctly and a generic decoder API has
> to take that into account for just about all tags.
>
> I’ve also gone to some trouble to deal with the cases were a tag is
> “implicit” rather than explicit. Implicit here means that data item in the
> protocol is implied to have the format of a tag’s content, but there’s no
> tag number. This is “unwrapped” in CDDL. It took some extra work for big
> nums because the sign is in the tag.
>
> I don’t think it would be a good idea to change this or define a new tag
> at this point. There’s too many number format tags CBOR already.
>
> LL
>
>
>
> > On Dec 1, 2021, at 5:11 PM, Carsten Bormann <cabo@tzi.org> wrote:
> >
> > Hi Emile,
> >
> > On 2021-12-02, at 00:38, Emile Cormier <emile.cormier.jr@gmail.com>
> wrote:
> >>
> >> Hi Everyone,
> >>
> >> I'm in the process of supporting CBOR "bignums" (which I think should
> have been named "bigint”)
> >
> > Indeed, “bignum” is a historical name at best.
> > But it has stuck for half a century (since about 1970 [1]).
> >
> >> in my C++ serialization library, and have encountered the following
> challenges:
> >>
> >> 1. When transcoding to other binary formats (such as MessagePack), the
> sign of the bignum is lost because it is deduced by the tag value and is
> not part of the message payload. I have to resort to converting it to
> [sign, byte_array], which makes it no longer round-trippable to its
> original CBOR format.
> >
> > I don’t know how other binary interchange formats represent large
> integers (I probably have actively forgotten because I have seen too many
> horrible attempts).
> >
> > The assumption that led to our design is that most large integers will
> be unsigned, so doing some additional work for negatives is not much of a
> problem.
> > If you need to transform a negative integer in sign-magnitude into what
> the target encoding uses to represent large numbers, you will need some
> transformation; e.g., if a 2’s complement representation is used, you will
> need to flip the bits and add or subtract 1 — no such addition/subtraction
> is needed for bignint (just flip the bits ((-1-N) == ~N)).
> >
> >> 2. In CBOR decoders that ignore tags, the sign of the bignum is lost
> when passed up to the application. This goes against the following passage
> in section 3.4 of the spec:
> >>
> >> "Decoders do not need to understand tags of every tag number, and tags
> may be of little value in applications where the implementation creating a
> particular CBOR data item and the implementation decoding that stream know
> the semantic meaning of each item in the data flow."
> >
> > Well, they do need to understand tags 2 and 3 if they have a need for
> negative numbers in addition to unsigned ones.
> >
> > The comment you cite is more about tags like 1 or 32 — if the context
> already makes clear that this is a posix time / a URI, respectively, the
> tag may be redundant (but still useful to push handling of platform data
> types into a generic decoder).
> >
> >> So even if the application knows the semantic meaning of the received
> byte array (it's expecting a bignum), it cannot reconstitute the sign of
> the received magnitude without knowing the tag.
> >
> > Correct.
> > Again, most applications don’t use negative bignums, so if the
> application knows that it expects a biguint, the tag has no additional
> information.  But if you need both kinds of bignums, you need to read the
> tag.
> >
> >> Note that the same sort of information loss occurs with typed arrays,
> although it is more likely that an application would know the endianness
> and element size of a received typed array.
> >
> > Right — same situation — if only one tag can possibly be used, you can
> ignore it; if there are several to choose from, you need the tag number.
> >
> >> 3. Most arbitrary precision libraries represent big integers in signed
> magnitude format. CBOR wants negative values to be stored as (-1 - N). I
> therefore need to subtract one from the magnitude stored in a negative
> multiprecision integer. Because the multiprecision subtraction algorithm
> needs to start from the least significant byte (to propagate carries), I
> have to perform subtraction on the entire bigint before I can serialize it
> to CBOR. This is because CBOR wants the bytes to be in big endian order.
> It's therefore not possible to process and serialize each source byte "on
> the fly" starting from the most significant byte. If I'm serializing a huge
> bigint, then there's a space doubling penalty in the temporary bigint
> needed for the subtraction.
> >
> > Yes, that is an interesting observation.
> > Maybe it’s worth to spend a tag to represent a negative bigint (bignint)
> in magnitude format.
> > But then, tag 3 is not that hard for 1’s complement or magnitude:
> >
> > For creating normal CBOR bignints, an algorithm for subtracting one that
> does not need to create another copy would be to first find the place where
> the carry chain ends from the LSB up (usually the last byte, but could be
> up to the MSB) and then copy out the number starting from MSB up to that
> place, and subtract 1 from each byte from there on (well, spewing out 0xFF
> for all but the highest).  (I hope you don’t have to deal with negative
> zero, but that is an easy special case if you do.  The other special case
> is ingesting a CBOR bignint with 0xFF times N, because you have to add a
> byte for the magnitude.)
> >
> >> 4. Because of #3, my CBOR codec implementation now depends on
> multiprecision integer arithmetic.
> >
> > Subtracting one maybe does not cause a need to pull in a big library
> (see example algorithm above).
> >
> >> I'm now forced to do one of the following:
> >>
> >> a) Add a dependency to a multiprecision arithmetic library, which may
> not be the same one that's used by the application. For example, my library
> depends on Boost.Multiprecision, but the application wants to serialize a
> GMP bignum. The application now ends up with dependencies on two
> multiprecision libraries - yuck!
> >>
> >> b) Write the multiprecision subtraction by hand to avoid introducing a
> dependency on a multiprecision arithmetic library. While the algorithm
> isn't too complicated, what I write is certain to be slower than a
> multiprecision library.
> >>
> >> c) Require that the intermediary byte format passed to the encoder be
> the same as what CBOR expects (which may not be the same format as what
> other binary serialization formats want). For example, the header in my
> library which adds GMP serialization support could use GMP itself to
> perform the -1 subtraction and pass those bytes to the encoder. I should
> note here that the encoder interface in my library is agnostic of the
> serialization format. That is, the JSON encoder interface is the same as
> the CBOR encoder interface. This makes it easy for the application to
> serialize to different formats (JSON, CBOR, etc) without having to change
> anything in the application's serialization code.
> >
> > I think the algorithm I gave above is simple enough that this should
> obviate these considerations.
> >
> >> In short, my life would be much easier if CBOR bignums were different
> in the following way:
> >>
> >> - The sign is part of the message payload, and not "deduced" by the tag
> value,
> >
> > If you mean going to a 2’s complement representation, or stealing the
> msb of the MSB for a sign bit, that would be a big problem as biguints
> often are designed fill all of the bytes, so adding a sign inside the
> representation adds a byte in half of the cases.  Remember that negative
> values are rarely used, and CBOR is optimized for the biguints we find in
> many applications.
> >
> >> - The bytes represent the magnitude of the bignum: no
> subtraction/addition by 1 for negative values.
> >
> > That could be added (see discussion of adding a tag for this, above).
> >
> >> There are other tags that currently depend on the existing bignum tag
> (bigdecimal, bigfloat, extended time), so I don't expect there to be much
> interest in a new "sane bignum" tag that addresses the "flaws" I brought up
> here. I'm just documenting the challenges I'm facing with the existing
> bignum tag so that they're considered if there's ever a fork of this
> standard, or if some other binary protocol wants to implement bignums.
> >>
> >> Overall I feel there's too much emphasis in CBOR on cramming everything
> into as few bytes as possible, and not enough on decoding simplicity and
> speed.
> >
> > Simplicity and speed depends a lot on what your platform has.
> > CBOR is very efficient for biguints.
> > If you need to handle negative numbers, unfortunately there is a lot of
> variety, and I’m sorry that what we chose doesn’t map one-to-one with what
> you find in your platform.
> >
> >> I also think some tags have been "abused" for storing information that
> should otherwise be part of the message payload (all in the name of saving
> a few bytes). This makes tags go beyond their purpose of providing semantic
> meaning.
> >
> > I do not agree at all that using a tag to transfer information is abuse.
> > Tags are a significant innovation that make a lot of things much easier
> than they are in other formats.
> > (They also can be overused, which is why CDDL has “unwrap” for tags, and
> why the comment mentioned above was made.)
> >
> >> Sorry if this comes across as a rant. It's not my intention to belittle
> anyone. I just wanted to share my challenges in implementing CBOR for my
> C++ serialization library. Perhaps someone here will have advice on how I
> can overcome these challenges.
> >
> > See my algorithm proposal for converting negative bignums into magnitude
> and back, above.
> >
> > Grüße, Carsten
> >
> > [1]: https://www.dreamsongs.com/Files/HOPL2-Uncut.pdf
> >
> > _______________________________________________
> > CBOR mailing list
> > CBOR@ietf.org
> > https://www.ietf.org/mailman/listinfo/cbor
>
>