[Cbor] Challenges in supporting bignum tag

Emile Cormier <emile.cormier.jr@gmail.com> Wed, 01 December 2021 23:38 UTC

Return-Path: <emile.cormier.jr@gmail.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 210743A0CFA for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 15:38:38 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.098
X-Spam-Level:
X-Spam-Status: No, score=-2.098 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id nu60VSHQWcnl for <cbor@ietfa.amsl.com>; Wed, 1 Dec 2021 15:38:35 -0800 (PST)
Received: from mail-yb1-xb2d.google.com (mail-yb1-xb2d.google.com [IPv6:2607:f8b0:4864:20::b2d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 30A283A0CF6 for <cbor@ietf.org>; Wed, 1 Dec 2021 15:38:35 -0800 (PST)
Received: by mail-yb1-xb2d.google.com with SMTP id f186so68153086ybg.2 for <cbor@ietf.org>; Wed, 01 Dec 2021 15:38:35 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:from:date:message-id:subject:to; bh=4htvDYwNCDhO+kT+ZarIi2Me0hvnN7VeqakU5LpTt2g=; b=Ctif0P8/NvKHXxMo8NqbX3n4LYvlkWSNgH1+Vq4Q5H+glwLlRQOnBmEqz2HpbMJT9x uPGG4qrGXXKK3ixDwp8OhWgfqz8oWBpqi/J7Sq/hs5f1floTFwmGOGds2dJp3arxgxhn ScC/VWvyORiH/RqmvVKU8x6daYDt+r5UJkd3UTcrgeDGolQOcIOJI7U4SLZYFXdfLkiD zvbm66/5UpdE92H7XcBc01TnZJTqaoytAq4XAvgrNdCrN/RnEspIj2WT6jluAGQK+hWb 5HOtvLyBDOFMhOBS74qLFRxvuNrV7pbXatwaQAVrVL8QYm+1QfqJ4zahgCWutxIHKKoD OelA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=4htvDYwNCDhO+kT+ZarIi2Me0hvnN7VeqakU5LpTt2g=; b=aVSCdRoOR36thZZrTg1rk0L4rS/APYI9r/hZDlKfnBXhxfb05g13IyTHkcItRQwr8o 6CK+wkjq4APTFeBmZix2kVpg1mbHk0pUw+4Q9ZXNsXAVQm+k7XMdpOcMEkklOnYVM9Ye esTWdcqqP/NHowL2doWM1sPZ4DtFH7m+V+Jyk99Savfn3sw6fw8Ooh087i3+jlHAv4vo lmVPeryT/vT9B3t3miubSPOYwNcRSxDL3dQVI1pkiduIbjcJAUNwj/VQy7upXH29x2o1 SidrWOtTzuGWU5O7yRk1+MT6Sbf96SjH9FKe3g0S2lScB0C2zcZSaAzRhVu3zj6ek9Go GexA==
X-Gm-Message-State: AOAM531MMOmcTHO24lvr3itXWpQ1t32wgLZaki9OgMC4wDxAh3+WnyN0 5iJmnxUX+POQ8QIAPAmdN+PHK6KhqYRLAx/8KAhfbKNmEBI=
X-Google-Smtp-Source: ABdhPJyU1jj/WtChQs0/rY8hQ+DETHF80hE8rla1QxHlSDyGLgvmZZDVILz2vURGnCx0YWedqWAFX+3qFnxUdoRXeRM=
X-Received: by 2002:a25:bf8f:: with SMTP id l15mr11116786ybk.670.1638401911474; Wed, 01 Dec 2021 15:38:31 -0800 (PST)
MIME-Version: 1.0
From: Emile Cormier <emile.cormier.jr@gmail.com>
Date: Wed, 1 Dec 2021 19:38:20 -0400
Message-ID: <CAM70yxDvL1HKZSz8FMwFb64Tvse6TrqUwNgTmeY7kbMYGazDjg@mail.gmail.com>
To: cbor@ietf.org
Content-Type: multipart/alternative; boundary="000000000000eece7d05d21e2b53"
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/rBkvIe6h-RXqaswZ0qyz01AdcBs>
Subject: [Cbor] Challenges in supporting bignum tag
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 01 Dec 2021 23:38:38 -0000

Hi Everyone,

I'm in the process of supporting CBOR "bignums" (which I think should have
been named "bigint") in my C++ serialization library, and have encountered
the following challenges:

1. When transcoding to other binary formats (such as MessagePack), the sign
of the bignum is lost because it is deduced by the tag value and is not
part of the message payload. I have to resort to converting it to [sign,
byte_array], which makes it no longer round-trippable to its original CBOR
format.

2. In CBOR decoders that ignore tags, the sign of the bignum is lost when
passed up to the application. This goes against the following passage in
section 3.4 of the spec:

*"Decoders do not need to understand tags of every tag number, and tags may
be of little value in applications where the implementation creating a
particular CBOR data item and the implementation decoding that stream know
the semantic meaning of each item in the data flow."*

So even if the application knows the semantic meaning of the received byte
array (it's expecting a bignum), it cannot reconstitute the sign of the
received magnitude without knowing the tag.

Note that the same sort of information loss occurs with typed arrays,
although it is more likely that an application would know the endianness
and element size of a received typed array.

3. Most arbitrary precision libraries represent big integers in signed
magnitude format. CBOR wants negative values to be stored as (-1 - N). I
therefore need to subtract one from the magnitude stored in a negative
multiprecision integer. Because the multiprecision subtraction algorithm
needs to start from the least significant byte (to propagate carries), I
have to perform subtraction on the entire bigint before I can serialize it
to CBOR. This is because CBOR wants the bytes to be in big endian order.
It's therefore not possible to process and serialize each source byte "on
the fly" starting from the most significant byte. If I'm serializing a huge
bigint, then there's a space doubling penalty in the temporary bigint
needed for the subtraction.

4. Because of #3, my CBOR codec implementation now depends on
multiprecision integer arithmetic. I'm now forced to do one of the
following:

a) Add a dependency to a multiprecision arithmetic library, which may not
be the same one that's used by the application. For example, my library
depends on Boost.Multiprecision, but the application wants to serialize a
GMP bignum. The application now ends up with dependencies on *two*
multiprecision libraries - yuck!

b) Write the multiprecision subtraction by hand to avoid introducing a
dependency on a multiprecision arithmetic library. While the algorithm
isn't too complicated, what I write is certain to be slower than a
multiprecision library.

c) Require that the intermediary byte format passed to the encoder be the
same as what CBOR expects (which may not be the same format as what other
binary serialization formats want). For example, the header in my library
which adds GMP serialization support could use GMP itself to perform the -1
subtraction and pass those bytes to the encoder. I should note here that
the encoder interface in my library is agnostic of the serialization
format. That is, the JSON encoder interface is the same as the CBOR encoder
interface. This makes it easy for the application to serialize to different
formats (JSON, CBOR, etc) without having to change anything in the
application's serialization code.


In short, my life would be much easier if CBOR bignums were different in
the following way:

- The sign is part of the message payload, and not "deduced" by the tag
value,
- The bytes represent the magnitude of the bignum: no subtraction/addition
by 1 for negative values.

There are other tags that currently depend on the existing bignum tag
(bigdecimal, bigfloat, extended time), so I don't expect there to be much
interest in a new "sane bignum" tag that addresses the "flaws" I brought up
here. I'm just documenting the challenges I'm facing with the existing
bignum tag so that they're considered if there's ever a fork of this
standard, or if some other binary protocol wants to implement bignums.

Overall I feel there's too much emphasis in CBOR on cramming everything
into as few bytes as possible, and not enough on decoding simplicity and
speed. I also think some tags have been "abused" for storing information
that should otherwise be part of the message payload (all in the name of
saving a few bytes). This makes tags go beyond their purpose of providing
semantic meaning.

Sorry if this comes across as a rant. It's not my intention to belittle
anyone. I just wanted to share my challenges in implementing CBOR for my
C++ serialization library. Perhaps someone here will have advice on how I
can overcome these challenges.

Cheers,
Emile Cormier