Re: [Multiformats] Multiformats Considered Harmful

Aaron Goldman <goldmanaaron@gmail.com> Mon, 18 September 2023 18:49 UTC

From: Aaron Goldman <goldmanaaron@gmail.com>
Date: Mon, 18 Sep 2023 11:49:37 -0700
Message-ID: <CAE6sXqh5=xp1YG5ZKfcNp=_OfUTOJ7Q050U_JeUQMgR7_H3FUQ@mail.gmail.com>
To: multiformats@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/multiformats/w-etaODmYp82mfRa9FU7yo12ia0>
Subject: Re: [Multiformats] Multiformats Considered Harmful

Sorry for the Markdown; this was composed with GitHub Markdown.
https://github.com/msporny/charter-ietf-multiformats/issues/2

### 1.
> Multiformats institutionalize the failure to make a choice, which is the opposite of what good standards do. Good
> standards make choices about representations of data structures resulting in interoperability, since every
> conforming implementation uses the same representation. In contrast, Multiformats enable different implementations
> to use a multiplicity of different representations for the same data, harming interoperability.
> [datatracker.ietf.org/doc/html/draft-Multiformats-Multibase-03#appendix-D.1](https://datatracker.ietf.org/doc/html/draft-Multiformats-Multibase-03#appendix-D.1)
> defines 23 equivalent and non-interoperable representations for the same data!

**Multibase** specifically and **Multiformats** more generally are standards for decoupling. A good example of a decoupling
standard is [IPv4](https://en.wikipedia.org/wiki/Internet_Protocol_version_4)/[IPv6](https://en.wikipedia.org/wiki/IPv6)
and the [IP protocol numbers](https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers). IPv4 has `Protocol` and
IPv6 has the `Next Header` but they share the same [IANA registry](https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml).
We could call this a "failure to make a choice" as IP did not choose the format of the layers above and below IP, or
we could view it as a deliberate decoupling of the layers of the network stack. Whether it was a good or bad design,
it did enable innovation in what types of content IP is capable of encapsulating. There are 146 protocols in the
registry, and some routers don't implement them all, preferring just ICMP, UDP, and TCP, but IPv4/IPv6 have still
proved useful.

The **Multibase** standard solves the problem of representing bytes in text
strings with restricted character sets,
without needing to know in advance what the restrictions will be. This is
independent and separate from all the
other **Multiformat** standards.

The **Multiformat** standard solves the problem of providing a "tag" to
specify what the next "value" is, same as IPv4's
`Protocol` header or HTTP's `Content-Type` header.
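
As a rough illustration of the tag-then-value idea, here is a minimal Python sketch. The three protocol numbers are the well-known IANA assignments for ICMP, TCP, and UDP; the function name `describe_payload` and the wording of its output are made up for this example.

```python
from typing import Dict

# Three well-known entries from the IANA protocol-numbers registry.
IP_PROTOCOLS: Dict[int, str] = {1: "ICMP", 6: "TCP", 17: "UDP"}


def describe_payload(tag: int, payload: bytes) -> str:
    """Use the tag to decide how the bytes that follow should be interpreted."""
    name = IP_PROTOCOLS.get(tag)
    if name is None:
        # An implementation may skip tags it does not know without breaking the others.
        return f"unknown protocol {tag}: {len(payload)} bytes left uninterpreted"
    return f"{name} payload of {len(payload)} bytes"


print(describe_payload(6, b"abc"))
print(describe_payload(70, b""))
```

The point is only that the tag is interpreted through a shared registry, while the layer carrying the tag does not need to understand the value.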

### 2.
> The stated purpose of "[Multibase](https://www.ietf.org/archive/id/draft-Multiformats-Multibase-08.html)" is
> "Unfortunately, it's not always clear what base encoding is used; that's where this specification comes in. It
> answers the question: Given data 'd' encoded into text 's', what base is it encoded with?", which is wholly
> unnecessary. Successful standards DEFINE what encoding is used where. For instance,
> [rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2](https://www.rfc-editor.org/rfc/rfc7518.html#section-6.2.1.2)
> defines that "x" is base64url encoded. No guesswork or prefixing is necessary or useful.

Some standards do specify a specific encoding. **Multibase** will not prevent any past or future standard from specifying
that a text field is `Base64url`, for example. It does enable future standards to specify that bytes are encoded as a
**Multibase** string.

**Multibase** is a set of encodings that allow an array of bytes to be encoded as text under character-set restrictions
that may not always be known in advance. If we had a protocol that had a 32-byte number, and we needed to represent
those bytes as text, we could represent them as:

| Base            | Literal |
|-----------------|---------|
| b256(bytes)     | (non-ascii bytes not representable here) |
| b85             | <FLd+nEV_Rn)~#~nQyryC$2%{WSf&rq?MT)cv84k |
| b64             | 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= |
| b32             | 4OYMIQUY7QOBJGX36TEJS35ZEQT24QPEMSNZGTFESWMRW6CSXBKQ==== |
| b16             | E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 |
| integer_literal | 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 |
| integer_literal | 102987336249554097029535212322581322789799900648198034993379397001115665086549 |
| integer_literal | 0o16166061041230770160244657576462114557562220475344074431115623231222254621557024534125 |
| integer_literal | 0b1110001110110000110001000100001010011000111111000001110000010100100110101111101111110100110010001001100101101111101110010010010000100111101011100100000111100100011001001001101110010011010011001010010010010101100110010001101101111000010100101011100001010101 |

By using an integer literal, I can describe both the number and the base that the number is represented in. In this
case, we represent hex in a text that only needs to be able to support `0123456789abcdefx`, binary with just `01b`, and
so on. **Multibase** takes this further by requiring that the first byte (indicating the base) is one of the bytes from the
alphabet of the encoding. This way the prefix adds no character requirement beyond what the encoding already needs.
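
Integer literals already work this way in many languages. As a small sketch in Python, `int(text, 0)` reads the base from the `0x`, `0o`, or `0b` prefix (and falls back to decimal when there is none). The helper name `parse_literal` is made up for this example; the hex value is the one from the table above.

```python
HEX_LITERAL = "0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"


def parse_literal(text: str) -> int:
    # base=0 tells int() to infer the base from the literal's own prefix.
    return int(text, 0)


value = parse_literal(HEX_LITERAL)

# Re-render the same number in decimal, octal, and binary, then parse each form back.
other_forms = [str(value), oct(value), bin(value)]
assert all(parse_literal(form) == value for form in other_forms)

# The number is just a carrier for the original 32 bytes.
raw = value.to_bytes(32, "big")
assert raw.hex() == HEX_LITERAL[2:]
```

The reader never has to be told out of band which base was used; the literal carries that information itself.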

An example of this adding value is when **Multibase** was chosen for IPFS
CIDs. The CIDs were traditionally in `base58btc`,
which is case-sensitive. This worked well for representing bytes in the
restricted text environment of file paths and
URI paths. This could have easily been specified as a `base58btc` string,
but fortunately they chose **Multibase** to
decouple the bytes of the CID from the string representation. When the time
came that they wanted to put CIDs into
subdomains, the case-insensitive subdomains were a _more_ restricted text
environment that they had not anticipated. They
switched to `base32` which was not case-sensitive and thus able to
represent the same bytes in a more restricted
environment.
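
To make the decoupling concrete, here is a minimal Python sketch of prefix-dispatched encoding and decoding. It assumes the prefix characters `f` (base16, lowercase), `b` (base32, lowercase, no padding), and `m` (base64, no padding) as I understand them from the Multibase draft table. The function names are hypothetical, and `base58btc` is left out because the Python standard library has no codec for it.

```python
import base64


def _b32_encode(data: bytes) -> str:
    return base64.b32encode(data).decode("ascii").rstrip("=").lower()


def _b32_decode(text: str) -> bytes:
    # Restore the padding that was stripped on encode.
    return base64.b32decode(text.upper() + "=" * (-len(text) % 8))


def _b64_encode(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii").rstrip("=")


def _b64_decode(text: str) -> bytes:
    return base64.b64decode(text + "=" * (-len(text) % 4))


# prefix character -> (encoder, decoder); prefixes assumed from the Multibase draft table
CODECS = {
    "f": (lambda data: data.hex(), bytes.fromhex),
    "b": (_b32_encode, _b32_decode),
    "m": (_b64_encode, _b64_decode),
}


def multibase_encode(prefix: str, data: bytes) -> str:
    encode, _ = CODECS[prefix]
    return prefix + encode(data)


def multibase_decode(text: str) -> bytes:
    _, decode = CODECS[text[0]]  # the first character names the base
    return decode(text[1:])


data = bytes(range(32))
as_base64 = multibase_encode("m", data)
as_base32 = multibase_encode("b", data)

# Same bytes either way; only the text form changes.
assert multibase_decode(as_base64) == multibase_decode(as_base32) == data
# The base32 form survives a case-folding environment; the base64 form does not.
assert as_base32.lower() == as_base32
assert as_base64.lower() != as_base64
```

A consumer that only calls `multibase_decode` does not need to change when a producer moves from one base to another, which is the property the CID migration relied on.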

**Multibase** is orthogonal to **Multiformats** and should be standardized
as a way to represent bytes in a restricted text
environment that is restricted in ways that are irrelevant to the bytes
being represented. If we don't know whether
our data will need to be represented as compact arbitrary bytes, 7-bit safe
ascii, JSON non-escaped ascii, CSV
non-escaped ascii, TSV non-escaped ascii, URL path-safe ascii,
domain-name-safe ascii, decimal numbers only, some
not yet known but soon to be important environment, etc. then encoding the
bytes as **Multibase** has decoupling value.

### 3.
> Standardization of Multiformats would result in unnecessary and unhelpful duplication of functionality – especially
> of key representations. The primary use of Multiformats is for "publicKeyMultibase" – a representation of public
> keys that are byte arrays. For instance, the only use of Multiformats by the [W3C DID spec](https://www.w3.org/TR/did-core/)
> is for publicKeyMultibase. The IETF already has several perfectly good key representations, including X.509, JSON
> Web Key (JWK), and COSE_Key. There's not a compelling case for another one.

The standardization of **Multiformats** is independent of whether IETF
chooses to standardize `publicKeyMultibase`.

For example, the IPv4 `Protocol` header registers `70` `VISA` `VISA Protocol`. This does not imply that IETF needs to
specify [VISA Protocol](https://en.wikipedia.org/wiki/Virtual_instrument_software_architecture). In fact, as far as we
can tell, it is the IVI Foundation that maintains that standard. In the
exact same way, the only interaction between
**Multiformats** standardization and `publicKeyMultibase` is that
`publicKeyMultibase` could use the **Multiformats**
registry to map numbers to key representations. Any flaws in
`publicKeyMultibase` are no better an argument against
standardization of **Multiformats** than the flaws in VISA Protocol are
against standardization of IPv4 and the IANA
protocol-numbers registry.

If X.509, JSON Web Key (JWK), or COSE_Key become the standard way to
represent keys for the web then `publicKeyMultibase`
could just add a **Multiformats** registry entry for X.509 or JWK, and
`publicKeyMultibase` would just be a wrapper around
those representations. COSE is already present in the registry.

### 4.
> publicKeyMultibase can only represent a subset of the key types used in practice. Representing many kinds of keys
> requires multiple values – for instance, RSA keys require both an exponent and a modulus. By comparison, the X.509,
> JWK, and COSE_Key formats are flexible enough to represent all kinds of keys. It makes little to no sense to
> standardize a key format that limits implementations to only certain kinds of keys.

Please see above. `publicKeyMultibase` is outside the scope of this working group, which is
[tasked](https://github.com/msporny/charter-ietf-multiformats) with producing the following artifacts:
> 1. An RFC specifying multibase usage
> 2. An RFC defining an independent multibase registry and populating it with today's already-implemented stable and final values
> 3. An RFC defining a registry-group for all the multicodecs, empty at inception, with registration process and group-wide constraints on registration values
> 4. An RFC specifying multihash usage
> 5. An RFC defining a multihash registry within the multicodecs registry group and populating it with today's already-implemented stable and final values

The **Multiformats-varint** spec is also pulled in, as it is needed to specify the length in **Multihash** and
**Multiformat** with sized payloads.

### 5.
> The "[multihash](https://www.ietf.org/archive/id/draft-Multiformats-multihash-07.html)" specification relies on a
> non-standard representation of integers called "Dwarf". Indeed, the referenced Dwarf document lists itself as being
> at [http://dwarf.freestandards.org](http://dwarf.freestandards.org/) – a URL that no longer exists!

We agree here - the **Multiformats-varint** is close to but not exactly Dwarf. This is due to the fact that the
**Multiformats-varint** is limited to 9 bytes. It is a 1-to-9 byte representation of an unsigned int63, from 0x00 (0)
to 0x7FFFFFFF_FFFFFFFF (9223372036854775807). This means the decoded value will always fit in either a signed int64 or
an unsigned int64. If the most-significant bit of a byte is 0, this is the last byte of the **Multiformats-varint**. If it
is 1, there is at least one more byte present in the **Multiformats-varint**. The 7 remaining bits are the payload bits.
You can shift the payload bits left by `7 * (byte number)`, counting bytes from 0, and `|` (bitwise-OR) them in to get
the decoded number.
```
| Length in bytes | Payload bits | Bytes, in wire order                                                             |
|-----------------|--------------|----------------------------------------------------------------------------------|
| 1               | 7            | 0xxxxxxx                                                                         |
| 2               | 14           | 1xxxxxxx 0xxxxxxx                                                                |
| 3               | 21           | 1xxxxxxx 1xxxxxxx 0xxxxxxx                                                       |
| 4               | 28           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                              |
| 5               | 35           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                                     |
| 6               | 42           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                            |
| 7               | 49           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx                   |
| 8               | 56           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx          |
| 9               | 63           | 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx |
|                 | bit numbers  | 7..1     14..8    21..15   28..22   35..29   42..36   49..43   56..50   63..57   |
```
**Multiformats-varint** is such a simple varint that there is no reason to
point anywhere else. The **Multiformats-varint**
should be specified by this working group alongside **Multibase** and
**Multihash**. Any reference to Dwarf is simply
unnecessary as it is clearer to specify **Multiformats-varint** rather than
trying to describe it relative to a similar
but non-identical varint.
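
The rules described above are short enough to sketch in full. The following Python is a minimal illustration under the constraints stated in this message (at most 9 bytes, values up to 2^63 - 1); the function names are made up here, and the sketch does not reject non-minimal encodings, which a real specification may want to address.

```python
from typing import Tuple

MAX_VARINT_BYTES = 9
MAX_VALUE = (1 << 63) - 1  # 0x7FFFFFFF_FFFFFFFF


def encode_varint(value: int) -> bytes:
    if not 0 <= value <= MAX_VALUE:
        raise ValueError("value does not fit in an unsigned int63")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the top bit
        else:
            out.append(low7)  # last byte: top bit stays 0
            return bytes(out)


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Return (value, number of bytes consumed) for a varint at the start of data."""
    value = 0
    for index, byte in enumerate(data[:MAX_VARINT_BYTES]):
        value |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:
            return value, index + 1
    raise ValueError("varint is truncated or longer than 9 bytes")


print(encode_varint(300).hex())
print(decode_varint(b"\x80\x01trailing"))
```

Because the decoder reports how many bytes it consumed, a caller can keep reading whatever follows the varint, which is how a length prefix is used.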

### 6.
> The "Multihash Identifier Registry" at
> [ietf.org/archive/id/draft-Multiformats-multihash-07.html#mh-registry](https://www.ietf.org/archive/id/draft-Multiformats-multihash-07.html#mh-registry)
> duplicates the functionality of the IANA "Named Information Hash Algorithm Registry" at
> [iana.org/assignments/named-information/named-information.xhtml#hash-alg](https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg),
> in that both assign (different) numeric identifiers for hash functions. If multihash goes forward, it should use
> the existing registry.

"Not all uses of these names require use of the full hash output --
truncated hashes can be safely used in some
environments.  For this reason, we define a new IANA registry for hash
functions to be used with this specification so
as not to mix strong and weak (truncated) hash algorithms in other protocol
registries."
-- [rfc6920: Naming Things with Hashes](https://www.iana.org/go/rfc6920)

The named-information registry assigns one ID to each combination of hash function and prefix length, for use in the
binary encoding of a `ni://` or a `nih://`. This is limited to a 6-bit field but the **Multiformats** registry intends
to support more than 64 algorithm/size pairs.

| hash      | sizes |
|-----------|-------|
| identity  | 1     |
| sha1      | 1     |
| sha2      | 9     |
| sha2a     | 1     |
| sha3      | 4     |
| keccak    | 5     |
| blake3    | 1     |
| md4       | 1     |
| md5       | 1     |
| blake2b   | 64    |
| blake2s   | 32    |
| skein256  | 32    |
| skein512  | 64    |
| skein1024 | 128   |

We can't fit hundreds of hash function length pairs in a 64-entry registry. Moving to that registry would also break
backwards compatibility because it changes which numbers match which hash functions, and it would pollute
the registry for rfc6920 implementors by including non-cryptographically secure hash functions. Lastly, the
**Multiformats** registry already contains more than 64 hash functions and would not fit in the Named Information Hash
Algorithm Registry.
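
The arithmetic can be checked directly from the table above. This small Python sketch just sums the "sizes" column and compares it with the capacity of a 6-bit field; the variable names are made up for the example.

```python
# "sizes" column from the table above: output sizes registered per hash family
SIZES_PER_HASH = {
    "identity": 1, "sha1": 1, "sha2": 9, "sha2a": 1, "sha3": 4,
    "keccak": 5, "blake3": 1, "md4": 1, "md5": 1,
    "blake2b": 64, "blake2s": 32, "skein256": 32, "skein512": 64,
    "skein1024": 128,
}

NI_ID_BITS = 6
capacity = 2 ** NI_ID_BITS  # entries a 6-bit field can address
needed = sum(SIZES_PER_HASH.values())  # one entry per algorithm/size pair

print(f"capacity={capacity}, needed={needed}")  # capacity=64, needed=344
```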

It is better to have hash function and length as two different fields as in
**Multihash**.
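
As a sketch of what two separate fields look like on the wire, the following Python builds and splits a multihash for SHA-256. It assumes the code `0x12` for `sha2-256` from the multicodec table and relies on the fact that any value below `0x80` is a single-byte varint; the function names are hypothetical, and multi-byte varints are deliberately not handled.

```python
import hashlib
from typing import Tuple

SHA2_256_CODE = 0x12  # assumed: the sha2-256 entry in the multicodec table


def multihash_sha2_256(data: bytes, length: int = 32) -> bytes:
    """Layout: <hash function code> <digest length> <digest bytes>."""
    if not 1 <= length <= 32:
        raise ValueError("sha2-256 yields at most 32 digest bytes")
    digest = hashlib.sha256(data).digest()[:length]
    # Both header fields are varints; values below 0x80 fit in one byte each.
    return bytes([SHA2_256_CODE, length]) + digest


def split_multihash(multihash: bytes) -> Tuple[int, int, bytes]:
    code, length, digest = multihash[0], multihash[1], multihash[2:]
    if code >= 0x80 or length >= 0x80:
        raise ValueError("multi-byte varints are not handled in this sketch")
    if len(digest) != length:
        raise ValueError("digest length does not match the length field")
    return code, length, digest


full = multihash_sha2_256(b"")
truncated = multihash_sha2_256(b"", length=16)
print(full.hex())
print(truncated.hex())
```

A truncated digest changes only the length field, not the function code, so no extra registry entry is needed per algorithm/size pair.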

### 7.
> It's concerning that [the draft charter](https://msporny.github.io/charter-ietf-Multiformats/) states that
> "Changing current Multiformat header assignments in a way that breaks backward compatibility with production
> deployments" is out of scope. Normally IETF working groups are given free rein to make improvements during the
> standardization process.

This may be a distinction without a difference. We certainly could empower the working group to make
backwards-incompatible changes, but they will try not to have any unnecessary breaking changes.

### 8.
> Finally, as a member of the W3C DID and W3C Verifiable Credentials working groups, I will state that it is
> misleading for the draft charter to say that "The outputs from this Working Group are currently being used by … the
> W3C Verifiable Credentials Working Group, W3C Decentralized Identifiers Working Group…". The documents produced by
> these working groups intentionally contain no normative references to Multiformats or any data structures derived
> from them. Where they are referenced, it is explicitly stated that the references are non-normative.

This is a good note. The draft charter should probably be clear that
**Multiformats** are being used in Verifiable
Credentials and Decentralized Identifiers in production. There are multiple
existing independent implementations
of this technology enabling Verifiable Credentials and Decentralized
Identifiers to be useful. While these specs
contain no normative references, this registry provides the ability to make
Verifiable Credentials and Decentralized
Identifiers that are better decoupled from the data structures that they
contain, and will therefore be flexible in
the face of future evolution.