[Cbor] CBOR magic number, file format and tags

Michael Richardson <mcr+ietf@sandelman.ca> Wed, 20 January 2021 23:56 UTC

From: Michael Richardson <mcr+ietf@sandelman.ca>
To: cbor@ietf.org
CC: cose <cose@ietf.org>
In-Reply-To: <3C77CB5D-6AEA-4D70-96A2-3826DB8DAB18@island-resort.com>
References: <3C77CB5D-6AEA-4D70-96A2-3826DB8DAB18@island-resort.com>
Date: Wed, 20 Jan 2021 18:56:01 -0500
Message-ID: <10306.1611186961@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/o39ugku7I-w044xK3-WGhz6rCPQ>
Subject: [Cbor] CBOR magic number, file format and tags

Hi, I was thinking about this yesterday too, and after the discussion this
morning at COSE, I wrote:

         https://datatracker.ietf.org/doc/draft-richardson-cbor-file-magic/

which is at:
         https://github.com/mcr/cbor-magic-number


# Introduction

Since very early in computing, operating systems have sought ways to mark
which files could be processed by which programs.

For instance, the Unix file(1) command, which has existed since 1973
({{file}}), has been able to identify many file formats for decades.

...

As CBOR becomes a more and more common encoding for artifacts, identifying
them merely as "CBOR" is probably not useful.

This document provides a way to encode a magic number into the beginning of a CBOR format file.
Two options are presented, with the intention of standardizing only one.

These proposals are invasive to how CBOR protocols are written to disk, but in both cases, the
proposed envelope does not require that the tag be transferred on the wire.

Some protocols may benefit from having such a magic number on the wire if they
are presently using a different (legacy) encoding scheme, and need to determine,
before invoking a CBOR decoder, whether the sender is using the legacy scheme or the new CBOR scheme.

# Requirements for a Magic Number

A magic number is ideally a unique fingerprint, present in the first 4 or 8 bytes of the file,
which does not change when the content changes, and does not depend upon the length of the file.

Less ideal solutions have a pattern that needs to be matched, but in which some bytes need to be ignored.

# Proposal One

This proposal uses a CBOR Array of size two.
The first byte is therefore 0b100_00010 (0x82).

Array element number one is a CBOR integer in the range 0x80000000 to 0xffffffff.
This number is the magic number described below in {{magictable}}.

For a magic number 0x87654321, this results in a six-byte sequence:

~~~~
  0b100_00010 0b000_11010 0x87 0x65 0x43 0x21
~~~~

Array element number two is whatever the original CBOR content is supposed to be.
Due to the array construct with known size, there is no further syntax required.
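
As a rough illustration of Proposal One (not part of the draft), the wrapping
could be sketched in Python as below. The function names are hypothetical, and
the sketch assumes the caller passes exactly one already-encoded CBOR data item
as `payload`.

```python
import struct


def wrap_proposal_one(magic: int, payload: bytes) -> bytes:
    """Wrap one already-encoded CBOR data item as [magic, item]."""
    if not 0x80000000 <= magic <= 0xFFFFFFFF:
        raise ValueError("magic must be in the range 0x80000000..0xffffffff")
    # 0x82: major type 4 (array), length 2
    # 0x1a: major type 0 (unsigned integer), 4-byte argument follows
    return bytes([0x82, 0x1A]) + struct.pack(">I", magic) + payload


def has_magic_proposal_one(data: bytes, magic: int) -> bool:
    """Compare the first six bytes against the fixed prefix for this magic."""
    return data[:6] == bytes([0x82, 0x1A]) + struct.pack(">I", magic)


wrapped = wrap_proposal_one(0x87654321, b"\xf6")  # 0xf6 is CBOR null
print(wrapped.hex())  # 821a87654321f6
```

Because the prefix is a fixed six bytes, a file(1)-style matcher only needs a
byte comparison at offset zero and never has to run a CBOR decoder.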

# Proposal Two

This proposal uses a CBOR Sequence {{!RFC8742}}.

The first item in the sequence is a CBOR integer in the range 0x80000000 to 0xffffffff.
This number is the magic number described below in {{magictable}}.

For a magic number 0x87653412, this results in a five-byte sequence:

~~~~
  0b000_11010 0x87 0x65 0x34 0x12
~~~~

This is followed by one or more CBOR data items of whatever type was intended.
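
A similar rough sketch for Proposal Two (again, not from the draft): the magic
number is simply prepended as its own item in the sequence, and a reader strips
the five-byte prefix before handing the rest to a CBOR decoder. The function
names are hypothetical, and `items` is assumed to be one or more
already-encoded CBOR data items.

```python
import struct
from typing import Optional, Tuple


def prefix_proposal_two(magic: int, items: bytes) -> bytes:
    """Prepend the magic number as the first item of a CBOR Sequence."""
    if not 0x80000000 <= magic <= 0xFFFFFFFF:
        raise ValueError("magic must be in the range 0x80000000..0xffffffff")
    # 0x1a: major type 0 (unsigned integer), 4-byte argument follows
    return b"\x1a" + struct.pack(">I", magic) + items


def split_proposal_two(data: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (magic, remaining bytes), or None if no magic prefix is present."""
    if len(data) < 5 or data[0] != 0x1A:
        return None
    (magic,) = struct.unpack(">I", data[1:5])
    if magic < 0x80000000:
        return None
    return magic, data[5:]


blob = prefix_proposal_two(0x87653412, b"\xf6")  # 0xf6 is CBOR null
print(blob.hex())  # 1a87653412f6
```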

... and then some variations.

(I probably had more important work I should have been doing)

--
Michael Richardson <mcr+IETF@sandelman.ca>   . o O ( IPv6 IøT consulting )
           Sandelman Software Works Inc, Ottawa and Worldwide