[Cbor] Base-8 OIDs (or, where not to optimize)

Carsten Bormann <cabo@tzi.org> Thu, 03 September 2020 21:02 UTC

Return-Path: <cabo@tzi.org>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost []) by ietfa.amsl.com (Postfix) with ESMTP id 292A73A0F21 for <cbor@ietfa.amsl.com>; Thu, 3 Sep 2020 14:02:14 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.897
X-Spam-Status: No, score=-1.897 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([]) by localhost (ietfa.amsl.com []) (amavisd-new, port 10024) with ESMTP id 6P4fhwYgNIYo for <cbor@ietfa.amsl.com>; Thu, 3 Sep 2020 14:02:11 -0700 (PDT)
Received: from gabriel-vm-2.zfn.uni-bremen.de (gabriel-vm-2.zfn.uni-bremen.de []) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 1094C3A0DEC for <cbor@ietf.org>; Thu, 3 Sep 2020 14:02:11 -0700 (PDT)
Received: from [] (p5089ae91.dip0.t-ipconnect.de []) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by gabriel-vm-2.zfn.uni-bremen.de (Postfix) with ESMTPSA id 4BjCt50FdJzyS1; Thu, 3 Sep 2020 23:02:05 +0200 (CEST)
From: Carsten Bormann <cabo@tzi.org>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Mao-Original-Outgoing-Id: 620859724.429252-ab0e9e01f50139c423a779aaf3df2f9b
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.\))
Date: Thu, 3 Sep 2020 23:02:04 +0200
Message-Id: <14A57296-1D6E-4771-A59C-623CE689C2D2@tzi.org>
To: cbor@ietf.org
X-Mailer: Apple Mail (2.3608.
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/9owRyOcXdsK7Ooc1S3D9-AImDdU>
Subject: [Cbor] Base-8 OIDs (or, where not to optimize)
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Sep 2020 21:02:14 -0000

As a little side story about my previous message, let me explain an optimization to draft-ietf-cbor-tags-oid that I am NOT going to propose, but that would be really swell, save some 20 % of *everything* in real use cases, and be implementable on a single screen of code (well, maybe that needs to be a 4K screen then).

ASN.1 OIDs are sequences of numbers that are represented in BER using base-128 arithmetic, put into bytes so that the 8th bit of each 7-bit digit tells you whether another digit follows (documented also in RFC 6256).
ASN.1 BER has a weird special-purpose optimization for the start of an OID, where only the numbers 0, 1, 2 are currently valid: That number is multiplied by 40 and added to the next number (which is reversible because after a 0 or a 1 only numbers < 40 can follow — not so for 2, by the way).

Well, if you look at how we actually are using OIDs, you will notice that there are a lot more places where very small numbers are predominantly used.

So how about getting rid of the special-purpose optimization, and apply the following more general optimization: represent OIDs as sequences of nibbles (4-bit units) instead of bytes, and use base 8 instead of base 128 for what is otherwise exactly the same scheme (**).  This is obviously a (mild) pessimization for numbers between 64 and 127 (1 byte to three nibbles), 4096 and 16383 (two bytes to five nibbles), etc.  But so many other components of an OID get so much shorter.  The effective number of bits per byte for large numbers shrinks from 7 to 6 (3+3), but because these OID component numbers actually are distributed in a way that is rather close to Zipf’s law [1], the actual number of bytes needed per OID number goes down from a little more than 1 to much less than 1.

If I count right, the popular OID has 9 places that are now one nibble shorter and one that is 1 nibble longer, so we go from 13 bytes (with special case for first number) to 18 nibbles (without special case), i.e., 9 bytes.

Also, we get rid of the wart for absolute OIDs and need only one type of encoding for both absolute and relative OIDs.  (Well, maybe we can ignore this simplification if it is desired to stay closer to ASN.1 BER.)

((**) We have to add a much smaller wart when the sequence of nibbles turns out to be an odd number of nibbles — that half-unused last byte would then get some padding that cannot otherwise occur in a last nibble, e.g., 0b1000, to indicate that the last nibble is not in use.)

It took approximately one morning shower (*) to come up with this scheme; I then spent what was probably a couple more hours to write and run some code against a not quite trivial SNMP trace I happened to have lying around and got some 20+ % overall size reduction (yes, that includes some actual unoptimized data beyond all those OIDs).  I did not take the time to write this up then (six years ago, when we started preparing draft-bormann-cbor-tags-oid-00).

So there we are.  There is a simple, good, easily implementable, and rather elegant, way to fix the representation of ASN.1 OIDs.
Looking at the clunky base-128 BER way now creates appreciable cognitive dissonance for me.

$64000 question: Why am I not proposing that we standardize this optimization?  

Grüße, Carsten

(*) A unit of innovation that is quite well-known to quite a few people I’ve been working with.

(**) see text above

[1]: https://en.wikipedia.org/wiki/Zipf%27s_law