[Cbor] 7049bis: Diagnostic notation gaps

Carsten Bormann <cabo@tzi.org> Fri, 11 September 2020 20:46 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/7K5f0rJ-08MTg4C8s8rIwKUldOU>

Over at https://github.com/cbor-wg/CBORbis/issues/204, we have had an interesting discussion about closing some gaps in diagnostic notation coverage (there is also a proposal in https://github.com/cbor-wg/CBORbis/pull/205, which I’m not sure how to handle).

Basically, #204 stipulates that there should be diagnostic notation for all well-formed encoded CBOR data items, which makes a lot of sense to me.  It then identifies two gaps:

* (_ ) is ambiguous: it could denote either 0x5fff (an indefinite-length byte string with no chunks) or 0x7fff (an indefinite-length text string with no chunks).  The proposal in #205 is to disambiguate this by adding a character: (_b ) vs. (_t ).

It occurred to me that a better way to represent these two encoded CBOR data items would be to use ‘’_ and “”_ (please excuse the smart quotes).  This is already allowed by the text in RFC 7049; we would just need to point it out by adding a sentence to the last paragraph of § 8.1.  Note that in RFC 7049 these notations are in principle ambiguous, e.g. ‘’_ could be (_ ) (0x5fff) or (_ ‘’) or (_ ‘’ ‘’) and so on; we would probably just clarify that they are to be used for (_ ) only, as the other cases already have good diagnostic notation.
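
To make the two encodings concrete, here is a minimal Python sketch that builds both byte sequences from the major type and the indefinite-length marker. The function name and the mapping table are hypothetical, made up for illustration only; the notation strings use straight quotes and simply mirror the suggestion above, not any agreed text.

```python
BREAK = 0xFF      # "break" stop code that terminates an indefinite-length item
INDEFINITE = 31   # additional-information value meaning "indefinite length"


def empty_indefinite_string(major_type: int) -> bytes:
    """Encode an indefinite-length string with no chunks.

    major_type 2 is a byte string, major_type 3 is a text string.
    (Hypothetical helper, not taken from any CBOR library.)
    """
    if major_type not in (2, 3):
        raise ValueError("only byte strings (2) and text strings (3) are chunked")
    initial_byte = (major_type << 5) | INDEFINITE
    return bytes([initial_byte, BREAK])


# Assumed mapping from encoded item to the notation suggested in this message.
PROPOSED_NOTATION = {
    empty_indefinite_string(2): "''_",   # 0x5fff
    empty_indefinite_string(3): '""_',   # 0x7fff
}

if __name__ == "__main__":
    for encoded, notation in PROPOSED_NOTATION.items():
        print(encoded.hex(), notation)
```

The two items differ only in the major type bits of the initial byte, which is exactly the information that (_ ) drops.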

* There is only one NaN in diagnostic notation, but there are many NaN values in IEEE 754.  (This is a bit like JavaScript, where the core language only exposes one NaN value, even if extensions such as array buffers give the programmer full access to all of them.)  #205 suggests a notation:

> The IEEE 754 representation of NaN carries a "payload" of up to 54 bits,
> not all of which may be zero, so we allow an encoding indicator to specify
> the exact hex representation. Thus the standard half-precision NaN may be
> represented as `NaN`, `NaN_1`, or `NaN_1_x7E00`, while a single-precision
> NaN with an all-ones payload is represented as `NaN_2_x7FFFFFFF`. Items
> like `NaN_1_x0000` or `NaN_1_x7C00` do not encode NaN in the hex
> representation and so are not valid diagnostic notation.

(The payload here is the sign bit and the 11/24/53 significand [sic!] bits; in the binaryxx formats there is an intervening exponent of 5/8/11 bits, which must be all ones, and the significand may not be all zeroes, since those are the bit patterns of the two infinities; so only ~ 53.9999999999999998398 bits are actually available.)

I think this is ugly as hell, but it also is the best proposal so far.
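
The validity rule in the quoted text (the hex digits must actually encode a NaN) can be sketched as follows. This is only an illustration under stated assumptions: the function and table names are hypothetical, the table uses the stored field widths of the IEEE 754 binary16/32/64 formats (10/23/52 trailing significand bits, i.e. without the implicit leading bit), and _3 for double precision follows the existing encoding indicators rather than anything in the quote.

```python
# Encoding-indicator digit -> (exponent bits, stored significand bits).
# _1 = binary16, _2 = binary32, _3 = binary64.
FORMATS = {1: (5, 10), 2: (8, 23), 3: (11, 52)}


def is_nan_bits(indicator: int, hex_digits: str) -> bool:
    """Return True if hex_digits is a NaN bit pattern for the given width.

    Hypothetical helper: a NaN has an all-ones exponent and a non-zero
    stored significand; the sign bit is ignored.
    """
    exp_bits, frac_bits = FORMATS[indicator]
    total_bits = 1 + exp_bits + frac_bits
    if len(hex_digits) * 4 != total_bits:
        raise ValueError(f"expected {total_bits // 4} hex digits")
    value = int(hex_digits, 16)
    exponent = (value >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = value & ((1 << frac_bits) - 1)
    return exponent == (1 << exp_bits) - 1 and fraction != 0


if __name__ == "__main__":
    for indicator, digits in [(1, "7E00"), (2, "7FFFFFFF"), (1, "0000"), (1, "7C00")]:
        print(f"NaN_{indicator}_x{digits}:", is_nan_bits(indicator, digits))
```

With this check, 7E00 and 7FFFFFFF are accepted, while 0000 (zero) and 7C00 (positive infinity) are rejected, matching the examples in the quote.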

My questions to the WG are:

1. Are we ready to go for ‘’_ and “”_, with the clarification to be added?

2. Do we believe that this is the right way to handle NaN payloads?
   Should we add this to 7049bis (which would be new functionality) or should we do this in a separate document?
   (We did this for other diagnostic notation extensions, which are listed in RFC 8610.)

Note that diagnostic notation is qualified as “not for interchange”, but its use for testing in tools and for communication in specification documents makes it rather desirable to nail it down at a good level of specificity.

Regards, Carsten