[Cbor] NaN payload notation (Re: 7049bis: Diagnostic notation gaps)

Carsten Bormann <cabo@tzi.org> Wed, 16 September 2020 12:08 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <1973898.N1gx0QA8IB@tjmaciei-mobl1>
Date: Wed, 16 Sep 2020 14:08:03 +0200
Cc: cbor@ietf.org
Message-Id: <4933A00D-CD85-405D-BDEB-10F06C6E4673@tzi.org>
References: <2766F4E6-0E67-472B-8BFA-75C529F4EE80@tzi.org> <1973898.N1gx0QA8IB@tjmaciei-mobl1>
To: Thiago Macieira <thiago.macieira@intel.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/bKaZKPV0tBcatfSJ8o2ppxdcJ6I>

Obviously, we need to add something to enable diagnostic notation to represent NaN payloads.

First of all, I wouldn’t mind doing this in a separate document, so we don’t tie getting experience with such a notation into the “real soon now” schedule for 7049bis.

Discussing the technical merits of the various proposals:

We should make sure that the parsing doesn’t become complicated; that was one reason why having the “x” in there might be helpful.

Also, I would like the values that do not carry a _ to be independent of length, so “nan” would continue to stand for 0x7e00, 0x7fc00000, etc.
(Note that there are some people that like those to stand for 0x7fff/0x7fffffff — one reason is that the negative NaN becomes 0xffffffff, which is also the result of some SIMD operations.)
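
As an illustration of a length-independent "nan", here is a minimal Python sketch that derives the bit pattern for each float size. The function name `plain_nan_bits` and the `all_ones` switch are made up for this example; the default follows the quiet-NaN patterns from Appendix A of the CBOR spec (0x7e00, 0x7fc00000, 0x7ff8000000000000), and the alternative follows the all-ones preference mentioned above.

```python
# Stored fraction (trailing significand) bits per float size in bytes.
FRACTION_BITS = {2: 10, 4: 23, 8: 52}


def plain_nan_bits(size: int, all_ones: bool = False) -> int:
    """Bit pattern that a bare "nan" could denote for a float of `size` bytes.

    Hypothetical helper: all_ones=False gives the quiet NaN with only the
    top fraction bit set; all_ones=True gives 0x7fff, 0x7fffffff, ...
    """
    total = size * 8
    frac = FRACTION_BITS[size]
    if all_ones:
        return (1 << (total - 1)) - 1
    exponent = ((1 << (total - 1 - frac)) - 1) << frac  # all-ones exponent
    quiet_bit = 1 << (frac - 1)                         # top fraction bit
    return exponent | quiet_bit


for size in (2, 4, 8):
    print(size, hex(plain_nan_bits(size)), hex(plain_nan_bits(size, all_ones=True)))
```

With the all-ones variant, setting the sign bit on the 4-byte pattern yields 0xffffffff, which is the SIMD result referred to above.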

I like the approach of enabling the dumping/loading of floating point values without understanding floating point at all.  Hex floats (part of EDN in RFC 8610) go a long way, but specifically do not address NaNs.  If we use something that looks like a number for that (0x…), we need to require length indicators (_1 _2 _3).  Maybe dumping the whole item in hex, including the head (f9/fa/fb), is the most versatile extension of DN/EDN that we can make.  If we go this way, we probably should make up a somewhat jarring syntax so this becomes readily visible and is visually *very* distinct from hexadecimal CBOR-in-a-byte-string (h’f97c00’).
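
To make the "dump the whole item in hex, including the head" idea concrete, here is a rough Python sketch that treats the float purely as an integer bit pattern. The names `dump_float_item` and `load_float_item` are hypothetical, and the sketch deliberately produces only the bare hex digits; whatever surrounding syntax would mark this in DN/EDN is left open, as it is above.

```python
# CBOR initial bytes for half, single and double precision floats.
HEADS = {2: 0xF9, 4: 0xFA, 8: 0xFB}
SIZES = {head: size for size, head in HEADS.items()}


def dump_float_item(bits: int, size: int) -> str:
    """Hex digits of the complete encoded item: head byte plus raw bits."""
    return bytes([HEADS[size]]).hex() + bits.to_bytes(size, "big").hex()


def load_float_item(text: str) -> tuple[int, int]:
    """Inverse of dump_float_item: returns (bits, size in bytes)."""
    raw = bytes.fromhex(text)
    if not raw or raw[0] not in SIZES:
        raise ValueError("not a CBOR float head (f9/fa/fb)")
    size = SIZES[raw[0]]
    if len(raw) != 1 + size:
        raise ValueError("length does not match the head byte")
    return int.from_bytes(raw[1:], "big"), size


print(dump_float_item(0x7E00, 2))      # quiet NaN, half precision
print(load_float_item("fa7fc00001"))   # NaN with a payload bit set
```

Neither function needs to know anything about exponents or significands, which is the point of this approach.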

Grüße, Carsten


> On 2020-09-11, at 23:30, Thiago Macieira <thiago.macieira@intel.com> wrote:
> 
> On Friday, 11 September 2020 13:46:10 PDT Carsten Bormann wrote:
>> * (_ ) is ambiguous whether it refers to 0x5fff (an indefinite length byte
>> string with no chunks) or 0x7fff (an indefinite length text string with no
>> chunks).  The proposal in #205 is to disambiguate this by adding
>> characters, (_b ) vs. (_t ).
>> 
>> It occurred to me that a better way to represent these two encoded CBOR data
>> items would be to use ‘’_ and “”_ (please excuse the smart quotes).  This
>> is already allowed by the text in RFC 7049; we would just need to point it
>> out by adding a sentence to the last paragraph of § 8.1.  Note that in RFC
>> 7049 these notations are in principle ambiguous, i.e. ‘’_ could be a (_ )
>> (0x5fff) or (_ ‘’) or (_ ‘’ ‘’) and so on; we would probably just clarify
>> that these are to be used for (_ ) only as the other ones already have good
>> diagnostic notation.
> 
> That would mean needing to track whether the payload was empty or not 
> and backtrack. A diagnostic printer currently can see the 0x5f or 0x7f and 
> print "(_ ", then continue on. In order to print ""_ for 0x7fff, it needs to 
> hold back the opening parenthesis until it has seen the next byte.
> 
> The simplest solution is to add the "b" or "t" unconditionally.
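
The streaming argument can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `print_indefinite_string` is invented, only chunks shorter than 24 bytes are handled, and text chunks are not escaped. The opening "(_b " or "(_t " is chosen from the first byte alone, so no lookahead or backtracking is needed.

```python
def print_indefinite_string(data: bytes) -> str:
    """Diagnostic text for an indefinite-length byte or text string.

    Sketch only: chunks must be definite-length with length < 24.
    """
    head = data[0]
    if head not in (0x5F, 0x7F):
        raise ValueError("not an indefinite-length string")
    # The opener depends only on the first byte, so it can be emitted at once.
    opener = "(_b " if head == 0x5F else "(_t "
    major = head >> 5
    chunks = []
    pos = 1
    while data[pos] != 0xFF:  # 0xff is the break byte
        initial = data[pos]
        length = initial & 0x1F
        if initial >> 5 != major or length >= 24:
            raise ValueError("chunk form not handled by this sketch")
        payload = data[pos + 1 : pos + 1 + length]
        if major == 2:
            chunks.append("h'" + payload.hex() + "'")
        else:
            chunks.append('"' + payload.decode("utf-8") + '"')
        pos += 1 + length
    return opener + ", ".join(chunks) + ")"


print(print_indefinite_string(bytes.fromhex("5fff")))
print(print_indefinite_string(bytes.fromhex("7f657374726561646d696e67ff")))
```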
> 
>> (The payload here is the sign bit and the 10/23/52 stored significand [sic!] bits;
>> in the binaryxx formats there is an exponent intervening of 5/8/11 bits
>> which must be all ones, and the significand may not be all zeroes, since those
>> are the values for the two infinities, so for binary64 only
>> ~ 52.9999999999999997 bits are actually available.)
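
As a quick check of how much room the payload offers, the following Python sketch counts the distinct NaN bit patterns per format. It assumes the payload consists of the sign bit plus the stored fraction bits only (10, 23 and 52 bits for binary16/32/64, without the implicit leading bit); `nan_pattern_count` is a name made up for this example.

```python
# (exponent bits, stored fraction bits) per IEEE 754 interchange format.
FORMATS = {"binary16": (5, 10), "binary32": (8, 23), "binary64": (11, 52)}


def nan_pattern_count(fraction_bits: int) -> int:
    """Number of NaN bit patterns: two signs, any non-zero fraction.

    The all-zero fraction is excluded because, with an all-ones exponent,
    it encodes the two infinities.
    """
    return 2 * ((1 << fraction_bits) - 1)


for name, (_exponent_bits, fraction_bits) in FORMATS.items():
    print(name, nan_pattern_count(fraction_bits))
```

Under that assumption the counts fall just short of 2^11, 2^24 and 2^53, i.e. just under 11, 24 and 53 bits.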
>> 
>> I think this is ugly as hell, but it also is the best proposal so far.
> 
> It would be far easier to dump the binary representation of the entire 
> floating point number, including the exponent bits. If this format were 
> allowed, a diagnostic printer with no support for stringifying floating point 
> numbers could use it for other values too.
> 
> Second option would be to use it like gdb prints:
> 
> 	-nan(0x20000)
> 
> The type information can be encoded as either "nanf" (float) or "nanf16" (for 
> _Float16).
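
A possible rendering of this gdb-like option, as a small Python sketch. The function name `format_nan_gdb_style` is hypothetical, and the exact placement of the sign and of the "f"/"f16" suffix is an assumption made for illustration; only the names "nan", "nanf" and "nanf16" come from the text above.

```python
# size in bytes -> (stored fraction bits, type name used in the notation)
LAYOUT = {2: (10, "nanf16"), 4: (23, "nanf"), 8: (52, "nan")}


def format_nan_gdb_style(bits: int, size: int) -> str:
    """Render a NaN bit pattern as e.g. "-nanf(0x20000)"."""
    fraction_bits, name = LAYOUT[size]
    total = size * 8
    exponent_mask = ((1 << (total - 1 - fraction_bits)) - 1) << fraction_bits
    fraction = bits & ((1 << fraction_bits) - 1)
    if (bits & exponent_mask) != exponent_mask or fraction == 0:
        raise ValueError("bit pattern is not a NaN")
    sign = "-" if bits >> (total - 1) else ""
    return f"{sign}{name}(0x{fraction:x})"


print(format_nan_gdb_style(0xFFF0000000020000, 8))
print(format_nan_gdb_style(0xFF820000, 4))
```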
> 
> If parentheses are not a good idea, then a third option is to print only the 
> significand bits, dropping the "0x" too. It's clearly hexadecimal, we don't 
> need the "x". In that case, we'd use the _ modification to indicate encoding 
> length, as "-nan_001_1" for a negative 16-bit signalling NaN.
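
To see how simple parsing the parenthesis-free form could be, here is a rough Python sketch that turns strings such as "-nan_001_1" back into a bit pattern. The name `parse_nan` and the strictness of the regular expression are assumptions for this example; the mapping of _1/_2/_3 to 16/32/64 bits follows the usual encoding indicators.

```python
import re

# encoding indicator -> (size in bytes, stored fraction bits)
SIZES = {"1": (2, 10), "2": (4, 23), "3": (8, 52)}
NAN_RE = re.compile(r"(-?)nan_([0-9a-fA-F]+)_([123])")


def parse_nan(text: str) -> tuple[int, int]:
    """Parse e.g. "-nan_001_1" into (bit pattern, size in bytes)."""
    match = NAN_RE.fullmatch(text)
    if match is None:
        raise ValueError("not in nan_<hex>_<indicator> form")
    sign, digits, indicator = match.groups()
    size, fraction_bits = SIZES[indicator]
    fraction = int(digits, 16)
    if fraction == 0 or fraction >> fraction_bits:
        raise ValueError("fraction must be non-zero and fit the format")
    total = size * 8
    exponent = ((1 << (total - 1 - fraction_bits)) - 1) << fraction_bits
    bits = exponent | fraction
    if sign:
        bits |= 1 << (total - 1)
    return bits, size


print(parse_nan("-nan_001_1"))
```

Because "_" is not a hexadecimal digit, the hex run and the indicator separate unambiguously even without an "x".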
> 
> Either way, the document should advise what to do when it comes to signalling 
> / quiet NaNs. The CBOR spec recommends that only the IEEE-recommended 
> non-signalling form be used:
> 
> "If NaN is an allowed value, it must always be represented as 0xf97e00."
> 
> The table in Appendix A also lists 0xfa7fc00000 and 0xfb7ff8000000000000, 
> which match the IEEE recommendations for QNaN. But the Wikipedia page warns 
> that some (older?) machines invert it and encode signalling NaN with the 
> topmost mantissa bit set.
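
For reference, the IEEE-recommended convention can be checked with a few lines of Python. This sketch assumes the input is already known to be a NaN and that the top stored fraction bit set means "quiet"; machines using the inverted convention mentioned above would need the opposite test. `is_quiet_nan` is a name chosen for this example.

```python
# Stored fraction bits per float size in bytes.
FRACTION_BITS = {2: 10, 4: 23, 8: 52}


def is_quiet_nan(bits: int, size: int) -> bool:
    """True if the top fraction bit is set (quiet under the IEEE recommendation).

    The caller must already have established that `bits` is a NaN.
    """
    top_fraction_bit = FRACTION_BITS[size] - 1
    return bool((bits >> top_fraction_bit) & 1)


print(is_quiet_nan(0x7E00, 2), is_quiet_nan(0x7C01, 2))
```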
> 
> I would recommend that it print "nan" only if it's one of those three. And if 
> it's distinguishing the payload length, "nan" is only for the double-precision 
> case, with the others requiring "nanf" or "nanf16" (if using the variant with 
> parentheses); if not using parentheses, that would be "nan_1" for the 16-bit 
> case, with "nan_2" and "nan_3" for the single- and double-precision ones 
> respectively.
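
The first part of this recommendation amounts to a simple membership test on the encoded item, sketched here in Python. The helper name `prints_as_plain_nan` is made up; the three encodings are the ones quoted above from the CBOR spec and its Appendix A.

```python
# The three NaN encodings listed in the CBOR spec: half, single, double.
PLAIN_NAN_ITEMS = {"f97e00", "fa7fc00000", "fb7ff8000000000000"}


def prints_as_plain_nan(item: bytes) -> bool:
    """True if the encoded float item is one of the three listed quiet NaNs."""
    return item.hex() in PLAIN_NAN_ITEMS


print(prints_as_plain_nan(bytes.fromhex("f97e00")))
print(prints_as_plain_nan(bytes.fromhex("f97e01")))
```

Any other NaN encoding would then fall through to whichever payload-carrying notation is chosen.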
> 
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel DPG Cloud Engineering
> 