[Cbor] NaN payload notation (Re: 7049bis: Diagnostic notation gaps)

Obviously, we need to add something to enable diagnostic notation to represent NaN payloads.

First of all, I wouldn’t mind doing this in a separate document, so that gaining experience with such a notation isn’t tied to the “real soon now” schedule for 7049bis.

Discussing the technical merits of the various proposals:

We should make sure that the parsing doesn’t become complicated; that was one reason why having the “x” in there might be helpful.

Also, I would like the values that do not carry a _ to be independent of length, so “nan” would continue to stand for 0x7e00, 0x7fc00000, etc.
(Note that some people prefer those to stand for 0x7fff/0x7fffffff instead; one reason is that the negative NaN then becomes 0xffffffff, which is also the result of some SIMD operations.)
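
To make the two conventions concrete, here is a minimal Python sketch that reinterprets raw bit patterns as IEEE 754 values and checks that both the usual quiet-NaN patterns and the all-ones patterns are NaNs at each width. The helper name bits_to_float and the two tables are made up for this illustration; they are not part of any CBOR library or of any proposal in this thread.

```python
import math
import struct

# Per width: (struct format for the float, format for a same-width unsigned int).
FORMATS = {16: (">e", ">H"), 32: (">f", ">I"), 64: (">d", ">Q")}

# Hypothetical tables for this illustration only.
QUIET_NAN = {16: 0x7E00, 32: 0x7FC00000, 64: 0x7FF8000000000000}
ALL_ONES_NAN = {16: 0x7FFF, 32: 0x7FFFFFFF, 64: 0x7FFFFFFFFFFFFFFF}


def bits_to_float(bits, width):
    """Reinterpret an unsigned integer bit pattern as a float of that width."""
    float_fmt, int_fmt = FORMATS[width]
    return struct.unpack(float_fmt, struct.pack(int_fmt, bits))[0]


def set_sign(bits, width):
    """Set the sign bit, giving the 'negative' variant of a pattern."""
    return bits | (1 << (width - 1))


for width in (16, 32, 64):
    # Both conventions yield NaNs at every width.
    assert math.isnan(bits_to_float(QUIET_NAN[width], width))
    assert math.isnan(bits_to_float(ALL_ONES_NAN[width], width))

# The negative all-ones binary32 NaN is simply all bits set.
print(hex(set_sign(ALL_ONES_NAN[32], 32)))
```

The last line shows why the all-ones convention is attractive to some: setting the sign bit leaves a pattern with every bit set.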

I like the approach of enabling the dumping/loading of floating point values without understanding floating point at all.  Hex floats (part of EDN in RFC 8610) go a long way, but specifically do not address NaNs.  If we use something that looks like a number for that (0x…), we need to require length indicators (_1 _2 _3).  Maybe dumping the whole item in hex, including the head (f9/fa/fb), is the most versatile extension of DN/EDN that we can make.  If we go this way, we probably should make up a somewhat jarring syntax so this becomes readily visible and is visually *very* distinct from hexadecimal CBOR-in-a-byte-string (h’f97c00’).
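
As a rough illustration of the "dump the whole item in hex, including the head" idea, the following Python sketch round-trips a floating point item purely as bytes, never interpreting the IEEE 754 fields. The function names dump_item and load_item are hypothetical, and the sketch deliberately stops at the bare hex digits: no concrete surrounding syntax has been agreed on here, so none is shown.

```python
# Initial byte of a CBOR floating point item -> width of the value in bits.
HEAD_TO_WIDTH = {0xF9: 16, 0xFA: 32, 0xFB: 64}
WIDTH_TO_HEAD = {width: head for head, width in HEAD_TO_WIDTH.items()}


def dump_item(bits, width):
    """Return the hex of the complete item: head byte plus raw value bits.

    'bits' is the unsigned integer bit pattern; it is copied verbatim,
    so NaN payloads and the sign bit survive unchanged.
    """
    raw = bytes([WIDTH_TO_HEAD[width]]) + bits.to_bytes(width // 8, "big")
    return raw.hex()


def load_item(hex_text):
    """Inverse of dump_item: return (width, bits) from the hex of an item."""
    raw = bytes.fromhex(hex_text)
    if not raw or raw[0] not in HEAD_TO_WIDTH:
        raise ValueError("not a floating point item")
    width = HEAD_TO_WIDTH[raw[0]]
    if len(raw) != 1 + width // 8:
        raise ValueError("wrong length for this head")
    return width, int.from_bytes(raw[1:], "big")


# A binary32 NaN with a non-zero payload round-trips bit for bit.
text = dump_item(0x7FC00001, 32)
print(text)
print(load_item(text) == (32, 0x7FC00001))
```

Because the head byte carries the length, no separate length indicator is needed in this form, which is part of what makes it versatile.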

Grüße, Carsten

> On 2020-09-11, at 23:30, Thiago Macieira <thiago.macieira@intel.com> wrote:
> 
> On Friday, 11 September 2020 13:46:10 PDT Carsten Bormann wrote:
>> * (_ ) is ambiguous whether it refers to 0x5fff (an indefinite length byte
>> string with no chunks) or 0x7fff (an indefinite length text string with no
>> chunks).  The proposal in #205 is to disambiguate this by adding
>> characters, (_b ) vs. (_t ).
>> 
>> It occurred to me that a better way to represent these two encoded CBOR data
>> items would be to use ‘’_ and “”_ (please excuse the smart quotes).  This
>> is already allowed by the text in RFC 7049; we would just need to point it
>> out by adding a sentence to the last paragraph of § 8.1.  Note that in RFC
>> 7049 these notations are in principle ambiguous, i.e. ‘’_ could be a (_ )
>> (0x5fff) or (_ ‘’) or (_ ‘’ ‘’) and so on; we would probably just clarify
>> that these are to be used for (_ ) only as the other ones already have good
>> diagnostic notation.
> 
> That would mean needing to keep a state whether the payload was empty or not 
> and backtrack. A diagnostic printer currently can see the 0x5f or 0x7f and 
> print "(_ ", then continue on. In order to print ""_ for 0x7fff, it needs to 
> forego printing the parenthesis opening.
> 
> The simplest solution is to add the "b" or "t" unconditionally.
> 
>> (The payload here is the sign bit and the 11/24/53 significand [sic!] bits;
>> in the binaryxx formats there is an exponent intervening of 5/8/11 bits
>> which must be all ones, and the significand may not be all zeroes — those
>> are the values for the two infinities, so only ~ 53.9999999999999998398
>> bits are actually available.)
>> 
>> I think this is ugly as hell, but it also is the best proposal so far.
> 
> It would be far easier to dump the binary representation of the entire 
> floating point number, including the exponent bits. If this format were 
> allowed, a diagnostic printer with no support for stringifying floating point 
> numbers could use it for other values too.
> 
> Second option would be to use it like gdb prints:
> 
> 	-nan(0x20000)
> 
> The type information can be encoded as either "nanf" (float) or "nanf16" (for 
> _Float16).
> 
> If parentheses are not a good idea, then a third option is to print only the 
> significand bits, dropping the "0x" too. It's clearly hexadecimal, we don't 
> need the "x". In that case, we'd use the _ modification to indicate encoding 
> length, as "-nan_001_1" for a negative 16-bit signalling NaN.
> 
> Either way, the document should advise what to do when it comes to signalling 
> / quiet NaNs. The CBOR spec recommends that only the IEEE-recommended 
> non-signalling form be used:
> 
> "If NaN is an allowed value, it must always be represented as 0xf97e00."
> 
> The table in Appendix A also lists 0xfa7fc00000 and 0xfb7ff8000000000000, 
> which match the IEEE recommendations for QNaN. But the Wikipedia page warns 
> that some (older?) machines invert it and encode signalling NaN with the 
> topmost mantissa bit set.
> 
> I would recommend that it print "nan" only if it's one of those three. And if 
> it's distinguishing the payload length, "nan" is only for the double-precision 
> case, the others requiring "nanf" or "nanf16" (if using the variant with 
> parentheses), or for the 16-bit with "nan_2" and "nan_3" for the single- and 
> double-precision ones respectively if not using parentheses.
> 
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel DPG Cloud Engineering