Re: [Cbor] 7049bis: Diagnostic notation gaps

Thiago Macieira <thiago.macieira@intel.com> Fri, 11 September 2020 21:30 UTC

From: Thiago Macieira <thiago.macieira@intel.com>
To: cbor@ietf.org
Date: Fri, 11 Sep 2020 14:30:24 -0700
Message-ID: <1973898.N1gx0QA8IB@tjmaciei-mobl1>
Organization: Intel Corporation
In-Reply-To: <2766F4E6-0E67-472B-8BFA-75C529F4EE80@tzi.org>
References: <2766F4E6-0E67-472B-8BFA-75C529F4EE80@tzi.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/HJOv0r6IcVe1yHO5BMMd1D6Xqck>
Subject: Re: [Cbor] 7049bis: Diagnostic notation gaps

On Friday, 11 September 2020 13:46:10 PDT Carsten Bormann wrote:
> * (_ ) is ambiguous whether it refers to 0x5fff (an indefinite length byte
> string with no chunks) or 0x7fff (an indefinite length text string with no
> chunks).  The proposal in #205 is to disambiguate this by adding
> characters, (_b ) vs. (_t ).
> 
> It occurred to me that a better way to represent these two encoded CBOR data
> items would be to use ‘’_ and “”_ (please excuse the smart quotes).  This
> is already allowed by the text in RFC 7049; we would just need to point it
> out by adding a sentence to the last paragraph of § 8.1.  Note that in RFC
> 7049 these notations are in principle ambiguous, i.e. ‘’_ could be a (_ )
> (0x5fff) or (_ ‘’) or (_ ‘’ ‘’) and so on; we would probably just clarify
> that these are to be used for (_ ) only as the other ones already have good
> diagnostic notation.

That would mean the printer has to track whether the payload is empty and
backtrack if it is. A diagnostic printer can currently see the 0x5f or 0x7f,
print "(_ ", and continue. To print ""_ for 0x7fff, it would have to forgo
printing the opening parenthesis until it knows whether any chunks follow.

The simplest solution is to add the "b" or "t" unconditionally.
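
To illustrate the point about not needing lookahead, here is a minimal Python
sketch of a printer that emits the opening token as soon as it sees the head
byte. The function name is made up for this example, the "(_b " / "(_t "
spellings are the ones from the #205 proposal, and the sketch only handles
short definite-length chunks (length below 24) without escaping text.

```python
def diag_indefinite_string(data: bytes) -> str:
    """Render an indefinite-length CBOR string in diagnostic notation.

    Hypothetical helper: the opening token is appended immediately after
    reading the head byte, so no state about "empty or not" is kept.
    """
    out = []
    head = data[0]
    if head == 0x5F:          # indefinite-length byte string
        out.append("(_b ")
        major = 2
    elif head == 0x7F:        # indefinite-length text string
        out.append("(_t ")
        major = 3
    else:
        raise ValueError(f"not an indefinite-length string head: {head:#04x}")

    pos = 1
    first = True
    while True:
        if pos >= len(data):
            raise ValueError("missing break byte (0xff)")
        initial = data[pos]
        if initial == 0xFF:   # break: end of the chunk list
            break
        length = initial & 0x1F
        if initial >> 5 != major or length >= 24:
            raise ValueError("sketch only handles short chunks of the same type")
        chunk = data[pos + 1 : pos + 1 + length]
        if len(chunk) != length:
            raise ValueError("truncated chunk")
        if not first:
            out.append(", ")
        if major == 2:
            out.append("h'" + chunk.hex() + "'")
        else:
            out.append('"' + chunk.decode("utf-8") + '"')  # no escaping here
        first = False
        pos += 1 + length

    out.append(")")
    return "".join(out)
```

With this shape, 0x5fff and 0x7fff come out as "(_b )" and "(_t )" without
any backtracking.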

> (The payload here is the sign bit and the 11/24/53 significand [sic!] bits;
> in the binaryxx formats there is an exponent intervening of 5/8/11 bits
> which must be all ones, and the significand may not be all zeroes — those
> are the values for the two infinities, so only ~ 53.9999999999999998398
> bits are actually available.)
> 
> I think this is ugly as hell, but it also is the best proposal so far.

It would be far easier to dump the binary representation of the entire 
floating point number, including the exponent bits. If this format were 
allowed, a diagnostic printer with no support for stringifying floating point 
numbers could use it for other values too.
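
As a rough sketch of what "dump the entire number" could look like for a
printer with no floating-point formatting, the Python below just copies the
argument bytes out as hex. The function name and the "0x..." output spelling
are assumptions made for illustration; the actual surface syntax is not
settled in this message.

```python
# Argument sizes in bytes for the three CBOR float heads.
_FLOAT_SIZES = {0xF9: 2, 0xFA: 4, 0xFB: 8}


def float_bits_hex(item: bytes) -> str:
    """Return the raw bits of an encoded CBOR float as a hex string.

    Hypothetical helper: sign, exponent, and significand bits are all
    included, and no floating-point arithmetic or formatting is needed.
    """
    size = _FLOAT_SIZES.get(item[0]) if item else None
    if size is None or len(item) != 1 + size:
        raise ValueError("not a well-formed CBOR float item")
    return "0x" + item[1:].hex()
```

For example, 0xf97e00 would yield "0x7e00" and 0xfa7fc00000 would yield
"0x7fc00000".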

A second option would be to print it the way gdb does:

	-nan(0x20000)

The type information can be encoded as either "nanf" (float) or "nanf16" (for 
_Float16).
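
A possible Python sketch of the gdb-like form follows. It assumes the value
in parentheses is the whole significand field (quiet bit included) and that
the sign goes in front of the "nanf" / "nanf16" name; both are guesses made
for this example, as is the function name.

```python
# width in bits -> (significand field bits, type suffix)
_LAYOUT = {16: (10, "f16"), 32: (23, "f"), 64: (52, "")}


def nan_gdb_style(bits: int, width: int) -> str:
    """Format the raw bits of a NaN roughly the way gdb prints them."""
    frac_bits, suffix = _LAYOUT[width]
    exp_mask = (1 << (width - 1 - frac_bits)) - 1
    sign = bits >> (width - 1)
    exponent = (bits >> frac_bits) & exp_mask
    fraction = bits & ((1 << frac_bits) - 1)
    # A NaN has an all-ones exponent and a non-zero significand field.
    if exponent != exp_mask or fraction == 0:
        raise ValueError("not a NaN")
    sign_text = "-" if sign else ""
    return f"{sign_text}nan{suffix}(0x{fraction:x})"
```

Under those assumptions, the single-precision bit pattern 0xff820000 would be
printed as "-nanf(0x20000)".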

If parentheses are not a good idea, then a third option is to print only the 
significand bits, dropping the "0x" too. It's clearly hexadecimal, so we don't 
need the "x". In that case, we'd use the _ modifier to indicate the encoding 
length, as in "-nan_001_1" for a negative 16-bit signalling NaN.
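
Sketched in Python, the parenthesis-free form might look like this. The
zero-padding of the hex digits to the width of the significand field (3, 6,
or 13 digits) is inferred from the "001" in the example above, and the
function name is made up; treat both as assumptions.

```python
# width in bits -> (significand field bits, encoding indicator)
_LAYOUT = {16: (10, 1), 32: (23, 2), 64: (52, 3)}


def nan_underscore_style(bits: int, width: int) -> str:
    """Format the raw bits of a NaN as [-]nan_<hex significand>_<indicator>."""
    frac_bits, indicator = _LAYOUT[width]
    exp_mask = (1 << (width - 1 - frac_bits)) - 1
    sign = bits >> (width - 1)
    exponent = (bits >> frac_bits) & exp_mask
    fraction = bits & ((1 << frac_bits) - 1)
    # A NaN has an all-ones exponent and a non-zero significand field.
    if exponent != exp_mask or fraction == 0:
        raise ValueError("not a NaN")
    digits = (frac_bits + 3) // 4   # hex digits needed for the field
    sign_text = "-" if sign else ""
    return f"{sign_text}nan_{fraction:0{digits}x}_{indicator}"
```

The half-precision bit pattern 0xfc01 then gives "-nan_001_1", matching the
example.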

Either way, the document should advise what to do about signalling versus 
quiet NaNs. The CBOR spec recommends that only the IEEE-recommended 
non-signalling form be used:

"If NaN is an allowed value, it must always be represented as 0xf97e00."

The table in Appendix A also lists 0xfa7fc00000 and 0xfb7ff8000000000000, 
which match the IEEE recommendations for QNaN. But the Wikipedia page warns 
that some (older?) machines invert it and encode signalling NaN with the 
topmost mantissa bit set.
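
For reference, the IEEE-recommended convention (topmost mantissa bit set
means quiet) can be checked with a few bit operations. This Python sketch
assumes that convention only; it would give the opposite answer on the
machines that invert it. The function name is hypothetical.

```python
# width in bits -> number of significand field bits
_FRACTION_BITS = {16: 10, 32: 23, 64: 52}


def classify_nan(bits: int, width: int) -> str:
    """Return "quiet" or "signalling" for the raw bits of a NaN."""
    frac_bits = _FRACTION_BITS[width]
    exp_mask = (1 << (width - 1 - frac_bits)) - 1
    exponent = (bits >> frac_bits) & exp_mask
    fraction = bits & ((1 << frac_bits) - 1)
    # A NaN has an all-ones exponent and a non-zero significand field.
    if exponent != exp_mask or fraction == 0:
        raise ValueError("not a NaN")
    quiet_bit = 1 << (frac_bits - 1)   # topmost bit of the significand field
    return "quiet" if fraction & quiet_bit else "signalling"
```

All three Appendix A values (0x7e00, 0x7fc00000, 0x7ff8000000000000) classify
as quiet under this check.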

I would recommend that it print a bare "nan" only if the value is one of 
those three encodings. And if the notation distinguishes the encoding length, 
"nan" is only for the double-precision case, with the others requiring "nanf" 
or "nanf16" (if using the variant with parentheses). If not using 
parentheses, that would be "nan_1" for the 16-bit case, and "nan_2" and 
"nan_3" for the single- and double-precision ones respectively.
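
A small Python sketch of that rule, keyed on the three encodings listed
above. The function and parameter names are invented for the example, and
"nan_1" for the 16-bit case is my reading of the _1/_2/_3 indicators rather
than settled syntax. A return value of None means the printer should fall
back to one of the payload-carrying forms.

```python
# encoded item -> (name in the parentheses variant, name with _ indicators)
_CANONICAL_NANS = {
    bytes.fromhex("f97e00"): ("nanf16", "nan_1"),
    bytes.fromhex("fa7fc00000"): ("nanf", "nan_2"),
    bytes.fromhex("fb7ff8000000000000"): ("nan", "nan_3"),
}


def plain_nan_name(item: bytes, distinguish_width: bool = False,
                   parentheses: bool = True):
    """Return the payload-free NaN spelling for item, or None.

    Only the three canonical quiet NaN encodings get a payload-free name.
    """
    names = _CANONICAL_NANS.get(bytes(item))
    if names is None:
        return None          # not canonical: needs a payload-carrying form
    if not distinguish_width:
        return "nan"
    return names[0] if parentheses else names[1]
```

So 0xf97e00 prints as "nan", "nanf16", or "nan_1" depending on the variant,
while something like 0xf97e01 gets no bare name at all.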

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering