Re: [Cbor] 🔔 WGLC on draft-ietf-cbor-7049bis-09

Carsten Bormann <cabo@tzi.org> Wed, 15 January 2020 15:36 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CANh-dXnPRd7w_z2LA0gYD0GHVbmych4BGA5_-vmJz+Zn1qBh_w@mail.gmail.com>
Date: Wed, 15 Jan 2020 16:36:13 +0100
Cc: Francesca Palombini <francesca.palombini=40ericsson.com@dmarc.ietf.org>, "cbor@ietf.org" <cbor@ietf.org>, "draft-ietf-cbor-7049bis@ietf.org" <draft-ietf-cbor-7049bis@ietf.org>, "cbor-chairs@ietf.org" <cbor-chairs@ietf.org>
Message-Id: <3050C0BF-2B1A-4396-ADED-AD5ED4C6EC60@tzi.org>
References: <293AFF31-D0EF-45D6-9B9D-E8136481C404@ericsson.com> <CANh-dXnPRd7w_z2LA0gYD0GHVbmych4BGA5_-vmJz+Zn1qBh_w@mail.gmail.com>
To: Jeffrey Yasskin <jyasskin@chromium.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/cd1iixDnE0y40gmhtgOFYOjZjFc>

Hi Jeffrey,

Beyond the one comment that I singled out on 3.4.4, here are my responses/proposals:

> 1.  Introduction
> 	• "The format defined here follows some specific design goals that are not well met by current formats." <- "Follows" doesn't really work. Maybe "pursues"? "Satisfies"?

"Pursues", I’d say.

> 	• "It is important to note that this is not a proposal that the grammar in RFC 8259 be extended in general, since doing so would cause a significant backwards incompatibility with already deployed JSON documents." <- This is probably not necessary anymore. It could be removed entirely or replaced with "This format is not an extension of the grammar in RFC 8259."

I like that.

> 3.1.  Major Types
> 	• In Table 1, can we spell out "Major Type" instead of "mt"?

We can, but that reduces space for the other columns; I’d rather fix the caption to repeat the text about mt and N above the table.

I took the liberty of adding the equivalent table for indefinite length as Table 2; please review.
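
For readers following along, here is a minimal sketch (mine, not text from the draft) of how the initial byte splits into the major type "mt" and the additional information from which the argument N is derived.  The function name read_head is hypothetical, and the sketch does not check whether a given combination is well-formed for its major type.

```python
def read_head(data: bytes):
    """Return (major_type, argument, head_length) for the item starting at data[0].

    The argument is None when the additional information is 31
    (indefinite length, or the "break" stop code for major type 7).
    """
    initial = data[0]
    major_type = initial >> 5          # high-order 3 bits
    additional = initial & 0x1F        # low-order 5 bits
    if additional < 24:
        # The argument is the additional information value itself.
        return major_type, additional, 1
    if additional <= 27:
        # 24, 25, 26, 27 -> argument carried in the next 1, 2, 4, 8 bytes.
        size = 1 << (additional - 24)
        argument = int.from_bytes(data[1:1 + size], "big")
        return major_type, argument, 1 + size
    if additional == 31:
        return major_type, None, 1
    raise ValueError("additional information values 28 to 30 are reserved")


print(read_head(bytes.fromhex("1903e8")))  # unsigned integer 1000
print(read_head(bytes.fromhex("9f")))      # start of an indefinite-length array
```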

> 3.4.  Tagging of Items
> 	• I'm not sure what "while retaining its structure" accomplishes here. Can we remove it?

(It really means: “without the need to change its structure”.)
This is a bit of a trap door thing: Once you understand the concept, this appears redundant.
But I’m also at a loss as to how to say this better, so I agree we should remove it.

> 	• "That is, a tag is a data item consisting of a tag number and an enclosed value. The content of the tag (the enclosed data item) is the data item (the value) that is being tagged." might be unnecessary. I think it's covered by the earlier text in this section. If it's still needed, it should probably move earlier to where we define tagged data. It doesn't fit next to the discussion of what it means to put a tag inside a tag.

Right, we should reverse the order.

> 	• "it can just jump over the initial bytes of the tag (that encode the tag number)" isn't quite right: it's not just skipping it, it's reporting both the tag number and value to the application. Maybe "Understanding the semantics of tags is optional for a decoder: it can present the tag number and content to the application without interpreting the tag as a whole."

I made this a bit more specific about “interpreting”.

> 3.4.1.  Date and Time
> 	• The next two sections seem like they should be subsections.

(This was already changed earlier.)

> 
> 3.4.3.  Epoch-based Date/Time
> 	• "An application that requires tag number 1 support may restrict" has a lowercase MAY, which has an ambiguous effect after RFC 8174. Do we want MAY or can?

This is not an interoperability MAY, as it describes what an application can do.
That may raise the general question of how we are using BCP 14 language, but RFC 8174 was written to clarify that lowercase “may” does not have the special meaning of uppercase MAY.

> 3.4.4.  Bignums
> 	• This section has a forward reference to "preferred encoding", which should cite section 4.1. I note that 4.1 uses "preferred serialization" instead, so maybe we should switch this section to that term.

Interesting fine point.  I changed all “preferred encoding” to “preferred serialization” (except the one that is used in defining the latter) and added the reference.  We may want to keep some form of differentiation between preferred serialization (which is never visible at the basic data model level) and the choice of a basic data model realization for an extended data model value; maybe we should discuss this with the item below.

> 	• "and preferred encoding never makes use of bignums that also can be expressed as basic integers (see below)." <- This seems inconsistent with "In the generic data model, bignum values are not equal to integers from the basic data model". If they're not the same value at the data model level, they can't be alternate encodings of each other.

See separate message.
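
To make the point under discussion concrete, here is a rough sketch of an encoder that only falls back to a bignum (tag number 2 or 3) when the value does not fit a basic integer.  It assumes the application treats a bignum and a basic integer of the same numeric value as interchangeable, which is exactly the question above; the names encode_head and encode_integer are hypothetical.

```python
def encode_head(major_type: int, argument: int) -> bytes:
    """Encode a major type and argument using the shortest head form."""
    if argument < 24:
        return bytes([(major_type << 5) | argument])
    for additional, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if argument < 1 << (8 * size):
            return bytes([(major_type << 5) | additional]) + argument.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode_integer(n: int) -> bytes:
    """Encode n as a basic integer if possible, otherwise as a bignum."""
    if 0 <= n < 2**64:
        return encode_head(0, n)              # major type 0: unsigned integer
    if -(2**64) <= n < 0:
        return encode_head(1, -1 - n)         # major type 1: negative integer
    # Outside the basic integer range: tag 2 (unsigned) or tag 3 (negative),
    # enclosing a byte string without leading zero bytes.
    tag_number, magnitude = (2, n) if n >= 0 else (3, -1 - n)
    payload = magnitude.to_bytes((magnitude.bit_length() + 7) // 8, "big")
    return encode_head(6, tag_number) + encode_head(2, len(payload)) + payload


print(encode_integer(2**64 - 1).hex())  # still a basic integer
print(encode_integer(2**64).hex())      # first value that needs tag 2
```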

> 3.4.5.  Decimal Fractions and Bigfloats
> 	• "Decimal fractions (tag number 4) use base-10 exponents; the value of a decimal fraction data item is m*(10**e). Bigfloats (tag number 5) use base-2 exponents; the value of a bigfloat data item is m*(2**e)." is redundant with the first paragraph of the section.

Somewhat.  The first paragraph says what can be represented, the fourth exactly what is being represented.  Is the redundancy a big problem?
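
For reference, the two formulas quoted above can be checked with exact arithmetic.  This is only an illustrative sketch; the function names are hypothetical, and the argument order (exponent e first, then mantissa m) follows the order of the two array elements.

```python
from fractions import Fraction


def decimal_fraction_value(e: int, m: int) -> Fraction:
    """Value of a decimal fraction (tag number 4): m * (10 ** e)."""
    return m * Fraction(10) ** e


def bigfloat_value(e: int, m: int) -> Fraction:
    """Value of a bigfloat (tag number 5): m * (2 ** e)."""
    return m * Fraction(2) ** e


# 273.15 as a decimal fraction with exponent -2 and mantissa 27315.
print(decimal_fraction_value(-2, 27315) == Fraction("273.15"))
# 1.5 as a bigfloat with exponent -1 and mantissa 3.
print(bigfloat_value(-1, 3) == Fraction(3, 2))
```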

> 	• This section also suggests that integers be used instead of integral bigdecimals and bigfloats. That only works if the specific data model says they're equivalent. Maybe we should say specific data models SHOULD make them equivalent and then SHOULD set the preferred encoding to the integer version?

Generally, we have been moving away from making int and float equivalent, so I agree this recommendation needs to be reshaped.   A similar question occurs with basic floating point as well, so we should come to a common view on these.  Now issue #160: https://github.com/cbor-wg/CBORbis/issues/160

> 3.4.6.2.  Expected Later Encoding for CBOR-to-JSON Converters
> 	• This section only defines the tags obliquely and never says what tag 23 means. I suggest starting sentences with the tag number, e.g. "Tag number 21 means the contained byte string is expected to be encoded in base64url without padding ... Tag number 22 means ..."

Did this; added reference to Section 8 of RFC 4648, and emphasized the uppercase alphabetics used there.
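
As a sketch of what a CBOR-to-JSON converter might do with these three tags: the function name is hypothetical, and I am assuming unpadded base64url for tag number 21, padded base64 (Section 4 of RFC 4648) for tag number 22, and uppercase base16 (Section 8 of RFC 4648) for tag number 23.

```python
import base64


def expected_json_text(tag_number: int, data: bytes) -> str:
    """Render a tagged byte string the way the tag says it is expected to be encoded."""
    if tag_number == 21:
        # base64url without padding
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")
    if tag_number == 22:
        # classical base64, padding kept
        return base64.b64encode(data).decode("ascii")
    if tag_number == 23:
        # base16 with uppercase alphabetics
        return base64.b16encode(data).decode("ascii")
    raise ValueError("not an expected-later-encoding tag number")


sample = bytes.fromhex("fbff")
for tag_number in (21, 22, 23):
    print(tag_number, expected_json_text(tag_number, sample))
```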

> 3.4.6.3.  Encoded Text
> 	• "Tag numbers 33 and 34 are for base64url- and base64-encoded text strings" should maybe have "respectively"?

Added.

> 
> 4.1.  Preferred Serialization
> 	• "1_000_000_000" has enough digits that maybe we should use 10**9 (or 10<sup>9</sup> in v3) instead.

Three by three is obvious enough to me, but I have added “(one billion)” to make things more fun for milliard-educated people like me.

> 4.2.  Deterministically Encoded CBOR
> 	• "Some protocols may want" has a lowercase "may". Consider "might".

See comment on RFC 8174 above.
(This is not a sufficiently remote corner case to use “might”, in my eyes.)

> 4.2.1.  Core Deterministic Encoding Requirements
> 	• This section says "Floating point values also MUST use the shortest form that preserves the value, e.g. 1.5 is encoded as 0xf93e00 and 1000000.5 as 0xfa49742408.", but 4.2.2 says "If a protocol allows for IEEE floats, then additional deterministic encoding rules might need to be added." We should only put the float rule in one of these sections.

Yes, we are now probably ready to reduce the variety here.  I moved over the text about trying successively shorter representations as an implementation note for 4.2.1.  I also added some text about an application wanting to limit precision.
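
The "successively shorter representations" approach could look roughly like the following.  This is a sketch, not the text of the implementation note; the function name is hypothetical, and NaN is deliberately not special-cased (it never compares equal to itself, so this sketch leaves it in the 64-bit form).

```python
import struct


def encode_float_shortest(value: float) -> bytes:
    """Encode value in the shortest IEEE 754 form that preserves it exactly."""
    best = b"\xfb" + struct.pack(">d", value)      # 64-bit form always works
    # Try the 32-bit form, then the 16-bit form; stop at the first one
    # that cannot represent the value without change.
    for initial_byte, fmt in ((0xFA, ">f"), (0xF9, ">e")):
        try:
            packed = struct.pack(fmt, value)
        except OverflowError:
            break                                   # magnitude too large for this width
        if struct.unpack(fmt, packed)[0] != value:
            break                                   # precision would be lost
        best = bytes([initial_byte]) + packed
    return best


print(encode_float_shortest(1.5).hex())        # f93e00
print(encode_float_shortest(1000000.5).hex())  # fa49742408
```

The two printed values match the examples quoted from 4.2.1 above.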

> 4.2.2.  Additional Deterministic Encoding Considerations
> 	• "the deterministic format would not allow them" isn't clear what "them" is. Do we mean "would not allow the data to be tagged"? Or should we just say that the deterministic format for the protocol needs to specify whether the tag is or is not present?
> 	• The "If a protocol includes a field that can express floating-point values" paragraph also assumes "… and the protocol's specific data model declares integers and floating point values to be interchangeable".

Right.  Case 3 is a bit questionable, as this is not conforming to preferred serialization.

> 	• The "A protocol might give encoders the choice of representing a URL ..." item feels like it's repeating the "CBOR tags present …" paragraph. Maybe we should move it to an example in that paragraph?

Moved and made into an example.

> 	• Maybe the "Protocols that include floating, big integer, or other complex values need to define extra requirements on their deterministic encodings. For example:" introductory sentence belongs at the top of the whole section.

Right.  I restructured the section a bit.
The part about bignums maybe also needs to be made stronger.

> 5.  Creating CBOR-Based Protocols
> 	• "This section discusses" might read better as "The rest of this section" since it's after a bit of the section.

Right.

> 
> 5.2.  Generic Encoders and Decoders
> 	• "Even though CBOR attempts to minimize these cases, not all well-formed CBOR data is valid" is redundant with a lot of text from earlier in the document.

Right.  Redundancy is not always bad, though; some repetition can be useful.

> 	• I wonder if this whole subsection is out of place. It reads like a definition of generic en/decoders rather than a consideration for designing protocols. Maybe it should be a subsection of "Terminology"?

Good point.  Maybe the results of processing #156 will also point in that direction.
But the concept is not just terminology, it is a constraint on the design decisions of CBOR and its protocols.  If a protocol is designed in such a way that it cannot use generic codecs, it really gives up a major advantage of using CBOR — this is why the subsection is here at the moment.

> 5.3.  Validity of Items
> 	• "The first layer that does process the semantics of an invalid CBOR item MUST take one of two choices:" covers our discussion of duplicate map keys. Right now, our requirements aren't consistent with the requirements in this section, so we should make sure to incorporate this section when we resolve those.

Right.  Issue #63.

> 
> 5.3.2.  Tag validity
> 	• "might present this tag to the application in a similar way to how it would present a tag with an unknown tag number" seems to suggest that it's wrong to replace the invalid tag with an error marker or to stop processing entirely, even though that's what 5.3 suggests.

Again, discussion about what a generic decoder should do.  Or, here, specifically how the validity function of a tag could evolve.

> 5.5.  Numbers
> 	• "the JavaScript number system treats all numbers as floating point" is no longer true: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt

BigInts are not JavaScript Numbers.  (This legacy-forced separation is a whole can of worms and does not really detract from what is said.)  I added “basic”.

> 	• "a compact application should accept" uses a lowercase "should" and would seem to discourage compact applications that check for a deterministic decoding. Do we mean that accepting wider encodings is likely to make the application more compact?

Not all applications SHOULD check for deterministic encoding…
Added “that does not require deterministic encoding”.

> 	• "The preferred encoding for a floating-point value is the shortest floating-point encoding" is redundant with section 4.1, although it does include more detail. I *think* I'd rather put the whole definition of the preferred encoding in 4.1 instead of having some of it in protocol considerations.

Moved (and added back some text here).

> 5.6.  Specifying Keys for Maps
> 	• "probably should be limited" -> "may need to be limited" or "the specification is probably simpler if"? To avoid BCP 14 terms. 

Restated this slightly.

> 	• We're already discussing the question of duplicate keys in another thread.

Yes; issue #63.

> 	• "except to specify that some, orders" has an extra comma.
> 	• "be enough reason on its own" -> "on their own"
> 	• The "should consider using small integers as keys" has the downside that it makes it harder for humans to understand the meaning of the data without the schema. "for constrained devices" might protect the rest of us from that downside, but would it make sense to say it explicitly?

One word: “BTU” :-)
I think we are already explicit about this, but we don’t mention the downside (or the upside that you are now forced to properly define the things).

> 5.6.1.  Equivalence of Keys
> 	• This section might be shorter if it just says that map keys are duplicates if they have the same value in the generic data model or if the specific data model for the protocol (Section 2.2) says they're equivalent. The rest of the section just duplicates information that's already in Section 2. The note in the last paragraph does still seem useful.

Will handle with #63.

> 8.1.  Encoding Indicators
> 	• "Note that the encoding indicator "_" is thus an abbreviation of the full form "_7", which is not used." is confusing where it is. It might make more sense if we swap its paragraph with the previous one and move it to after the definition of "[_ 1, 2]".

I’m not sure I understand how to do this without forward references.

> 	• "As a special case, byte and text strings of indefinite length" doesn't seem like a special case to me. It's just the way you represent the encoding of an indefinite-length byte or text string.

Yes.  Removed “as a special case”.
(We still don’t have a way to notate an empty indefinite-length string.)
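
On the wire the empty case is unproblematic, as a small sketch shows.  The function name is hypothetical, and for brevity each chunk is limited to fewer than 24 bytes so that its length fits in the initial byte.

```python
def encode_indefinite_bytes(chunks: list[bytes]) -> bytes:
    """Encode an indefinite-length byte string from definite-length chunks."""
    out = bytearray([0x5F])                  # major type 2, additional information 31
    for chunk in chunks:
        if len(chunk) >= 24:
            raise ValueError("this sketch only handles chunks shorter than 24 bytes")
        out.append(0x40 | len(chunk))        # definite-length byte string head
        out += chunk
    out.append(0xFF)                         # "break" stop code
    return bytes(out)


# (_ h'0123', h'4567') in diagnostic notation
print(encode_indefinite_bytes([bytes.fromhex("0123"), bytes.fromhex("4567")]).hex())
# Zero chunks: the empty indefinite-length byte string
print(encode_indefinite_bytes([]).hex())
```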

> 
> 9.2.  Tags Registry
> 	• Did we decide not to tighten registration for the 256–65535 space?

I couldn’t find anything conclusive; the discussion seemed to center on the designated expert.

> 9.3.  Media Type ("MIME Type")
> 	• Is there a reason this section switches to artwork in the middle?

Laziness.

(Oh, this probably should be fixed, too, now #161.
There is a long-standing bug in XML2RFC that creates incorrect formatting if this is done right.)

> 
> 9.4.  CoAP Content-Format
> 	• Could this section link to https://www.iana.org/assignments/core-parameters/core-parameters.xhtml#content-formats?

Yes.  (Not sure how to do the sub registry part, though.)


> Appendix A.  Examples
> 	• This starts with some references to Unicode code points, which could use the new <u> tag.

If we want to take a full v3 plunge…

> 
> Appendix F.  Changes from RFC 7049
> 	• This looks quite incomplete.

Indeed!  Now #163: https://github.com/cbor-wg/CBORbis/issues/162

Grüße, Carsten
