Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

-----Original Message-----
From: CBOR <cbor-bounces@ietf.org> On Behalf Of Benjamin Kaduk via Datatracker
Sent: Monday, September 7, 2020 3:07 PM
To: The IESG <iesg@ietf.org>
Cc: cbor@ietf.org; draft-ietf-cbor-7049bis@ietf.org; cbor-chairs@ietf.org; francesca.palombini@ericsson.com
Subject: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Benjamin Kaduk has entered the following ballot position for
draft-ietf-cbor-7049bis-14: Discuss

When responding, please keep the subject line intact and reply to all email addresses included in the To and CC lines. (Feel free to cut this introductory paragraph, however.)

Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis/

----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Thanks for this document; it's generally well-written and the changes since 7049 are helpful.  I do have a few points that may need discussion before publication, though.

Let's discuss whether the framing of tag number 35 for "regular expressions that are roughly in [PCRE] form or a version of the JavaScript regular expression syntax" meets the interoperability expectations for Internet Standard status (bearing in mind that we are defining a data format and not a protocol).  I note that it is okay to leave the codepoint allocated with the current meaning and the previous document as its reference, but decline to discuss it in the document going for STD (we are in the process of doing that with COSE countersignatures at the moment).

The example in Section 5 of "the item is a map that has byte strings for keys and contains at least one pair whose key is 0xab01" seems to be in violation of the definition of a valid map, since applications are not allowed to rely on invalid behavior.  (That is, the implied "more than one pair whose key is 0xab01" would be invalid.)

[JLS] This was not how I read the text; I read it as "not zero".  However, it would make sense to make this edit.

I think that the new deterministic encoding rules for sorting map keys should be clear about whether "no content" sorts before or after "content present" -- that is, how 0x10 and 0x1020 are ordered when the 0x10 byte is identical and we have to compare <nothing> with 0x20.

[JLS] I don't see this.  Going out to Wikipedia (the true answer for everything).  The rule that is listed there is:
"If two words have different lengths, the usual lexicographical order pads the shorter one with "blanks" (a special symbol that is treated as smaller than every element of A) until the words are the same length, and then the words are compared as in the previous case."

This means that if A is a prefix of B and |A| < |B|, A precedes B in lexicographic order.
[/JLS]
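
As an illustration of the rule quoted above, here is a minimal sketch in C of a bytewise lexicographic comparison in which a proper prefix sorts first.  The function name cmp_encoded_keys is hypothetical and is not taken from the draft; the inputs are simply treated as raw byte sequences, as in the 0x10 versus 0x1020 example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the draft): compares two byte sequences
 * in bytewise lexicographic order.  Returns a negative value, zero, or a
 * positive value, in the style of memcmp(). */
static int cmp_encoded_keys(const unsigned char *a, size_t a_len,
                            const unsigned char *b, size_t b_len)
{
    size_t common = a_len < b_len ? a_len : b_len;
    /* Compare the shared prefix first; skip memcmp for an empty prefix. */
    int c = common ? memcmp(a, b, common) : 0;
    if (c != 0)
        return c;
    /* The shared prefix is identical: the shorter sequence sorts first. */
    return (a_len > b_len) - (a_len < b_len);
}
```

Under this reading, 0x10 sorts before 0x1020, because <nothing> is treated as smaller than any byte.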

The discussion in Appendix C suggests that C (programming language) implementations all use two's-complement representation of signed integers; this requirement is present in POSIX but not C itself (I verified this for C99 and C11).

Additionally, the encode_sint() function (also Appendix C) relies on C implementation-defined behavior while right-shifting a signed integer.
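
For illustration only, one way to sidestep both the two's-complement assumption and the signed right shift is to branch on the sign and do all bit manipulation on an unsigned value.  This is a minimal sketch, not the draft's code; the names cbor_int_head and split_sint are hypothetical.  It relies only on the fact that converting a negative value to uint64_t is defined by C as reduction modulo 2^64.

```c
#include <stdint.h>

/* Hypothetical result type: CBOR major type (0 or 1) plus argument. */
struct cbor_int_head {
    unsigned major_type;
    uint64_t argument;
};

/* Hypothetical helper (not from the draft): splits a signed integer into
 * major type and argument without shifting a signed value. */
static struct cbor_int_head split_sint(int64_t n)
{
    struct cbor_int_head h;
    if (n < 0) {
        h.major_type = 1;
        /* (uint64_t)n is 2^64 + n, so its complement is -1 - n. */
        h.argument = ~(uint64_t)n;
    } else {
        h.major_type = 0;
        h.argument = (uint64_t)n;
    }
    return h;
}
```

The branch costs a little compared with the shift-and-XOR idiom, but every operation has fully defined behavior in C99 and C11.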

The C decode_half() function in Appendix D assumes that 'int' is wider than 16 bits (since assigning a value to an int16_t variable when the value is not representable in int16_t incurs implementation-defined behavior).  Given that this spec is specifically targeting constrained devices, it's not clear that such an assumption is justified.  (It also right shifts a signed integer, incurring the same implementation-defined behavior mentioned above.  (The bitwise AND against 0x8000 is also problematic for an int16_t.))
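
As a sketch of what a variant without these assumptions could look like, the following keeps all bit manipulation in 'unsigned' (which C guarantees to hold at least 16 bits) and converts to a signed value only for the small exponent passed to ldexp().  The name decode_half_portable is hypothetical, and this is not the text of Appendix D.

```c
#include <math.h>

/* Hypothetical variant (not the draft's code): decodes a big-endian
 * IEEE 754 binary16 value using only unsigned bit operations. */
double decode_half_portable(const unsigned char *halfp)
{
    /* Cast before shifting so the shift never happens in a 16-bit int. */
    unsigned half = ((unsigned)halfp[0] << 8) | halfp[1];
    unsigned exp = (half >> 10) & 0x1fu;
    unsigned mant = half & 0x3ffu;
    double val;

    if (exp == 0)
        val = ldexp((double)mant, -24);            /* zero or subnormal */
    else if (exp != 31)
        val = ldexp((double)(mant + 1024u), (int)exp - 25);
    else
        val = (mant == 0) ? INFINITY : NAN;        /* infinity or NaN */

    return (half & 0x8000u) ? -val : val;
}
```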

[JLS] I'll let others argue this.  But I think that if you don't use two's-complement then you most likely break a lot of code.

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Is there a comprehensive list of things that generic (en/de)coders need to document their behavior for (e.g., how they handle duplicate map keys; whether/what validity checking is done, including which tag numbers are supported)?

We use the expression "simple value" around 30 times, but "simple type value" only twice (and "simple type" a few other times); are we happy with the consistency of usage?

Please also note my comments on the IANA considerations in the per-section comments; at least the first couple are fairly consequential.

I'm pretty sympathetic to the secdir reviewer's desire for guidance on how to implement validity checking.  I think it would be possible to slot this into the existing discussion of validity in §5.3/5.4, possibly as an additional subsection reiterating that it's required to check the bits in 5.3.1/5.3.2, and the expectation that such checks are likely to be incomplete in the face of new tag number allocations.

Section 1.2

   Where bit arithmetic or data types are explained, this document uses
   the notation familiar from the programming language C, except that

In recent memory we've asked for some form of reference for "the programming language C" (even though the concepts we draw on are likely to remain invariant for anything called C).

Section 2

   In the basic (un-extended) generic data model, a data item is one of:
   [...]
   *  a sequence of zero or more Unicode code points ("text string")

Hmm, since we use "data item" for both the abstract idea and the representation-format version, this description is only precise for the abstract version (the representation is further constrained to UTF-8).
I am not sure whether there is a concise way to accurately express this state, though.

Section 3

   The initial byte and any additional bytes consumed to construct the
   argument are collectively referred to as the "head" of the data item.

side note: Interesting that we define "head" but do not use "tail" :)

Section 3.3

   meaning, as defined in Table 3.  Like the major types for integers,
   items of this major type do not carry content data; all the
   information is in the initial bytes.

(editorial) The "head", as it were, right?

Section 3.4

   Conceptually, tags are interpreted in the generic data model, not at
   (de-)serialization time.  A small number of tags (specifically, tag
   number 25 and tag number 29) have been registered with semantics that
   may require processing at (de-)serialization time: The decoder needs

I suggest adding additional language to reiterate that this is a point-in-time statement (and thus that there may be other such tags in existence).

   This means these tags cannot be implemented on top of every generic
   CBOR encoder/decoder (which might not reflect the serialization order
   for entries in a map at the data model level and vice versa); their
   implementation therefore typically needs to be integrated into the
   generic encoder/decoder.  The definition of new tags with this
   property is NOT RECOMMENDED.

So we should give guidance to the DEs for the registry in question to that effect?

   IANA allocated tag numbers 65535, 4294967295, and
   18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
   These can be used as a convenience for implementers that want a
   single integer to indicate either that a specific tag is present, or
   the absence of a tag.  That allocation is described in Section 10 of

(editorial) Chasing the reference, I suggest that it is a "single integer *data structure*" in the implementation's internal representation; just reading this text alone left me confused as to how this was intended to be used.
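
To make the intended use concrete, here is a minimal sketch of how an implementation's internal representation might use the 64-bit all-ones tag number as an "absent" marker.  The names CBOR_NO_TAG, decoded_item_info, and item_has_tag are hypothetical; this is my reading of the referenced text, not something it specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the implementation stores tag numbers in a uint64_t and
 * uses the reserved all-ones value to mean "no tag present". */
#define CBOR_NO_TAG UINT64_MAX

/* Hypothetical internal record for a decoded data item. */
struct decoded_item_info {
    uint64_t tag;   /* a real tag number, or CBOR_NO_TAG */
};

/* Returns true if the item carries a tag. */
static bool item_has_tag(const struct decoded_item_info *item)
{
    return item->tag != CBOR_NO_TAG;
}
```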

   [I-D.bormann-cbor-notable-tags].  These tags are not intended to
   occur in actual CBOR data items; implementations may flag such an
   occurrence as an error.

I could maybe see this as "MAY".

Section 3.4.2

Thank you for mentioning leap seconds!

   Note that platform types for date/time may include null or undefined
   values, which may also be desirable at an application protocol level.
   While emitting tag number 1 values with non-finite tag content values
   (e.g., with NaN for undefined date/time values or with Infinite for
   an expiry date that is not set) may seem an obvious way to handle
   this, using untagged null or undefined is often a better solution.
   Application protocol designers are encouraged to consider these cases
   and include clear guidelines for handling them.

It's rather unfortunate that the text here doesn't provide any justification for the claim of "better solution" (or reference to such justification).

Section 3.4.3

   occurs in a bignum when using preferred serialization).  Note that
   this means the non-preferred choice of a bignum representation
   instead of a basic integer for encoding a number is not intended to
   have application semantics (just as the choice of a longer basic
   integer representation than needed, such as 0x1800 for 0x00 does
   not).

It may be "not intended to", but it does, if you're using a decoder in the generic data model.  We should be sure to cover the security considerations of this disparity (and the corresponding need for an application using CBOR to specify the data model it uses).
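
As a rough illustration of where the disparity shows up, an application or decoder that wants to notice a non-preferred bignum could check whether the tag content would have fit in a basic integer.  This is a minimal sketch under the assumption that the content of tag 2 or tag 3 is available as a big-endian byte string; the name bignum_fits_basic_int is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the big-endian bignum content
 * represents a value n < 2^64, i.e. the number could have been encoded
 * as a basic integer (major type 0 for tag 2, major type 1 for tag 3). */
static bool bignum_fits_basic_int(const unsigned char *content, size_t len)
{
    /* Leading zero bytes do not change the value. */
    while (len > 0 && content[0] == 0) {
        content++;
        len--;
    }
    return len <= 8;
}
```

What the application then does with such an item (treat it as equal to the basic integer, or reject it) is exactly the data-model decision that needs to be specified.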

Section 3.4.5.3

   Note that tag numbers 33 and 34 differ from 21 and 22 in that the
   data is transported in base-encoded form for the former and in raw
   byte string form for the latter.

Do we want to mention tag 23 as well (as being the raw byte string)?

Section 4.2.1

[I did not validate the hex-encoded IEEE754 against the decimal values.]

   *  Indefinite-length items MUST NOT appear.  They can be encoded as
      definite-length items instead.

One could perhaps argue that a deterministic encoding procedure that uses indefinite-length items is possible, and even useful in some cases.
This might argue for moving this requirement to Section 4.2.2's list of "additional considerations".  That said, an application is not obligated to use these core rules and can define its own rules if needed, so I don't object to this requirement.

Section 4.2.3

   (Although [RFC7049] used the term "Canonical CBOR" for its form of
   requirements on deterministic encoding, this document avoids this
   term because "canonicalization" is often associated with specific
   uses of deterministic encoding only.  The terms are essentially
   interchangeable, however, and the set of core requirements in this
   document could also be called "Canonical CBOR", while the length-
   first-ordered version of that could be called "Old Canonical CBOR".)

If this document avoids the term, maybe the final sentence should not be present?

Section 5

   CBOR-based protocols MUST specify how their decoders handle invalid
   and other unexpected data.  CBOR-based protocols MAY specify that
   they treat arbitrary valid data as unexpected.  Encoders for CBOR-
   based protocols MUST produce only valid items, that is, the protocol
   cannot be designed to make use of invalid items.  An encoder can be

Just to check: my interpretation is that CBOR Sequences are compatible with this requirement, since they use valid data items and just encode them in sequence.  Right?

Section 5.1

   Other decoders can present partial information about a top-level data
   item to an application, such as the nested data items that could
   already be decoded, or even parts of a byte string that hasn't
   completely arrived yet.

This has potential to make some security types antsy, if coupled with encryption mechanisms that release alleged plaintext prior to authenticity check.  It's not immediately clear that this text needs to change, though if it's also not a key point, perhaps it is easier to just drop the mention rather than think about it more, though I'd also be happy to see discussion of issues with streaming decryption in the security considerations section.

Section 5.3

   A CBOR-based protocol MUST specify which of these options its
   decoders take, for each kind of invalid item they might encounter.

Are the lists of types of validity error presented in the following subsections exhaustive for the respective data models?  If so, it might be worth mentioning that explicitly.

Section 5.4

   *  It can report an error (and not return data).  Note that this
      error is not a validity error per se.  This kind of error is more
      likely to be raised by a decoder that would be performing validity
      checking if this were a known case.

(soapbox) Could we maybe be a little less encouraging of this behavior?
I am remembering horror stories of TLS stacks that did this for extension types, which is an interoperability nightmare.  I recognize that there are cases where it is the desired behavior, but in the general case tags are an extensibility point and we shouldn't encourage that joint to rust shut.

Section 5.6.1

   As discussed in Section 2.2, specific data models can make values
   equivalent for the purpose of comparing map keys that are distinct in
   the generic data model.  Note that this implies that a generic
   decoder may deliver a decoded map to an application that needs to be
   checked for duplicate map keys by that application (alternatively,
   the decoder may provide a programming interface to perform this
   service for the application).  Specific data models cannot
   distinguish values for map keys that are equal for this purpose at
   the generic data model level.

This last bit seems like something that is forbidden by the protocol (vs "cannot"); I wonder if a slight rewording is in order.

Section 6.2

   *  Numbers with fractional parts are represented as floating-point
      values, performing the decimal-to-binary conversion based on the
      precision provided by IEEE 754 binary64.  Then, when encoding in

I forget if this conversion requires round-to-nearest or if multiple rounding modes are available (the latter would of course be problematic if we proceed on to the "can be represented in smaller float without changing value" step).
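
For reference, the "smaller float without changing value" step I have in mind can be sketched as a round-trip test.  This assumes 'float' is IEEE 754 binary32 and 'double' is binary64, which C itself does not guarantee; the name fits_in_binary32 is hypothetical, and NaN handling is deliberately left out.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if d can be represented as a
 * binary32 value without changing its value.
 * Assumption: float is binary32 and double is binary64. */
static bool fits_in_binary32(double d)
{
    if (isnan(d))
        return false;   /* NaN payloads need separate treatment */
    if (isinf(d))
        return true;    /* infinities exist in every format */
    if (fabs(d) > FLT_MAX)
        return false;   /* out of range; avoid an undefined conversion */
    return (double)(float)d == d;
}
```

The outcome of this test depends on the exact binary64 value produced by the earlier decimal-to-binary conversion, which is why the rounding mode used there matters.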

Section 8

   The notation borrows the JSON syntax for numbers (integer and
   floating-point), True (>true<), False (>false<), Null (>null<), UTF-8

(soapbox) Is literal '>' and '<' really the best quoting strategy here (and later on)?

Section 9.1, 9.2

What guidance can we give to the experts?

Section 9.3

   Applications that use this media type:  None yet, but it is expected
      that this format will be deployed in protocols and applications.

I don't believe this to be currently accurate.

   Additional information:  *  Magic number(s): n/a

I guess 0xd9d9f7 doesn't count, then?
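
For what it's worth, the check I have in mind is the three-byte prefix produced by the self-described CBOR tag (tag number 55799).  A minimal sketch, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the buffer starts with the
 * encoding of tag number 55799, i.e. the bytes 0xd9 0xd9 0xf7. */
static bool starts_with_self_described_cbor(const unsigned char *buf,
                                            size_t len)
{
    return len >= 3 && buf[0] == 0xd9 && buf[1] == 0xd9 && buf[2] == 0xf7;
}
```

Since the tag is optional, its absence says nothing about whether the data is CBOR.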

Section 9.4

   The CoAP Content-Format for CBOR is defined in
   [IANA.core-parameters]:

Is "defined in" the right way to word this?

Section 10

I guess the attack where you use indefinite-length encoding to achieve total 'n' greater than 2**64 is not really practical at present...

Please add a mention of the risks of mixing a constrained decoder with a variant (non-preferred-serialization) encoder, as alluded to in Section 4.1.

I also mention this down in G.3, but there seem to be some relevant considerations regarding whether/when bignums and integers of the same value are considered to be equivalent, in particular that the situation is different depending on the data model in use.  This could probably fit nicely into general discussion of handling the multiple possible serializations of various data items.

I would consider (but am not sure if I would end up adding) a mention that CBOR can convey time values, and thus that protocols using CBOR to convey time values are likely to rely on a source of accurate time.

I might incorporate by reference the RFC 4648 security considerations since we talk about base64 in several places.

Protocols using CBOR text strings will likely have internationalization considerations; whether CBOR itself should mention this is not entirely clear to me.

The potential loss of (e.g., type) information when converting from CBOR to JSON is probably worth a mention, noting that applications performing such conversions should consider whether they are affected and/or it's desired to include specific type information in the generated JSON.

   numbers may exceed linear effort.  Also, some hash-table
   implementations that are used by decoders to build in-memory
   representations of maps can be attacked to spend quadratic effort,
   unless a secret key (see Section 7 of [SIPHASH]) or some other
   mitigation is employed.  Such superlinear efforts can be exploited by

It seems likely that an alternate reference not behind a paywall would be usable to make this point.

Section 11.2

Would not [PCRE] need to be normative (if that functionality remains, per the DISCUSS)?

Appendix A

[I did not verify the examples.]

   ATTIC FIFTY STATERS).  (Note that all these single-character strings
   could also be represented in native UTF-8 in diagnostic notation,
   just not in an ASCII-only specification like the present one.)  In

The present specification is not ASCII-only...

Appendix C

     return 0;                     // no break out

Should this be 'return mt'?  IIUC the return value is a message type or -1 for the break code, and errors are indicated out of band via fail().

   void encode_sint(int64_t n) {
     uint64t ui = n >> 63;    // extend sign to whole length
     mt = ui & 0x20;          // extract major type

If this is supposed to be C, you probably want to declare mt.

Appendix F

   *  major type 7, additional information 24, value < 32 (incorrect or
      incorrectly encoded simple type)

I see "incorrectly encoded", but I'm not sure I understand what is meant by "incorrect simple type".
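
My understanding of the rule, as a minimal sketch (the function name is hypothetical): when the initial byte is 0xf8 (major type 7, additional information 24), the following byte must be at least 32.

```c
#include <stdbool.h>

/* Hypothetical helper: checks only the rule for two-byte simple values.
 * Any other initial byte is outside the scope of this check and is
 * reported as acceptable here. */
static bool two_byte_simple_ok(unsigned char initial, unsigned char following)
{
    if (initial != 0xf8)
        return true;
    /* Values 0..31 must not be encoded in the two-byte form. */
    return following >= 32;
}
```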

Appendix G.3

   integers and floating point values.  Experience from implementation
   and use now suggested that the separation between these two number
   domains should be more clearly drawn in the document; language that
   suggested an integer could seamlessly stand in for a floating point
   value was removed.  Also, a suggestion (based on I-JSON [RFC7493])

So instead we have skew between the generic data model and the extended model, where the generic model thinks some numbers are different that the extended model treats as the same.  Should we mention that here as well?

_______________________________________________
CBOR mailing list
CBOR@ietf.org
https://www.ietf.org/mailman/listinfo/cbor