[Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Benjamin Kaduk via Datatracker <noreply@ietf.org> Mon, 07 September 2020 22:06 UTC

From: Benjamin Kaduk via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-cbor-7049bis@ietf.org, cbor-chairs@ietf.org, cbor@ietf.org, Francesca Palombini <francesca.palombini@ericsson.com>, francesca.palombini@ericsson.com
Reply-To: Benjamin Kaduk <kaduk@mit.edu>
Message-ID: <159951641517.13535.7424396818917958932@ietfa.amsl.com>
Date: Mon, 07 Sep 2020 15:06:55 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/IAmYK6adwfbo9SBYeniLDzOGDhU>
Subject: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Benjamin Kaduk has entered the following ballot position for
draft-ietf-cbor-7049bis-14: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Thanks for this document; it's generally well-written and the changes
since 7049 are helpful.  I do have a few points that may need discussion
before publication, though.

Let's discuss whether the framing of tag number 35 for "regular
expressions that are roughly in [PCRE] form or a version of the
JavaScript regular expression syntax" meets the interoperability
expectations for Internet Standard status (bearing in mind that we are
defining a data format and not a protocol).  I note that it is okay
to leave the codepoint allocated with the current meaning and the
previous document as its reference, but decline to discuss it in the
document going for STD (we are in the process of doing that with COSE
countersignatures at the moment).

The example in Section 5 of "the item is a map that has byte strings for
keys and contains at least one pair whose key is 0xab01" seems to be in
violation of the definition of a valid map, since applications are not
allowed to rely on invalid behavior.  (That is, the implied "more than
one pair whose key is 0xab01" would be invalid.)

I think that the new deterministic encoding rules for sorting map keys
should be clear about whether "no content" sorts before or after
"content present" -- that is, how 0x10 and 0x1020 are ordered when the
0x10 byte is identical and we have to compare <nothing> with 0x20.
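
To make the question concrete, here is a minimal sketch of one possible
answer, assuming the conventional lexicographic rule in which a proper
prefix ("no content") sorts before any longer string.  The function
name cbor_key_cmp() is hypothetical and the prefix rule is an
assumption for illustration, not something the draft text states.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: bytewise lexicographic comparison of two
 * encoded map keys.  Returns <0, 0 or >0, in the style of memcmp().
 * ASSUMPTION: if one key is a proper prefix of the other, the shorter
 * key ("no content") sorts first. */
int cbor_key_cmp(const unsigned char *a, size_t alen,
                 const unsigned char *b, size_t blen) {
  size_t n = alen < blen ? alen : blen;   /* length of common span */
  int c = n ? memcmp(a, b, n) : 0;        /* compare the common span */
  if (c != 0)
    return c;                             /* first differing byte decides */
  return (alen > blen) - (alen < blen);   /* otherwise shorter sorts first */
}
```

Under this assumption 0x10 sorts before 0x1020; the opposite
convention would only need the last line changed, which is why the
document should say which one it means.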

The discussion in Appendix C suggests that C (programming language)
implementations all use two's-complement representation of signed
integers; this requirement is present in POSIX but not C itself (I
verified this for C99 and C11).

Additionally, the encode_sint() function (also Appendix C) relies on C
implementation-defined behavior while right-shifting a signed integer.
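
As a rough sketch of one way to avoid both concerns, the sign handling
can be done entirely in unsigned arithmetic, where conversion and
bitwise complement are fully defined by the C standard regardless of
the signed representation.  The name split_sint() and its interface are
hypothetical and are not the draft's encode_sint().

```c
#include <stdint.h>

/* Hypothetical helper: split a signed value into the CBOR major type
 * (0 = unsigned integer, 1 = negative integer) and the unsigned
 * argument to be encoded.  No signed right shift is used. */
unsigned split_sint(int64_t n, uint64_t *arg) {
  uint64_t u = (uint64_t)n;   /* well-defined: value reduced modulo 2^64 */
  if (n < 0) {
    *arg = ~u;                /* equals -1 - n for negative n */
    return 1;                 /* major type 1 */
  }
  *arg = u;
  return 0;                   /* major type 0 */
}
```

For example, -500 yields major type 1 with argument 499, matching the
"-1 minus the argument" rule for negative integers.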

The C decode_half() function in Appendix D assumes that 'int' is wider
than 16 bits (since assigning a value to an int16_t variable when the
value is not representable in int16_t incurs implementation-defined
behavior).  Given that this spec is specifically targeting constrained
devices, it's not clear that such an assumption is justified.  It also
right shifts a signed integer, incurring the same implementation-defined
behavior mentioned above, and the bitwise AND against 0x8000 is likewise
problematic for an int16_t.
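
For illustration only, a variant that sidesteps these issues could keep
every intermediate value in 'unsigned', which C guarantees to be at
least 16 bits wide.  This is a sketch under that assumption (plus 8-bit
bytes and a <math.h> providing INFINITY and NAN); the name
decode_half_portable() is hypothetical.

```c
#include <math.h>

/* Hypothetical portable variant: decode a big-endian IEEE 754 binary16
 * value from two bytes using only unsigned arithmetic. */
double decode_half_portable(const unsigned char *halfp) {
  /* Cast before shifting so the shift happens in unsigned, not int. */
  unsigned half = ((unsigned)halfp[0] << 8) | halfp[1];
  unsigned exp = (half >> 10) & 0x1fu;    /* 5-bit biased exponent */
  unsigned mant = half & 0x3ffu;          /* 10-bit significand field */
  double val;
  if (exp == 0)
    val = ldexp((double)mant, -24);                        /* subnormal/zero */
  else if (exp != 31)
    val = ldexp((double)(mant + 1024u), (int)exp - 25);    /* normal */
  else
    val = mant == 0 ? INFINITY : NAN;                      /* inf or NaN */
  return (half & 0x8000u) ? -val : val;   /* apply the sign bit last */
}
```

With this shape, the bytes 0x3c 0x00 decode to 1.0 and 0xc0 0x00 to
-2.0 without relying on the width of 'int'.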


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Is there a comprehensive list of things that generic (en/de)coders need
to document their behavior for (e.g., how they handle duplicate map
keys; whether/what validity checking is done, including which tag
numbers are supported)?

We use the expression "simple value" around 30 times, but "simple type
value" only twice (and "simple type" a few other times); are we happy
with the consistency of usage?

Please also note my comments on the IANA considerations in the
per-section comments; at least the first couple are fairly
consequential.

I'm pretty sympathetic to the secdir reviewer's desire for guidance on
how to implement validity checking.  I think it would be possible to
slot this into the existing discussion of validity in §5.3/5.4, possibly as
an additional subsection reiterating that it's required to check the
bits in 5.3.1/5.3.2, and the expectation that such checks are likely to
be incomplete in the face of new tag number allocations.

Section 1.2

   Where bit arithmetic or data types are explained, this document uses
   the notation familiar from the programming language C, except that

In recent memory we've asked for some form of reference for "the
programming language C" (even though the concepts we draw on are likely
to remain invariant for anything called C).

Section 2

   In the basic (un-extended) generic data model, a data item is one of:
   [...]
   *  a sequence of zero or more Unicode code points ("text string")

Hmm, since we use "data item" for both the abstract idea and the
representation-format version, this description is only precise for the
abstract version (the representation is further constrained to UTF-8).
I am not sure whether there is a concise way to accurately express this
state, though.

Section 3

   The initial byte and any additional bytes consumed to construct the
   argument are collectively referred to as the "head" of the data item.

side note: Interesting that we define "head" but do not use "tail" :)

Section 3.3

   meaning, as defined in Table 3.  Like the major types for integers,
   items of this major type do not carry content data; all the
   information is in the initial bytes.

(editorial) The "head", as it were, right?

Section 3.4

   Conceptually, tags are interpreted in the generic data model, not at
   (de-)serialization time.  A small number of tags (specifically, tag
   number 25 and tag number 29) have been registered with semantics that
   may require processing at (de-)serialization time: The decoder needs

I suggest adding additional language to reiterate that this is a
point-in-time statement (and thus that there may be other such tags in
existence).

   This means these tags cannot be implemented on top of every generic
   CBOR encoder/decoder (which might not reflect the serialization order
   for entries in a map at the data model level and vice versa); their
   implementation therefore typically needs to be integrated into the
   generic encoder/decoder.  The definition of new tags with this
   property is NOT RECOMMENDED.

So we should give guidance to the DEs for the registry in question to
that effect?

   IANA allocated tag numbers 65535, 4294967295, and
   18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
   These can be used as a convenience for implementers that want a
   single integer to indicate either that a specific tag is present, or
   the absence of a tag.  That allocation is described in Section 10 of

(editorial) Chasing the reference, I suggest that it is a "single
integer *data structure*" in the implementation's internal
representation; just reading this text alone left me confused as to how
this was intended to be used.

   [I-D.bormann-cbor-notable-tags].  These tags are not intended to
   occur in actual CBOR data items; implementations may flag such an
   occurrence as an error.

I could maybe see this as "MAY".

Section 3.4.2

Thank you for mentioning leap seconds!

   Note that platform types for date/time may include null or undefined
   values, which may also be desirable at an application protocol level.
   While emitting tag number 1 values with non-finite tag content values
   (e.g., with NaN for undefined date/time values or with Infinite for
   an expiry date that is not set) may seem an obvious way to handle
   this, using untagged null or undefined is often a better solution.
   Application protocol designers are encouraged to consider these cases
   and include clear guidelines for handling them.

It's rather unfortunate that the text here doesn't provide any
justification for the claim of "better solution" (or reference to such
justification).

Section 3.4.3

   occurs in a bignum when using preferred serialization).  Note that
   this means the non-preferred choice of a bignum representation
   instead of a basic integer for encoding a number is not intended to
   have application semantics (just as the choice of a longer basic
   integer representation than needed, such as 0x1800 for 0x00 does
   not).

It may be "not intended to", but it does, if you're using a decoder in
the generic data model.  We should be sure to cover the security
considerations of this disparity (and the corresponding need for an
application using CBOR to specify the data model it uses).

Section 3.4.5.3

   Note that tag numbers 33 and 34 differ from 21 and 22 in that the
   data is transported in base-encoded form for the former and in raw
   byte string form for the latter.

Do we want to mention tag 23 as well (as being the raw byte string)?

Section 4.2.1

[I did not validate the hex-encoded IEEE754 against the decimal values.]

   *  Indefinite-length items MUST NOT appear.  They can be encoded as
      definite-length items instead.

One could perhaps argue that a deterministic encoding procedure that
uses indefinite-length items is possible, and even useful in some cases.
This might argue for moving this requirement to Section 4.2.2's
list of "additional considerations".  That said, an application is not
obligated to use these core rules and can define its own rules if
needed, so I don't object to this requirement.

Section 4.2.3

   (Although [RFC7049] used the term "Canonical CBOR" for its form of
   requirements on deterministic encoding, this document avoids this
   term because "canonicalization" is often associated with specific
   uses of deterministic encoding only.  The terms are essentially
   interchangeable, however, and the set of core requirements in this
   document could also be called "Canonical CBOR", while the length-
   first-ordered version of that could be called "Old Canonical CBOR".)

If this document avoids the term, maybe the final sentence should not be
present?

Section 5

   CBOR-based protocols MUST specify how their decoders handle invalid
   and other unexpected data.  CBOR-based protocols MAY specify that
   they treat arbitrary valid data as unexpected.  Encoders for CBOR-
   based protocols MUST produce only valid items, that is, the protocol
   cannot be designed to make use of invalid items.  An encoder can be

Just to check: my interpretation is that CBOR Sequences are compatible
with this requirement, since they use valid data items and just encode
them in sequence.  Right?

Section 5.1

   Other decoders can present partial information about a top-level data
   item to an application, such as the nested data items that could
   already be decoded, or even parts of a byte string that hasn't
   completely arrived yet.

This has potential to make some security types antsy, if coupled with
encryption mechanisms that release alleged plaintext prior to the
authenticity check.  It's not immediately clear that this text needs to
change.  If it's not a key point, perhaps it is easier to just drop the
mention rather than think about it more; I'd also be happy to see
discussion of issues with streaming decryption in the security
considerations section.

Section 5.3

   A CBOR-based protocol MUST specify which of these options its
   decoders take, for each kind of invalid item they might encounter.

Are the lists of types of validity error presented in the following
subsections exhaustive for the respective data models?  If so, it might
be worth mentioning that explicitly.

Section 5.4

   *  It can report an error (and not return data).  Note that this
      error is not a validity error per se.  This kind of error is more
      likely to be raised by a decoder that would be performing validity
      checking if this were a known case.

(soapbox) Could we maybe be a little less encouraging of this behavior?
I am remembering horror stories of TLS stacks that did this for
extension types, which is an interoperability nightmare.  I recognize
that there are cases where it is the desired behavior, but in the
general case tags are an extensibility point and we shouldn't encourage
that joint to rust shut.

Section 5.6.1

   As discussed in Section 2.2, specific data models can make values
   equivalent for the purpose of comparing map keys that are distinct in
   the generic data model.  Note that this implies that a generic
   decoder may deliver a decoded map to an application that needs to be
   checked for duplicate map keys by that application (alternatively,
   the decoder may provide a programming interface to perform this
   service for the application).  Specific data models cannot
   distinguish values for map keys that are equal for this purpose at
   the generic data model level.

This last bit seems like something that is forbidden by the protocol (vs
"cannot"); I wonder if a slight rewording is in order.

Section 6.2

   *  Numbers with fractional parts are represented as floating-point
      values, performing the decimal-to-binary conversion based on the
      precision provided by IEEE 754 binary64.  Then, when encoding in

I forget if this conversion requires round-to-nearest or if multiple
rounding modes are available (the latter would of course be problematic
if we proceed on to the "can be represented in smaller float without
changing value" step).

Section 8

   The notation borrows the JSON syntax for numbers (integer and
   floating-point), True (>true<), False (>false<), Null (>null<), UTF-8

(soapbox) Are literal '>' and '<' really the best quoting strategy here
(and later on)?

Section 9.1, 9.2

What guidance can we give to the experts?

Section 9.3

   Applications that use this media type:  None yet, but it is expected
      that this format will be deployed in protocols and applications.

I don't believe this to be currently accurate.

   Additional information:  *  Magic number(s): n/a

I guess 0xd9d9f7 doesn't count, then?

Section 9.4

   The CoAP Content-Format for CBOR is defined in
   [IANA.core-parameters]:

Is "defined in" the right way to word this?

Section 10

I guess the attack where you use indefinite-length encoding to achieve
total 'n' greater than 2**64 is not really practical at present...

Please add a mention of the risks of mixing a constrained decoder with a
variant (non-preferred-serialization) encoder, as alluded to in Section
4.1.

I also mention this down in G.3, but there seem to be some relevant
considerations regarding whether/when bignums and integers of the same
value are considered to be equivalent, in particular that the situation
is different depending on the data model in use.  This could probably
fit nicely into general discussion of handling the multiple possible
serializations of various data items.

I would consider (but am not sure if I would end up adding) a mention
that CBOR can convey time values, and thus that protocols using CBOR to
convey time values are likely to rely on a source of accurate time.

I might incorporate by reference the RFC 4648 security considerations
since we talk about base64 in several places.

Protocols using CBOR text strings will likely have internationalization
considerations; whether CBOR itself should mention this is not entirely
clear to me.

The potential loss of (e.g., type) information when converting from CBOR
to JSON is probably worth a mention, noting that applications performing
such conversions should consider whether they are affected and/or it's
desired to include specific type information in the generated JSON.

   numbers may exceed linear effort.  Also, some hash-table
   implementations that are used by decoders to build in-memory
   representations of maps can be attacked to spend quadratic effort,
   unless a secret key (see Section 7 of [SIPHASH]) or some other
   mitigation is employed.  Such superlinear efforts can be exploited by

It seems likely that an alternate reference not behind a paywall would
be usable to make this point.

Section 11.2

Would not [PCRE] need to be normative (if that functionality remains,
per the DISCUSS)?

Appendix A

[I did not verify the examples.]

   ATTIC FIFTY STATERS).  (Note that all these single-character strings
   could also be represented in native UTF-8 in diagnostic notation,
   just not in an ASCII-only specification like the present one.)  In

The present specification is not ASCII-only...

Appendix C

     return 0;                     // no break out

Should this be 'return mt'?  IIUC the return value is a major type
or -1 for the break code, and errors are indicated out of band via
fail().

   void encode_sint(int64_t n) {
     uint64t ui = n >> 63;    // extend sign to whole length
     mt = ui & 0x20;          // extract major type

If this is supposed to be C, you probably want to declare mt.

Appendix F

   *  major type 7, additional information 24, value < 32 (incorrect or
      incorrectly encoded simple type)

I see "incorrectly encoded", but I'm not sure I understand what is meant
by "incorrect simple type".

Appendix G.3

   integers and floating point values.  Experience from implementation
   and use now suggested that the separation between these two number
   domains should be more clearly drawn in the document; language that
   suggested an integer could seamlessly stand in for a floating point
   value was removed.  Also, a suggestion (based on I-JSON [RFC7493])

So instead we have skew between the generic data model and the extended
model, where the generic model thinks some numbers are different that the
extended model treats as the same.  Should we mention that here as well?