Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Ben,

Nothing earth-shattering.

-----Original Message-----
From: Benjamin Kaduk <kaduk@mit.edu> 
Sent: Monday, September 7, 2020 4:36 PM
To: Jim Schaad <ietf@augustcellars.com>
Cc: 'The IESG' <iesg@ietf.org>; cbor@ietf.org;
draft-ietf-cbor-7049bis@ietf.org; cbor-chairs@ietf.org;
francesca.palombini@ericsson.com
Subject: Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14:
(with DISCUSS and COMMENT)

Hi Jim,

Thanks for the replies (inline).

On Mon, Sep 07, 2020 at 04:18:38PM -0700, Jim Schaad wrote:
> 
> 
> -----Original Message-----
> From: CBOR <cbor-bounces@ietf.org> On Behalf Of Benjamin Kaduk via 
> Datatracker
> Sent: Monday, September 7, 2020 3:07 PM
> To: The IESG <iesg@ietf.org>
> Cc: cbor@ietf.org; draft-ietf-cbor-7049bis@ietf.org; 
> cbor-chairs@ietf.org; francesca.palombini@ericsson.com
> Subject: [Cbor] Benjamin Kaduk's Discuss on 
> draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)
> 
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-cbor-7049bis-14: Discuss
> 
> When responding, please keep the subject line intact and reply to all 
> email addresses included in the To and CC lines. (Feel free to cut 
> this introductory paragraph, however.)
> 
> 
> Please refer to 
> https://www.ietf.org/iesg/statement/discuss-criteria.html
> for more information about IESG DISCUSS and COMMENT positions.
> 
> 
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis/
> 
> 
> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> Thanks for this document; it's generally well-written and the changes
since 7049 are helpful.  I do have a few points that may need discussion
before publication, though.
> 
> Let's discuss whether the framing of tag number 35 for "regular
expressions that are roughly in [PCRE] form or a version of the JavaScript
regular expression syntax" meets the interoperability expectations for
Internet Standard status (bearing in mind that we are defining a data format
and not a protocol).  I note that it is okay to leave the codepoint
allocated with the current meaning and the previous document as its
reference, but decline to discuss it in the document going for STD (we are
in the process of doing that with COSE countersignatures at the moment).
> 
> The example in Section 5 of "the item is a map that has byte strings 
> for keys and contains at least one pair whose key is 0xab01" seems to 
> be in violation of the definition of a valid map, since applications 
> are not allowed to rely on invalid behavior.  (That is, the implied 
> "more than one pair whose key is 0xab01" would be invalid.)
> 
> [JLS] This was not how I read the text; I read it as not zero.  However, it
would make sense to make this edit.

I agree that if you read it as "not zero" it makes sense, but rewording
seems prudent.

> I think that the new deterministic encoding rules for sorting map keys 
> should be clear about whether "no content" sorts before or after 
> "content present" -- that is, how 0x10 and 0x1020 are ordered when the
> 0x10 byte is identical and we have to compare <nothing> with 0x20.
> 
> [JLS] I don't see this.  Going out to Wikipedia (the true answer for
everything), the rule that is listed there is:
> "If two words have different lengths, the usual lexicographical order pads
the shorter one with "blanks" (a special symbol that is treated as smaller
than every element of A) until the words are the same length, and then the
words are compared as in the previous case."
> 
> This means that if A is a prefix of B and |A| < |B|, A precedes B in
lexicographic order.
> [/JLS]

That was my expectation as well, but is this truly "textbook knowledge"
that we can assume is known without precondition or reference?

[JLS]  I have a firm belief that almost everybody can figure out how to do
sorting of books by either author or title.  Caveat the argument over the
question of whether "The" and "A" count as words for this purpose.  I don't think
this qualifies as something that you learn from a textbook, I think it is
something people just know.
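
For concreteness, the "prefix sorts first" rule quoted from Wikipedia above can be sketched in C. This is only an illustration, not text from the draft; the function name key_cmp and its signature are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Bytewise lexicographic comparison of two encoded keys.
 * Returns <0 if a sorts first, >0 if b sorts first, 0 if identical.
 * key_cmp is a hypothetical helper name, not taken from the draft. */
int key_cmp(const unsigned char *a, size_t alen,
            const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int c = n ? memcmp(a, b, n) : 0;

    if (c != 0)
        return c;                /* first differing byte decides */
    /* Common prefix is identical: the shorter string sorts first. */
    return (alen > blen) - (alen < blen);
}
```

With Ben's example bytes, key_cmp on 0x10 versus 0x1020 returns a negative value, so "no content" sorts before 0x20.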

> The discussion in Appendix C suggests that C (programming language)
implementations all use two's-complement representation of signed integers;
this requirement is present in POSIX but not C itself (I verified this for
C99 and C11).
> 
> Additionally, the encode_sint() function (also Appendix C) relies on C
implementation-defined behavior while right-shifting a signed integer.
> 
> The C decode_half() function in Appendix D assumes that 'int' is wider 
> than 16 bits (since assigning a value to an int16_t variable when the 
> value is not representable in int16_t incurs implementation-defined 
> behavior).  Given that this spec is specifically targeting
> constrained devices, it's not clear that such an assumption is 
> justified.  (It also right shifts a signed integer, incurring the same 
> implementation-defined behavior mentioned above.  (The bitwise AND 
> against 0x8000 is also problematic for an int16_t.))
> 
> [JLS] I'll let others argue this.  But I think that if you don't use
two's-complement then you most likely break a lot of code.

I am assuming that the resolution for the C vs POSIX point will just be to
say that we further assume two's complement integer representations; AFAIK
(or, rather, as far as Wikipedia knows), some UNIVAC descendants are the
only things using one's complement, and nobody uses sign-and-magnitude.

(The other two are not really dependent on signed integer representation.)
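
One way to sidestep both the int-width and the signed-shift concerns for decode_half() is to stay in unsigned arithmetic throughout. The following is a minimal sketch under that assumption, not the draft's Appendix D text; it relies on ldexp, INFINITY and NAN from <math.h>.

```c
#include <math.h>

/* Decode an IEEE 754 binary16 value stored big-endian in halfp[0..1].
 * Sketch only: every intermediate is unsigned, and unsigned is
 * guaranteed to hold at least 16 bits, so no signed shift or
 * out-of-range signed conversion occurs. */
double decode_half(const unsigned char *halfp)
{
    unsigned half = ((unsigned)halfp[0] << 8) | halfp[1];
    unsigned exp  = (half >> 10) & 0x1fu;
    unsigned mant = half & 0x3ffu;
    double val;

    if (exp == 0)
        val = ldexp((double)mant, -24);               /* zero, subnormal */
    else if (exp != 31)
        val = ldexp((double)(mant + 1024u), (int)exp - 25);
    else
        val = (mant == 0) ? INFINITY : NAN;
    return (half & 0x8000u) ? -val : val;
}
```

For example, the bytes 0x3c 0x00 decode to 1.0 and 0xc0 0x00 decode to -2.0.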

[JLS] There are still UNIVAC descendants out in the real world?  
jim

-Ben

> 
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
> 
> Is there a comprehensive list of things that generic (en/de)coders need to
document their behavior for (e.g., how they handle duplicate map keys;
whether/what validity checking is done, including which tag numbers are
supported)?
> 
> We use the expression "simple value" around 30 times, but "simple type
value" only twice (and "simple type" a few other times); are we happy with
the consistency of usage?
> 
> Please also note my comments on the IANA considerations in the per-section
comments; at least the first couple are fairly consequential.
> 
> I'm pretty sympathetic to the secdir reviewer's desire for guidance on how
to implement validity checking.  I think it would be possible to slot this
into the existing discussion of validity in §5.3/5.4, possibly as an
additional subsection reiterating that it's required to check the bits in
5.3.1/5.3.2, and the expectation that such checks are likely to be
incomplete in the face of new tag number allocations.
> 
> Section 1.2
> 
>    Where bit arithmetic or data types are explained, this document uses
>    the notation familiar from the programming language C, except that
> 
> In recent memory we've asked for some form of reference for "the
programming language C" (even though the concepts we draw on are likely to
remain invariant for anything called C).
> 
> Section 2
> 
>    In the basic (un-extended) generic data model, a data item is one of:
>    [...]
>    *  a sequence of zero or more Unicode code points ("text string")
> 
> Hmm, since we use "data item" for both the abstract idea and the
representation-format version, this description is only precise for the
abstract version (the representation is further constrained to UTF-8).
> I am not sure whether there is a concise way to accurately express this
state, though.
> 
> Section 3
> 
>    The initial byte and any additional bytes consumed to construct the
>    argument are collectively referred to as the "head" of the data item.
> 
> side note: Interesting that we define "head" but do not use "tail" :)
> 
> Section 3.3
> 
>    meaning, as defined in Table 3.  Like the major types for integers,
>    items of this major type do not carry content data; all the
>    information is in the initial bytes.
> 
> (editorial) The "head", as it were, right?
> 
> Section 3.4
> 
>    Conceptually, tags are interpreted in the generic data model, not at
>    (de-)serialization time.  A small number of tags (specifically, tag
>    number 25 and tag number 29) have been registered with semantics that
>    may require processing at (de-)serialization time: The decoder needs
> 
> I suggest adding additional language to reiterate that this is a
point-in-time statement (and thus that there may be other such tags in
existence).
> 
>    This means these tags cannot be implemented on top of every generic
>    CBOR encoder/decoder (which might not reflect the serialization order
>    for entries in a map at the data model level and vice versa); their
>    implementation therefore typically needs to be integrated into the
>    generic encoder/decoder.  The definition of new tags with this
>    property is NOT RECOMMENDED.
> 
> So we should give guidance to the DEs for the registry in question to that
effect?
> 
>    IANA allocated tag numbers 65535, 4294967295, and
>    18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
>    These can be used as a convenience for implementers that want a
>    single integer to indicate either that a specific tag is present, or
>    the absence of a tag.  That allocation is described in Section 10 of
> 
> (editorial) Chasing the reference, I suggest that it is a "single integer
*data structure*" in the implementation's internal representation; just
reading this text alone left me confused as to how this was intended to be
used.
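
A sketch of what such an internal representation might look like, to make the intended use concrete. The struct layout and the names TAG_ABSENT, item_t and has_tag are all hypothetical, chosen for this illustration; only the reserved value itself comes from the quoted text.

```c
#include <stdbool.h>
#include <stdint.h>

/* 18446744073709551615: the 64-bit all-ones tag number quoted above,
 * used here as an in-memory "no tag" marker. */
#define TAG_ABSENT UINT64_MAX

typedef struct {
    uint64_t tag;    /* tag number, or TAG_ABSENT if the item is untagged */
    int64_t  value;  /* hypothetical payload, an integer for simplicity */
} item_t;

/* One integer field answers both "is there a tag?" and "which one?". */
bool has_tag(const item_t *it)
{
    return it->tag != TAG_ABSENT;
}
```

The point of the reservation is that no real data item should ever carry this tag number, so the marker cannot collide with a tag that actually arrived on the wire.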
> 
>    [I-D.bormann-cbor-notable-tags].  These tags are not intended to
>    occur in actual CBOR data items; implementations may flag such an
>    occurrence as an error.
> 
> I could maybe see this as "MAY".
> 
> Section 3.4.2
> 
> Thank you for mentioning leap seconds!
> 
>    Note that platform types for date/time may include null or undefined
>    values, which may also be desirable at an application protocol level.
>    While emitting tag number 1 values with non-finite tag content values
>    (e.g., with NaN for undefined date/time values or with Infinite for
>    an expiry date that is not set) may seem an obvious way to handle
>    this, using untagged null or undefined is often a better solution.
>    Application protocol designers are encouraged to consider these cases
>    and include clear guidelines for handling them.
> 
> It's rather unfortunate that the text here doesn't provide any
justification for the claim of "better solution" (or reference to such
justification).
> 
> Section 3.4.3
> 
>    occurs in a bignum when using preferred serialization).  Note that
>    this means the non-preferred choice of a bignum representation
>    instead of a basic integer for encoding a number is not intended to
>    have application semantics (just as the choice of a longer basic
>    integer representation than needed, such as 0x1800 for 0x00 does
>    not).
> 
> It may be "not intended to", but it does, if you're using a decoder in the
generic data model.  We should be sure to cover the security considerations
of this disparity (and the corresponding need for an application using CBOR
to specify the data model it uses).
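
As an illustration of the 0x1800 versus 0x00 case, a decoder that wants to notice non-preferred integer heads could apply a check along these lines. This is a sketch for major types 0 and 1 only; the function name is_preferred_head and its parameters are assumptions for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* ai:  the 5-bit additional information from the initial byte.
 * arg: the argument value decoded from the head.
 * Returns true when the head uses the shortest available form. */
bool is_preferred_head(unsigned ai, uint64_t arg)
{
    switch (ai) {
    case 24: return arg >= 24;          /* 1-byte argument */
    case 25: return arg > UINT8_MAX;    /* 2-byte argument */
    case 26: return arg > UINT16_MAX;   /* 4-byte argument */
    case 27: return arg > UINT32_MAX;   /* 8-byte argument */
    default: return ai < 24;            /* argument carried in ai itself */
    }
}
```

Here 0x1800 has ai = 24 and arg = 0, so the check reports it as non-preferred, while 0x00 (ai = 0) passes.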
> 
> Section 3.4.5.3
> 
>    Note that tag numbers 33 and 34 differ from 21 and 22 in that the
>    data is transported in base-encoded form for the former and in raw
>    byte string form for the latter.
> 
> Do we want to mention tag 23 as well (as being the raw byte string)?
> 
> Section 4.2.1
> 
> [I did not validate the hex-encoded IEEE754 against the decimal 
> values.]
> 
>    *  Indefinite-length items MUST NOT appear.  They can be encoded as
>       definite-length items instead.
> 
> One could perhaps argue that a deterministic encoding procedure that uses
indefinite-length items is possible, and even useful in some cases.
> This might argue for moving this requirement to Section 4.2.2's list of
"additional considerations".  That said, an application is not obligated to
use these core rules and can define its own rules if needed, so I don't
object to this requirement.
> 
> Section 4.2.3
> 
>    (Although [RFC7049] used the term "Canonical CBOR" for its form of
>    requirements on deterministic encoding, this document avoids this
>    term because "canonicalization" is often associated with specific
>    uses of deterministic encoding only.  The terms are essentially
>    interchangeable, however, and the set of core requirements in this
>    document could also be called "Canonical CBOR", while the length-
>    first-ordered version of that could be called "Old Canonical CBOR".)
> 
> If this document avoids the term, maybe the final sentence should not be
present?
> 
> Section 5
> 
>    CBOR-based protocols MUST specify how their decoders handle invalid
>    and other unexpected data.  CBOR-based protocols MAY specify that
>    they treat arbitrary valid data as unexpected.  Encoders for CBOR-
>    based protocols MUST produce only valid items, that is, the protocol
>    cannot be designed to make use of invalid items.  An encoder can be
> 
> Just to check: my interpretation is that CBOR Sequences are compatible
with this requirement, since they use valid data items and just encode them
in sequence.  Right?
> 
> Section 5.1
> 
>    Other decoders can present partial information about a top-level data
>    item to an application, such as the nested data items that could
>    already be decoded, or even parts of a byte string that hasn't
>    completely arrived yet.
> 
> This has potential to make some security types antsy, if coupled with
encryption mechanisms that release alleged plaintext prior to authenticity
check.  It's not immediately clear that this text needs to change.  If it's
not a key point, perhaps it is easier to just drop the mention rather than
think about it more, though I'd also be happy to see discussion of issues
with streaming decryption in the security considerations section.
> 
> Section 5.3
> 
>    A CBOR-based protocol MUST specify which of these options its
>    decoders take, for each kind of invalid item they might encounter.
> 
> Are the lists of types of validity error presented in the following
subsections exhaustive for the respective data models?  If so, it might be
worth mentioning that explicitly.
> 
> Section 5.4
> 
>    *  It can report an error (and not return data).  Note that this
>       error is not a validity error per se.  This kind of error is more
>       likely to be raised by a decoder that would be performing validity
>       checking if this were a known case.
> 
> (soapbox) Could we maybe be a little less encouraging of this behavior?
> I am remembering horror stories of TLS stacks that did this for extension
types, which is an interoperability nightmare.  I recognize that there are
cases where it is the desired behavior, but in the general case tags are an
extensibility point and we shouldn't encourage that joint to rust shut.
> 
> Section 5.6.1
> 
>    As discussed in Section 2.2, specific data models can make values
>    equivalent for the purpose of comparing map keys that are distinct in
>    the generic data model.  Note that this implies that a generic
>    decoder may deliver a decoded map to an application that needs to be
>    checked for duplicate map keys by that application (alternatively,
>    the decoder may provide a programming interface to perform this
>    service for the application).  Specific data models cannot
>    distinguish values for map keys that are equal for this purpose at
>    the generic data model level.
> 
> This last bit seems like something that is forbidden by the protocol (vs
"cannot"); I wonder if a slight rewording is in order.
> 
> Section 6.2
> 
>    *  Numbers with fractional parts are represented as floating-point
>       values, performing the decimal-to-binary conversion based on the
>       precision provided by IEEE 754 binary64.  Then, when encoding in
> 
> I forget if this conversion requires round-to-nearest or if multiple
rounding modes are available (the latter would of course be problematic if
we proceed on to the "can be represented in smaller float without changing
value" step).
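
The "smaller float without changing value" step can be sketched as a round-trip test. This is an illustration only, assuming that float is IEEE 754 binary32 and double is binary64 on the target; the name fits_in_binary32 is made up for this example, and NaN payloads are ignored.

```c
#include <float.h>
#include <stdbool.h>

/* True if d can be stored as binary32 and read back unchanged. */
bool fits_in_binary32(double d)
{
    if (d != d)
        return true;                    /* NaN: payload ignored here */
    if (d > FLT_MAX || d < -FLT_MAX)
        /* Out of binary32 finite range: only infinities still fit. */
        return d > DBL_MAX || d < -DBL_MAX;
    return (double)(float)d == d;       /* exact round trip? */
}
```

Whether a given decimal input passes this test depends on which binary64 value the earlier decimal-to-binary conversion produced, which is why the rounding mode of that conversion matters.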
> 
> Section 8
> 
>    The notation borrows the JSON syntax for numbers (integer and
>    floating-point), True (>true<), False (>false<), Null (>null<), UTF-8
> 
> (soapbox) Is literal '>' and '<' really the best quoting strategy here
(and later on)?
> 
> Section 9.1, 9.2
> 
> What guidance can we give to the experts?
> 
> Section 9.3
> 
>    Applications that use this media type:  None yet, but it is expected
>       that this format will be deployed in protocols and applications.
> 
> I don't believe this to be currently accurate.
> 
>    Additional information:  *  Magic number(s): n/a
> 
> I guess 0xd9d9f7 doesn't count, then?
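
For reference, 0xd9d9f7 is the three-byte head of tag number 55799 (self-described CBOR). A receiver that wanted to treat it as a magic number could sniff for it roughly as follows; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the buffer starts with the head of tag 55799:
 * 0xd9 (major type 6, 2-byte argument) followed by 0xd9 0xf7. */
bool has_self_described_prefix(const unsigned char *buf, size_t len)
{
    return len >= 3 &&
           buf[0] == 0xd9 && buf[1] == 0xd9 && buf[2] == 0xf7;
}
```

Since the tag is optional, the absence of the prefix says nothing about whether the data is CBOR.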
> 
> Section 9.4
> 
>    The CoAP Content-Format for CBOR is defined in
>    [IANA.core-parameters]:
> 
> Is "defined in" the right way to word this?
> 
> Section 10
> 
> I guess the attack where you use indefinite-length encoding to achieve
total 'n' greater than 2**64 is not really practical at present...
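
Practical or not, a decoder that accumulates chunk lengths of an indefinite-length string can rule the wraparound out cheaply. A minimal sketch, assuming the running total is kept in a uint64_t; the helper name add_chunk_len is an assumption for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Add chunk_len to *total, refusing if the sum would not fit in 64 bits.
 * On failure *total is left unchanged. */
bool add_chunk_len(uint64_t *total, uint64_t chunk_len)
{
    if (chunk_len > UINT64_MAX - *total)
        return false;           /* sum would wrap past 2**64 - 1 */
    *total += chunk_len;
    return true;
}
```

A real decoder would likely enforce a much smaller application limit long before this check could trigger.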
> 
> Please add a mention of the risks of mixing a constrained decoder with a
variant (non-preferred-serialization) encoder, as alluded to in Section 4.1.
> 
> I also mention this down in G.3, but there seem to be some relevant
considerations regarding whether/when bignums and integers of the same value
are considered to be equivalent, in particular that the situation is
different depending on the data model in use.  This could probably fit
nicely into general discussion of handling the multiple possible
serializations of various data items.
> 
> I would consider (but am not sure if I would end up adding) a mention that
CBOR can convey time values, and thus that protocols using CBOR to convey
time values are likely to rely on a source of accurate time.
> 
> I might incorporate by reference the RFC 4648 security considerations
since we talk about base64 in several places.
> 
> Protocols using CBOR text strings will likely have internationalization
considerations; whether CBOR itself should mention this is not entirely
clear to me.
> 
> The potential loss of (e.g., type) information when converting from CBOR
to JSON is probably worth a mention, noting that applications performing
such conversions should consider whether they are affected and/or it's
desired to include specific type information in the generated JSON.
> 
>    numbers may exceed linear effort.  Also, some hash-table
>    implementations that are used by decoders to build in-memory
>    representations of maps can be attacked to spend quadratic effort,
>    unless a secret key (see Section 7 of [SIPHASH]) or some other
>    mitigation is employed.  Such superlinear efforts can be exploited by
> 
> It seems likely that an alternate reference not behind a paywall would be
usable to make this point.
> 
> Section 11.2
> 
> Would not [PCRE] need to be normative (if that functionality remains, per
the DISCUSS)?
> 
> Appendix A
> 
> [I did not verify the examples.]
> 
>    ATTIC FIFTY STATERS).  (Note that all these single-character strings
>    could also be represented in native UTF-8 in diagnostic notation,
>    just not in an ASCII-only specification like the present one.)  In
> 
> The present specification is not ASCII-only...
> 
> Appendix C
> 
>      return 0;                     // no break out
> 
> Should this be 'return mt'?  IIUC the return value is a message type or -1
for the break code, and errors are indicated out of band via fail().
> 
>    void encode_sint(int64_t n) {
>      uint64t ui = n >> 63;    // extend sign to whole length
>      mt = ui & 0x20;          // extract major type
> 
> If this is supposed to be C, you probably want to declare mt.
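
A variant that declares everything and avoids right-shifting a signed value could look like the sketch below. It is not the draft's code: it only computes the major type and argument and leaves the byte output aside, and the names head_t and sint_head are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned mt;    /* major type: 0 (unsigned) or 1 (negative) */
    uint64_t arg;   /* argument to be serialized in the head */
} head_t;

/* Map a signed integer to CBOR major type and argument.
 * Converting a negative int64_t to uint64_t is defined modulo 2**64,
 * so ~(uint64_t)n equals -1 - n without any signed shift. */
head_t sint_head(int64_t n)
{
    head_t h;

    if (n < 0) {
        h.mt  = 1;
        h.arg = ~(uint64_t)n;
    } else {
        h.mt  = 0;
        h.arg = (uint64_t)n;
    }
    return h;
}
```

For instance, -500 maps to major type 1 with argument 499, and INT64_MIN maps to major type 1 with argument 9223372036854775807.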
> 
> Appendix F
> 
>    *  major type 7, additional information 24, value < 32 (incorrect or
>       incorrectly encoded simple type)
> 
> I see "incorrectly encoded", but I'm not sure I understand what is meant
by "incorrect simple type".
> 
> Appendix G.3
> 
>    integers and floating point values.  Experience from implementation
>    and use now suggested that the separation between these two number
>    domains should be more clearly drawn in the document; language that
>    suggested an integer could seamlessly stand in for a floating point
>    value was removed.  Also, a suggestion (based on I-JSON [RFC7493])
> 
> So instead we have skew between the generic data model and the extended
model, where the generic model thinks some numbers are different that the
extended model treats as the same.  Should we mention that here as well?
> 