Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Mon, 07 September 2020 23:35 UTC

Date: Mon, 7 Sep 2020 16:35:38 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Jim Schaad <ietf@augustcellars.com>
Cc: "'The IESG'" <iesg@ietf.org>, cbor@ietf.org, draft-ietf-cbor-7049bis@ietf.org, cbor-chairs@ietf.org, francesca.palombini@ericsson.com
Message-ID: <20200907233538.GF16914@kduck.mit.edu>
References: <159951641517.13535.7424396818917958932@ietfa.amsl.com> <029f01d6856d$39e1b0f0$ada512d0$@augustcellars.com>
In-Reply-To: <029f01d6856d$39e1b0f0$ada512d0$@augustcellars.com>
User-Agent: Mutt/1.12.1 (2019-06-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/OHtc_aRsuLZ6yikWTn7031r-YNU>

Hi Jim,

Thanks for the replies (inline).

On Mon, Sep 07, 2020 at 04:18:38PM -0700, Jim Schaad wrote:
> 
> 
> -----Original Message-----
> From: CBOR <cbor-bounces@ietf.org> On Behalf Of Benjamin Kaduk via Datatracker
> Sent: Monday, September 7, 2020 3:07 PM
> To: The IESG <iesg@ietf.org>
> Cc: cbor@ietf.org; draft-ietf-cbor-7049bis@ietf.org; cbor-chairs@ietf.org; francesca.palombini@ericsson.com
> Subject: [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)
> 
> Benjamin Kaduk has entered the following ballot position for
> draft-ietf-cbor-7049bis-14: Discuss
> 
> The document, along with other ballot positions, can be found here:
> https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis/
> 
> 
> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> Thanks for this document; it's generally well-written and the changes since 7049 are helpful.  I do have a few points that may need discussion before publication, though.
> 
> Let's discuss whether the framing of tag number 35 for "regular expressions that are roughly in [PCRE] form or a version of the JavaScript regular expression syntax" meets the interoperability expectations for Internet Standard status (bearing in mind that we are defining a data format and not a protocol).  I note that it is okay to leave the codepoint allocated with the current meaning and the previous document as its reference, but decline to discuss it in the document going for STD (we are in the process of doing that with COSE countersignatures at the moment).
> 
> The example in Section 5 of "the item is a map that has byte strings for keys and contains at least one pair whose key is 0xab01" seems to be in violation of the definition of a valid map, since applications are not allowed to rely on invalid behavior.  (That is, the implied "more than one pair whose key is 0xab01" would be invalid.)
> 
> [JLS] This was not how I read the text; I read it as "not zero".  However, it would make sense to make this edit.

I agree that if you read it as "not zero" it makes sense, but rewording
seems prudent.

> I think that the new deterministic encoding rules for sorting map keys should be clear about whether "no content" sorts before or after "content present" -- that is, how 0x10 and 0x1020 are ordered when the 0x10 byte is identical and we have to compare <nothing> with 0x20.
> 
> [JLS] I don't see this.  Going out to Wikipedia (the true answer for everything).  The rule that is listed there is:
> "If two words have different lengths, the usual lexicographical order pads the shorter one with "blanks" (a special symbol that is treated as smaller than every element of A) until the words are the same length, and then the words are compared as in the previous case."
> 
> This means that if A is a prefix of B and |A| < |B|, A precedes B in lexicographic order.
> [/JLS]

That was my expectation as well, but is this truly "textbook knowledge"
that we can assume is known without precondition or reference?
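
For concreteness, the rule Jim quotes (a key that is a proper prefix of
another key sorts first) could be written down roughly as below.  This is
only an illustrative sketch, not text from the draft; the function name
cmp_encoded_keys is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the draft): bytewise lexicographic
 * comparison of two encoded map keys.  Returns <0, 0, or >0. */
int cmp_encoded_keys(const uint8_t *a, size_t alen,
                     const uint8_t *b, size_t blen)
{
    size_t common = alen < blen ? alen : blen;
    int r = common ? memcmp(a, b, common) : 0;

    if (r != 0)
        return r;               /* the first differing byte decides */
    /* One key is a prefix of the other: "no content" sorts first. */
    return (alen > blen) - (alen < blen);
}
```

With this comparison, 0x10 orders before 0x1020, which is the outcome we
both expected; the open question is only whether the document should say so.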

> The discussion in Appendix C suggests that C (programming language) implementations all use two's-complement representation of signed integers; this requirement is present in POSIX but not C itself (I verified this for C99 and C11).
> 
> Additionally, the encode_sint() function (also Appendix C) relies on C implementation-defined behavior while right-shifting a signed integer.
> 
> The C decode_half() function in Appendix D assumes that 'int' is wider than 16 bits (since assigning a value to an int16_t variable when the value is not representable in int16_t incurs implementation-defined behavior).  Given that this spec is specifically targeting constrained devices, it's not clear that such an assumption is justified.  (It also right shifts a signed integer, incurring the same implementation-defined behavior mentioned above.  (The bitwise AND against 0x8000 is also problematic for an int16_t.))
> 
> [JLS] I'll let others argue this.  But I think that if you don't use two's complement then you most likely break a lot of code.

I am assuming that the resolution for the C vs POSIX point will just be to
say that we further assume two's complement integer representations; AFAIK
(or, rather, as far as Wikipedia knows), some UNIVAC descendants are the
only things using one's complement, and nobody uses sign-and-magnitude.

(The other two are not really dependent on signed integer representation.)
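
As a rough illustration of what I mean for those other two, the sketch
below does all of the bit manipulation on unsigned types, so it needs
neither a right shift of a negative value nor an 'int' wider than 16 bits.
It is not the draft's code: split_sint and decode_half_portable are names
invented here, split_sint only computes the major-type bits and the
argument (it does not emit any bytes), and the small power-of-two loop
stands in for ldexp() purely to keep the sketch independent of libm.

```c
#include <math.h>               /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Hypothetical stand-in for the sign handling in encode_sint():
 * yields the major-type bits (0x00 or 0x20) and the argument. */
void split_sint(int64_t n, uint8_t *mt, uint64_t *arg)
{
    uint64_t u = (uint64_t)n;   /* conversion is defined modulo 2^64 */

    if (n < 0) {
        *mt = 0x20;             /* major type 1: negative integer */
        *arg = ~u;              /* equals -1 - n, no signed shift needed */
    } else {
        *mt = 0x00;             /* major type 0: unsigned integer */
        *arg = u;
    }
}

/* Exact power of two for the small exponents used below. */
static double pow2(int e)
{
    double r = 1.0;

    for (; e > 0; e--)
        r *= 2.0;
    for (; e < 0; e++)
        r /= 2.0;
    return r;
}

/* Hypothetical half-precision decoder using only unsigned arithmetic. */
double decode_half_portable(const unsigned char *halfp)
{
    unsigned half = ((unsigned)halfp[0] << 8) | halfp[1];
    unsigned exp = (half >> 10) & 0x1fu;
    unsigned mant = half & 0x3ffu;
    double val;

    if (exp == 0)
        val = (double)mant * pow2(-24);          /* zero and subnormals */
    else if (exp != 31)
        val = (double)(mant + 1024u) * pow2((int)exp - 25);
    else
        val = mant == 0 ? INFINITY : NAN;
    return (half & 0x8000u) ? -val : val;
}
```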

-Ben

> 
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
> 
> Is there a comprehensive list of things that generic (en/de)coders need to document their behavior for (e.g., how they handle duplicate map keys; whether/what validity checking is done, including which tag numbers are supported)?
> 
> We use the expression "simple value" around 30 times, but "simple type value" only twice (and "simple type" a few other times); are we happy with the consistency of usage?
> 
> Please also note my comments on the IANA considerations in the per-section comments; at least the first couple are fairly consequential.
> 
> I'm pretty sympathetic to the secdir reviewer's desire for guidance on how to implement validity checking.  I think it would be possible to slot this into the existing discussion of validity in §5.3/5.4, possibly as an additional subsection reiterating that it's required to check the bits in 5.3.1/5.3.2, and the expectation that such checks are likely to be incomplete in the face of new tag number allocations.
> 
> Section 1.2
> 
>    Where bit arithmetic or data types are explained, this document uses
>    the notation familiar from the programming language C, except that
> 
> In recent memory we've asked for some form of reference for "the programming language C" (even though the concepts we draw on are likely to remain invariant for anything called C).
> 
> Section 2
> 
>    In the basic (un-extended) generic data model, a data item is one of:
>    [...]
>    *  a sequence of zero or more Unicode code points ("text string")
> 
> Hmm, since we use "data item" for both the abstract idea and the representation-format version, this description is only precise for the abstract version (the representation is further constrained to UTF-8).
> I am not sure whether there is a concise way to accurately express this state, though.
> 
> Section 3
> 
>    The initial byte and any additional bytes consumed to construct the
>    argument are collectively referred to as the "head" of the data item.
> 
> side note: Interesting that we define "head" but do not use "tail" :)
> 
> Section 3.3
> 
>    meaning, as defined in Table 3.  Like the major types for integers,
>    items of this major type do not carry content data; all the
>    information is in the initial bytes.
> 
> (editorial) The "head", as it were, right?
> 
> Section 3.4
> 
>    Conceptually, tags are interpreted in the generic data model, not at
>    (de-)serialization time.  A small number of tags (specifically, tag
>    number 25 and tag number 29) have been registered with semantics that
>    may require processing at (de-)serialization time: The decoder needs
> 
> I suggest adding additional language to reiterate that this is a point-in-time statement (and thus that there may be other such tags in existence).
> 
>    This means these tags cannot be implemented on top of every generic
>    CBOR encoder/decoder (which might not reflect the serialization order
>    for entries in a map at the data model level and vice versa); their
>    implementation therefore typically needs to be integrated into the
>    generic encoder/decoder.  The definition of new tags with this
>    property is NOT RECOMMENDED.
> 
> So we should give guidance to the DEs for the registry in question to that effect?
> 
>    IANA allocated tag numbers 65535, 4294967295, and
>    18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
>    These can be used as a convenience for implementers that want a
>    single integer to indicate either that a specific tag is present, or
>    the absence of a tag.  That allocation is described in Section 10 of
> 
> (editorial) Chasing the reference, I suggest that it is a "single integer *data structure*" in the implementation's internal representation; just reading this text alone left me confused as to how this was intended to be used.
> 
>    [I-D.bormann-cbor-notable-tags].  These tags are not intended to
>    occur in actual CBOR data items; implementations may flag such an
>    occurrence as an error.
> 
> I could maybe see this as "MAY".
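
One way to read the "single integer" usage, sketched under my own
assumptions: an implementation keeps one 64-bit slot per item for the tag
number and uses the reserved all-ones value to mean "no tag".  The names
CBOR_NO_TAG, tagged_int, has_tag and is_reserved_tag are hypothetical and
do not come from the draft or from [I-D.bormann-cbor-notable-tags].

```c
#include <stdint.h>

/* Reserved 64-bit all-ones tag number, used here as "no tag present". */
#define CBOR_NO_TAG UINT64_MAX

/* Hypothetical internal representation: one slot holds tag or sentinel. */
typedef struct {
    uint64_t tag;               /* CBOR_NO_TAG when the item is untagged */
    int64_t value;
} tagged_int;

int has_tag(const tagged_int *t)
{
    return t->tag != CBOR_NO_TAG;
}

/* A decoder that chooses to flag the reserved numbers on the wire
 * could test for them like this. */
int is_reserved_tag(uint64_t tag)
{
    return tag == UINT64_C(65535) ||
           tag == UINT64_C(4294967295) ||
           tag == UINT64_MAX;
}
```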
> 
> Section 3.4.2
> 
> Thank you for mentioning leap seconds!
> 
>    Note that platform types for date/time may include null or undefined
>    values, which may also be desirable at an application protocol level.
>    While emitting tag number 1 values with non-finite tag content values
>    (e.g., with NaN for undefined date/time values or with Infinite for
>    an expiry date that is not set) may seem an obvious way to handle
>    this, using untagged null or undefined is often a better solution.
>    Application protocol designers are encouraged to consider these cases
>    and include clear guidelines for handling them.
> 
> It's rather unfortunate that the text here doesn't provide any justification for the claim of "better solution" (or reference to such justification).
> 
> Section 3.4.3
> 
>    occurs in a bignum when using preferred serialization).  Note that
>    this means the non-preferred choice of a bignum representation
>    instead of a basic integer for encoding a number is not intended to
>    have application semantics (just as the choice of a longer basic
>    integer representation than needed, such as 0x1800 for 0x00 does
>    not).
> 
> It may be "not intended to", but it does, if you're using a decoder in the generic data model.  We should be sure to cover the security considerations of this disparity (and the corresponding need for an application using CBOR to specify the data model it uses).
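
To make the basic-integer half of that comparison concrete, here is a
minimal sketch of head decoding, with invented names decode_head and
is_preferred_head.  It shows that 0x1800 and 0x00 yield the same argument,
so the longer form is invisible after decoding.  My understanding is that
the bignum case differs because a generic decoder surfaces the tag; the
sketch does not attempt to cover that.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one definite-length head.  Returns the
 * number of bytes consumed, or 0 for truncated or unhandled input. */
size_t decode_head(const uint8_t *p, size_t len, uint8_t *mt, uint64_t *arg)
{
    uint8_t ai;
    size_t extra, i;

    if (len == 0)
        return 0;
    *mt = (uint8_t)(p[0] >> 5);         /* major type: top three bits */
    ai = p[0] & 0x1f;                   /* additional information */
    if (ai < 24) {
        *arg = ai;                      /* argument is in the initial byte */
        return 1;
    }
    if (ai > 27)
        return 0;                       /* 28..31 are not handled here */
    extra = (size_t)1 << (ai - 24);     /* 1, 2, 4 or 8 following bytes */
    if (len - 1 < extra)
        return 0;
    *arg = 0;
    for (i = 0; i < extra; i++)
        *arg = (*arg << 8) | p[1 + i];  /* network byte order */
    return 1 + extra;
}

/* True if head_len is the shortest head that can carry arg. */
int is_preferred_head(uint64_t arg, size_t head_len)
{
    size_t shortest = arg < 24 ? 1
                    : arg <= 0xffu ? 2
                    : arg <= 0xffffu ? 3
                    : arg <= 0xffffffffu ? 5 : 9;

    return head_len == shortest;
}
```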
> 
> Section 3.4.5.3
> 
>    Note that tag numbers 33 and 34 differ from 21 and 22 in that the
>    data is transported in base-encoded form for the former and in raw
>    byte string form for the latter.
> 
> Do we want to mention tag 23 as well (as being the raw byte string)?
> 
> Section 4.2.1
> 
> [I did not validate the hex-encoded IEEE754 against the decimal values.]
> 
>    *  Indefinite-length items MUST NOT appear.  They can be encoded as
>       definite-length items instead.
> 
> One could perhaps argue that a deterministic encoding procedure that uses indefinite-length items is possible, and even useful in some cases.
> This might argue for moving this requirement to Section 4.2.2's list of "additional considerations".  That said, an application is not obligated to use these core rules and can define its own rules if needed, so I don't object to this requirement.
> 
> Section 4.2.3
> 
>    (Although [RFC7049] used the term "Canonical CBOR" for its form of
>    requirements on deterministic encoding, this document avoids this
>    term because "canonicalization" is often associated with specific
>    uses of deterministic encoding only.  The terms are essentially
>    interchangeable, however, and the set of core requirements in this
>    document could also be called "Canonical CBOR", while the length-
>    first-ordered version of that could be called "Old Canonical CBOR".)
> 
> If this document avoids the term, maybe the final sentence should not be present?
> 
> Section 5
> 
>    CBOR-based protocols MUST specify how their decoders handle invalid
>    and other unexpected data.  CBOR-based protocols MAY specify that
>    they treat arbitrary valid data as unexpected.  Encoders for CBOR-
>    based protocols MUST produce only valid items, that is, the protocol
>    cannot be designed to make use of invalid items.  An encoder can be
> 
> Just to check: my interpretation is that CBOR Sequences are compatible with this requirement, since they use valid data items and just encode them in sequence.  Right?
> 
> Section 5.1
> 
>    Other decoders can present partial information about a top-level data
>    item to an application, such as the nested data items that could
>    already be decoded, or even parts of a byte string that hasn't
>    completely arrived yet.
> 
> This has the potential to make some security types antsy if coupled with encryption mechanisms that release alleged plaintext prior to the authenticity check.  It's not immediately clear that this text needs to change.  If it's not a key point, though, perhaps it is easier to just drop the mention rather than think about it more; I'd also be happy to see discussion of issues with streaming decryption in the security considerations section.
> 
> Section 5.3
> 
>    A CBOR-based protocol MUST specify which of these options its
>    decoders take, for each kind of invalid item they might encounter.
> 
> Are the lists of types of validity error presented in the following subsections exhaustive for the respective data models?  If so, it might be worth mentioning that explicitly.
> 
> Section 5.4
> 
>    *  It can report an error (and not return data).  Note that this
>       error is not a validity error per se.  This kind of error is more
>       likely to be raised by a decoder that would be performing validity
>       checking if this were a known case.
> 
> (soapbox) Could we maybe be a little less encouraging of this behavior?
> I am remembering horror stories of TLS stacks that did this for extension types, which is an interoperability nightmare.  I recognize that there are cases where it is the desired behavior, but in the general case tags are an extensibility point and we shouldn't encourage that joint to rust shut.
> 
> Section 5.6.1
> 
>    As discussed in Section 2.2, specific data models can make values
>    equivalent for the purpose of comparing map keys that are distinct in
>    the generic data model.  Note that this implies that a generic
>    decoder may deliver a decoded map to an application that needs to be
>    checked for duplicate map keys by that application (alternatively,
>    the decoder may provide a programming interface to perform this
>    service for the application).  Specific data models cannot
>    distinguish values for map keys that are equal for this purpose at
>    the generic data model level.
> 
> This last bit seems like something that is forbidden by the protocol (vs "cannot"); I wonder if a slight rewording is in order.
> 
> Section 6.2
> 
>    *  Numbers with fractional parts are represented as floating-point
>       values, performing the decimal-to-binary conversion based on the
>       precision provided by IEEE 754 binary64.  Then, when encoding in
> 
> I forget if this conversion requires round-to-nearest or if multiple rounding modes are available (the latter would of course be problematic if we proceed on to the "can be represented in smaller float without changing value" step).
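
For reference, the later "smaller float without changing value" step can
be sketched as below.  This covers only the shrink test, not the
decimal-to-binary conversion where the rounding-mode question arises; a
binary64 value that is exactly representable in binary32 converts exactly
whatever the rounding mode.  The function name fits_in_binary32 is
invented, and NaN is deliberately left out of scope.

```c
#include <float.h>

/* Hypothetical check: does d survive a round trip through binary32? */
int fits_in_binary32(double d)
{
    float f;

    if (d != d)
        return 0;               /* NaN: payload handling is out of scope */
    if (d > DBL_MAX || d < -DBL_MAX)
        return 1;               /* infinities exist in every binary format */
    if (d > FLT_MAX || d < -FLT_MAX)
        return 0;               /* finite but outside the binary32 range */
    f = (float)d;               /* in range, so the conversion is defined */
    return (double)f == d;
}
```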
> 
> Section 8
> 
>    The notation borrows the JSON syntax for numbers (integer and
>    floating-point), True (>true<), False (>false<), Null (>null<), UTF-8
> 
> (soapbox) Is literal '>' and '<' really the best quoting strategy here (and later on)?
> 
> Section 9.1, 9.2
> 
> What guidance can we give to the experts?
> 
> Section 9.3
> 
>    Applications that use this media type:  None yet, but it is expected
>       that this format will be deployed in protocols and applications.
> 
> I don't believe this to be currently accurate.
> 
>    Additional information:  *  Magic number(s): n/a
> 
> I guess 0xd9d9f7 doesn't count, then?
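
For context, 0xd9d9f7 is the three-byte head of tag number 55799
(self-described CBOR), and a sniffing check for it is about as simple as
magic numbers get.  A minimal sketch, with the invented name
has_self_described_prefix:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf starts with 0xd9 0xd9 0xf7, the head of tag 55799. */
int has_self_described_prefix(const uint8_t *buf, size_t len)
{
    return len >= 3 &&
           buf[0] == 0xd9 && buf[1] == 0xd9 && buf[2] == 0xf7;
}
```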
> 
> Section 9.4
> 
>    The CoAP Content-Format for CBOR is defined in
>    [IANA.core-parameters]:
> 
> Is "defined in" the right way to word this?
> 
> Section 10
> 
> I guess the attack where you use indefinite-length encoding to achieve total 'n' greater than 2**64 is not really practical at present...
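
Impractical or not, a decoder that sums chunk lengths can guard against
the wraparound cheaply.  A minimal sketch, assuming the decoder keeps a
running 64-bit total; add_chunk_len is a name invented for this example.

```c
#include <stdint.h>

/* Adds chunk_len to *total.  Returns 1 on success, or 0 (leaving
 * *total unchanged) if the sum would not fit in 64 bits. */
int add_chunk_len(uint64_t *total, uint64_t chunk_len)
{
    if (chunk_len > UINT64_MAX - *total)
        return 0;               /* would wrap around: reject */
    *total += chunk_len;
    return 1;
}
```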
> 
> Please add a mention of the risks of mixing a constrained decoder with a variant (non-preferred-serialization) encoder, as alluded to in Section 4.1.
> 
> I also mention this down in G.3, but there seem to be some relevant considerations regarding whether/when bignums and integers of the same value are considered to be equivalent, in particular that the situation is different depending on the data model in use.  This could probably fit nicely into general discussion of handling the multiple possible serializations of various data items.
> 
> I would consider (but am not sure if I would end up adding) a mention that CBOR can convey time values, and thus that protocols using CBOR to convey time values are likely to rely on a source of accurate time.
> 
> I might incorporate by reference the RFC 4648 security considerations since we talk about base64 in several places.
> 
> Protocols using CBOR text strings will likely have internationalization considerations; whether CBOR itself should mention this is not entirely clear to me.
> 
> The potential loss of (e.g., type) information when converting from CBOR to JSON is probably worth a mention, noting that applications performing such conversions should consider whether they are affected and/or it's desired to include specific type information in the generated JSON.
> 
>    numbers may exceed linear effort.  Also, some hash-table
>    implementations that are used by decoders to build in-memory
>    representations of maps can be attacked to spend quadratic effort,
>    unless a secret key (see Section 7 of [SIPHASH]) or some other
>    mitigation is employed.  Such superlinear efforts can be exploited by
> 
> It seems likely that an alternate reference not behind a paywall would be usable to make this point.
> 
> Section 11.2
> 
> Would not [PCRE] need to be normative (if that functionality remains, per the DISCUSS)?
> 
> Appendix A
> 
> [I did not verify the examples.]
> 
>    ATTIC FIFTY STATERS).  (Note that all these single-character strings
>    could also be represented in native UTF-8 in diagnostic notation,
>    just not in an ASCII-only specification like the present one.)  In
> 
> The present specification is not ASCII-only...
> 
> Appendix C
> 
>      return 0;                     // no break out
> 
> Should this be 'return mt'?  IIUC the return value is a message type or -1 for the break code, and errors are indicated out of band via fail().
> 
>    void encode_sint(int64_t n) {
>      uint64t ui = n >> 63;    // extend sign to whole length
>      mt = ui & 0x20;          // extract major type
> 
> If this is supposed to be C, you probably want to declare mt.
> 
> Appendix F
> 
>    *  major type 7, additional information 24, value < 32 (incorrect or
>       incorrectly encoded simple type)
> 
> I see "incorrectly encoded", but I'm not sure I understand what is meant by "incorrect simple type".
> 
> Appendix G.3
> 
>    integers and floating point values.  Experience from implementation
>    and use now suggested that the separation between these two number
>    domains should be more clearly drawn in the document; language that
>    suggested an integer could seamlessly stand in for a floating point
>    value was removed.  Also, a suggestion (based on I-JSON [RFC7493])
> 
> So instead we have skew between the generic data model and the extended model, where the generic model treats as different some numbers that the extended model treats as the same.  Should we mention that here as well?
> 
> 
> 