Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Mon, 28 September 2020 05:25 UTC

Return-Path: <kaduk@mit.edu>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A9B423A0E3C; Sun, 27 Sep 2020 22:25:53 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.796
X-Spam-Level:
X-Spam-Status: No, score=-1.796 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_HEX=0.1] autolearn=no autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id O9z1GUOJZ0Hi; Sun, 27 Sep 2020 22:25:51 -0700 (PDT)
Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9DE343A0E31; Sun, 27 Sep 2020 22:25:50 -0700 (PDT)
Received: from kduck.mit.edu ([24.16.140.251]) (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU) by outgoing.mit.edu (8.14.7/8.12.4) with ESMTP id 08S5PQ63018035 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 28 Sep 2020 01:25:28 -0400
Date: Sun, 27 Sep 2020 22:25:25 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Paul Hoffman <paul.hoffman@icann.org>
Cc: The IESG <iesg@ietf.org>, "cbor@ietf.org" <cbor@ietf.org>
Message-ID: <20200928052525.GJ89563@kduck.mit.edu>
References: <159951641517.13535.7424396818917958932@ietfa.amsl.com> <D0F65E21-94A6-4987-ABCB-BE2B76E8BAAC@icann.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <D0F65E21-94A6-4987-ABCB-BE2B76E8BAAC@icann.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/kmCiVQhtMlC96Ne0K3gy_nikudY>
Subject: Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Sep 2020 05:25:54 -0000

Hi Paul,

Sorry for the week's delay -- you caught me as I was recovering from the
tail end of some bug and pressed to meet the telechat deadline.

On Mon, Sep 21, 2020 at 03:59:39PM +0000, Paul Hoffman wrote:
> Here are our responses to the COMMENTs. This is a follow-up to our earlier
> responses to the DISCUSSes.
> 
> --Paul Hoffman
> 
> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> > 
> > Is there a comprehensive list of things that generic (en/de)coders need
> > to document their behavior for (e.g., how they handle duplicate map
> > keys; whether/what validity checking is done, including which tag
> > numbers are supported)?
> 
> No.  Such a document coming from the CBOR WG in the future could be useful.

Definitely.  (But it sounds like there isn't one in the works yet.)

> > We use the expression "simple value" around 30 times, but "simple type
> > value" only twice (and "simple type" a few other times); are we happy
> > with the consistency of usage?
> 
> We simplified to the common expression in the first case and reworded
> the second; we used the opportunity to increase the consistency of the
> terminology some more.

Thanks!

> > Please also note my comments on the IANA considerations in the
> > per-section comments; at least the first couple are fairly
> > consequential.
> 
> Noted and commented on.
> 
> > I'm pretty sympathetic to the secdir reviewer's desire for guidance on
> > how to implement validity checking.  I think it would be possible to
> > slot this into the existing discussion of validity in §5.3/5.4, possibly as
> > an additional subsection reiterating that it's required to check the
> > bits in 5.3.1/5.3.2, and the expectation that such checks are likely to
> > be incomplete in the face of new tag number allocations.
> 
> Adding such guidance could indeed be useful, but not at this late stage in the
> document. It seems like good work for the CBOR WG to adopt in the future.
> 
> > Section 1.2
> > 
> > Where bit arithmetic or data types are explained, this document uses
> > the notation familiar from the programming language C, except that
> > 
> > In recent memory we've asked for some form of reference for "the
> > programming language C" (even though the concepts we draw on are likely
> > to remain invariant for anything called C).
> 
> We will add references to C that match those from the recently-approved RFC 8746.
> Done in <https://github.com/cbor-wg/CBORbis/pull/208>.
> 
> > Section 2
> > 
> > In the basic (un-extended) generic data model, a data item is one of:
> > [...]
> > *  a sequence of zero or more Unicode code points ("text string")
> > 
> > Hmm, since we use "data item" for both the abstract idea and the
> > representation-format version, this description is only precise for the
> > abstract version (the representation is further constrained to UTF-8).
> > I am not sure whether there is a concise way to accurately express this
> > state, though.
> 
> We have been using “encoded data item” for the latter; as we are talking about
> the model level, it should be reasonably clear that we are talking about the
> conceptual data item here.  (There is no great word for that; model-level data
> item is maybe another, as clumsy as conceptual.)

Okay, let's leave this as-is for want of a better alternative.

> > Section 3
> > 
> > The initial byte and any additional bytes consumed to construct the
> > argument are collectively referred to as the "head" of the data item.
> > 
> > side note: Interesting that we define "head" but do not use "tail" :)
> 
> There was no need for us to define tail/body in this spec.
> 
> > Section 3.3
> > 
> > meaning, as defined in Table 3.  Like the major types for integers,
> > items of this major type do not carry content data; all the
> > information is in the initial bytes.
> > 
> > (editorial) The "head", as it were, right?
> 
> Yes, we will add that as a parenthesis.
> 
> > Section 3.4
> > 
> > Conceptually, tags are interpreted in the generic data model, not at
> > (de-)serialization time.  A small number of tags (specifically, tag
> > number 25 and tag number 29) have been registered with semantics that
> > may require processing at (de-)serialization time: The decoder needs
> > 
> > I suggest adding additional language to reiterate that this is a
> > point-in-time statement (and thus that there may be other such tags in
> > existence).
> 
> We will replace "specifically" with "at this time"

"at the time of this writing" might age better, though I expect the RPC
staff to flag this, too.

> > This means these tags cannot be implemented on top of every generic
> > CBOR encoder/decoder (which might not reflect the serialization order
> > for entries in a map at the data model level and vice versa); their
> > implementation therefore typically needs to be integrated into the
> > generic encoder/decoder.  The definition of new tags with this
> > property is NOT RECOMMENDED.
> > 
> > So we should give guidance to the DEs for the registry in question to
> > that effect?
> 
> The DEs cannot enforce that — they would need a detailed understanding of the
> specification (if that is required for the range), and it is only NOT
> RECOMMENDED, so there may be a reason to deviate. Under the registration rules,
> the DEs could discuss the supposed need to not follow the NOT RECOMMENDED with
> the proposal authors.
> 
> > IANA allocated tag numbers 65535, 4294967295, and
> > 18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit).
> > These can be used as a convenience for implementers that want a
> > single integer to indicate either that a specific tag is present, or
> > the absence of a tag.  That allocation is described in Section 10 of
> > 
> > (editorial) Chasing the reference, I suggest that it is a "single
> > integer *data structure*" in the implementation's internal
> > representation; just reading this text alone left me confused as to how
> > this was intended to be used.
> 
> We will make this "...implementers that want a single integer data structure to
> indicate..."
> 
>     
> > [I-D.bormann-cbor-notable-tags].  These tags are not intended to
> > occur in actual CBOR data items; implementations may flag such an
> > occurrence as an error.
> > 
> > I could maybe see this as "MAY".
> 
> Good point. We will change "may flag to "MAY flag".
> 
> > Section 3.4.2
> > 
> > Thank you for mentioning leap seconds!
> > 
> > Note that platform types for date/time may include null or undefined
> > values, which may also be desirable at an application protocol level.
> > While emitting tag number 1 values with non-finite tag content values
> > (e.g., with NaN for undefined date/time values or with Infinite for
> > an expiry date that is not set) may seem an obvious way to handle
> > this, using untagged null or undefined is often a better solution.
> > Application protocol designers are encouraged to consider these cases
> > and include clear guidelines for handling them.
> > 
> > It's rather unfortunate that the text here doesn't provide any
> > justification for the claim of "better solution" (or reference to such
> > justification).
> 
> We will change:
>    using untagged null or undefined is often a better solution.
> To:
>    using untagged null or undefined avoids the use of non-finites and results
>    in a shorter encoding.

Thanks!

> > Section 3.4.3
> > 
> > occurs in a bignum when using preferred serialization).  Note that
> > this means the non-preferred choice of a bignum representation
> > instead of a basic integer for encoding a number is not intended to
> > have application semantics (just as the choice of a longer basic
> > integer representation than needed, such as 0x1800 for 0x00 does
> > not).
> > 
> > It may be "not intended to", but it does, if you're using a decoder in
> > the generic data model.  We should be sure to cover the security
> > considerations of this disparity (and the corresponding need for an
> > application using CBOR to specify the data model it uses).
> 
> We will add text to the Security Considerations section about this disparity.
> 
> > Section 3.4.5.3
> > 
> > Note that tag numbers 33 and 34 differ from 21 and 22 in that the
> > data is transported in base-encoded form for the former and in raw
> > byte string form for the latter.
> > 
> > Do we want to mention tag 23 as well (as being the raw byte string)?
> 
> Tag 23 does not have a direct equivalent in the way that tags 33 and 34 have in
> tags 21 and 22. Mentioning the lack of a direct equivalent doesn't seem
> necessary and may actually cause confusion.
> 
> > Section 4.2.1
> > 
> > [I did not validate the hex-encoded IEEE754 against the decimal values.]
> > 
> > *  Indefinite-length items MUST NOT appear.  They can be encoded as
> >   definite-length items instead.
> > 
> > One could perhaps argue that a deterministic encoding procedure that
> > uses indefinite-length items is possible, and even useful in some cases.
> > This might argue for moving this requirement to Section 4.2.2's
> > list of "additional considerations".  That said, an application is not
> > obligated to use these core rules and can define its own rules if
> > needed, so I don't object to this requirement.
> 
> We believe this requirement should stay with the core requirements.

Okay.  Thanks for thinking about it.

> > Section 4.2.3
> > 
> > (Although [RFC7049] used the term "Canonical CBOR" for its form of
> > requirements on deterministic encoding, this document avoids this
> > term because "canonicalization" is often associated with specific
> > uses of deterministic encoding only.  The terms are essentially
> > interchangeable, however, and the set of core requirements in this
> > document could also be called "Canonical CBOR", while the length-
> > first-ordered version of that could be called "Old Canonical CBOR".)
> > 
> > If this document avoids the term, maybe the final sentence should not be
> > present?
> 
> It is not easy to manage the language that implementers will use. With documents
> such as RFC 8785, people will continue to use “canonical”, and we want to avoid
> a false alternative between “canonical” (old) and “deterministic” (new) because
> it might make people choose “canonical” for no good reason.
> 
> > Section 5
> > 
> > CBOR-based protocols MUST specify how their decoders handle invalid
> > and other unexpected data.  CBOR-based protocols MAY specify that
> > they treat arbitrary valid data as unexpected.  Encoders for CBOR-
> > based protocols MUST produce only valid items, that is, the protocol
> > cannot be designed to make use of invalid items.  An encoder can be
> > 
> > Just to check: my interpretation is that CBOR Sequences are compatible
> > with this requirement, since they use valid data items and just encode
> > them in sequence.  Right?
> 
> A CBOR sequence in general (except when it is exactly one data item) is not a
> well-formed (and therefore not a valid) data item. However, a CBOR sequence may
> be required by the application to be composed of valid data items.
> 
> > Section 5.1
> > 
> > Other decoders can present partial information about a top-level data
> > item to an application, such as the nested data items that could
> > already be decoded, or even parts of a byte string that hasn't
> > completely arrived yet.
> > 
> > This has potential to make some security types antsy, if coupled with
> > encryption mechanisms that release alleged plaintext prior to
> > authenticity check.  It's not immediately clear that this text needs to
> > change, though if it's also not a key point, perhaps it is easier to
> > just drop the mention rather than think about it more, though I'd also
> > be happy to see discussion of issues with streaming decryption in the
> > security considerations section.
> 
> We will will add:
>    Such an application also MUST have matching streaming security mechanism,
>    where the desired protection is available for incremental data presented
>    to the application.

(Not that I am complaining, but in some cases the "desired protection" may
be nil, making the MUST perhaps debatable vs lowercase must.)

> 
> > Section 5.3
> > 
> > A CBOR-based protocol MUST specify which of these options its
> > decoders take, for each kind of invalid item they might encounter.
> > 
> > Are the lists of types of validity error presented in the following
> > subsections exhaustive for the respective data models?  If so, it might
> > be worth mentioning that explicitly.
> 
> The WG worked hard on the lists of types of validity errors, and it is meant to
> be complete, but it would be dangerous to declare they are exhaustive.

I can understand that and sympathize!

> > Section 5.4
> > 
> > *  It can report an error (and not return data).  Note that this
> >   error is not a validity error per se.  This kind of error is more
> >   likely to be raised by a decoder that would be performing validity
> >   checking if this were a known case.
> > 
> > (soapbox) Could we maybe be a little less encouraging of this behavior?
> > I am remembering horror stories of TLS stacks that did this for
> > extension types, which is an interoperability nightmare.  I recognize
> > that there are cases where it is the desired behavior, but in the
> > general case tags are an extensibility point and we shouldn't encourage
> > that joint to rust shut.
> 
> We will add a sentence after the first sentence in the bullet point:
>    Note that treating this case as an error can cause ossification, and is
>    thus not encouraged.

Thanks!

> > Section 5.6.1
> > 
> > As discussed in Section 2.2, specific data models can make values
> > equivalent for the purpose of comparing map keys that are distinct in
> > the generic data model.  Note that this implies that a generic
> > decoder may deliver a decoded map to an application that needs to be
> > checked for duplicate map keys by that application (alternatively,
> > the decoder may provide a programming interface to perform this
> > service for the application).  Specific data models cannot
> > distinguish values for map keys that are equal for this purpose at
> > the generic data model level.
> > 
> > This last bit seems like something that is forbidden by the protocol (vs
> > "cannot"); I wonder if a slight rewording is in order.
> 
> We will change the sentence to read "Specific data models are not able to
> distinguish values..."
> 
> > Section 6.2
> > 
> > *  Numbers with fractional parts are represented as floating-point
> >   values, performing the decimal-to-binary conversion based on the
> >   precision provided by IEEE 754 binary64.  Then, when encoding in
> > 
> > I forget if this conversion requires round-to-nearest or if multiple
> > rounding modes are available (the latter would of course be problematic
> > if we proceed on to the "can be represented in smaller float without
> > changing value" step).
> 
> We will add "The mathematical value of the JSON number is converted to binary64
> using the roundTiesToEven procedure in Section 4.3.1 of [IEEE754] and update
> the reference to point to the 2019 version.

Great; thanks.

> > Section 8
> > 
> > The notation borrows the JSON syntax for numbers (integer and
> > floating-point), True (>true<), False (>false<), Null (>null<), UTF-8
> > 
> > (soapbox) Is literal '>' and '<' really the best quoting strategy here
> > (and later on)?
> 
> Yes. The actual best quoting would be using a font change, but that will be
> confusing for readers of some formats of the eventual RFC. Using quotation marks
> would definitely cause confusion in some readers that > and < do not.
> 
> > Section 9.1, 9.2
> > 
> > What guidance can we give to the experts?
> 
> We can't give any long-term guidance. Saying something like "it is good to
> balance the need for conserving the extensibility by keeping code points
> available and the need to react to requirements of applications" is possible,
> but is not actually useful and would simply become an attractive target for
> aggrieved applicants.
> 
> > Section 9.3
> > 
> > Applications that use this media type:  None yet, but it is expected
> >   that this format will be deployed in protocols and applications.
> > 
> > I don't believe this to be currently accurate.
> 
> Good catch. We will change this to "Many"
>     
> 
> > Additional information:  *  Magic number(s): n/a
> > 
> > I guess 0xd9d9f7 doesn't count, then?
> 
> It is far from a universal magic number.
> 
> > Section 9.4
> > 
> > The CoAP Content-Format for CBOR is defined in
> > [IANA.core-parameters]:
> > 
> > Is "defined in" the right way to word this?
> 
> We will change "defined in" to "registered in".
> 
> > Section 10
> > 
> > I guess the attack where you use indefinite-length encoding to achieve
> > total 'n' greater than 2**64 is not really practical at present...
> 
> Correct.
> 
> > Please add a mention of the risks of mixing a constrained decoder with a
> > variant (non-preferred-serialization) encoder, as alluded to in Section
> > 4.1.
> 
> We will add the following:
> 
> Section 4.1 gives examples of limitations in interoperability when using a
> constrained CBOR decoder with input from a CBOR encoder that uses a
> non-preferred serialization. When a single data item is consumed both by such a
> constrained decoder and a full decoder, it can lead to security issues that can
> be exploited by an attacker who can inject or manipulate content.

Excellent!

> > I also mention this down in G.3, but there seem to be some relevant
> > considerations regarding whether/when bignums and integers of the same
> > value are considered to be equivalent, in particular that the situation
> > is different depending on the data model in use.  This could probably
> > fit nicely into general discussion of handling the multiple possible
> > serializations of various data items.
> 
> We will add the following:
> 
> As discussed throughout this document, there are many values that can be
> considered "equivalent" in some circumstances and "not equivalent" in others. As
> just one example, the numeric value for the number "one" might be expressed as
> an integer or a bignum. A system interpreting CBOR input might accept either
> form for the number "one", or might reject one (or both) forms. Such acceptance
> or rejection can have security implications in the program that is using the
> interpreted input.
>     
> 
> > I would consider (but am not sure if I would end up adding) a mention
> > that CBOR can convey time values, and thus that protocols using CBOR to
> > convey time values are likely to rely on a source of accurate time.
> 
> This would only come up when an implementation compared a time value that had
> been encoded with CBOR with an actual clock. That situation would be identical
> to the time value encoded in any encoding system, and is thus not special to any
> encoding system.
> 
> > I might incorporate by reference the RFC 4648 security considerations
> > since we talk about base64 in several places.
> 
> Good point; added.
> 
> > Protocols using CBOR text strings will likely have internationalization
> > considerations; whether CBOR itself should mention this is not entirely
> > clear to me.
> 
> CBOR does not add any security considerations other than those already in UTF8.
> We will add a pointer to those.
> 
> > The potential loss of (e.g., type) information when converting from CBOR
> > to JSON is probably worth a mention, noting that applications performing
> > such conversions should consider whether they are affected and/or it's
> > desired to include specific type information in the generated JSON.
> 
> There are multiple ways to do this, which is one thing we could point out.
> 
> We will add:
>    It is common to convert CBOR data to other formats. In many cases, CBOR
>    has more expressive types than other formats; this is particularly
>    true for the common conversion to JSON. The loss of type information can
>    cause security issues for the systems that are processing the
>    less-expressive data.
> 
> > numbers may exceed linear effort.  Also, some hash-table
> > implementations that are used by decoders to build in-memory
> > representations of maps can be attacked to spend quadratic effort,
> > unless a secret key (see Section 7 of [SIPHASH]) or some other
> > mitigation is employed.  Such superlinear efforts can be exploited by
> > 
> > It seems likely that an alternate reference not behind a paywall would
> > be usable to make this point.
> 
> We will change to use the current reference and add a reference to one that is
> apparently maintained by one of the authors:
> https://131002.net/siphash/siphash.pdf

Thanks; all these security considerations changes look really good in the
-15.

> 
> > Section 11.2
> > 
> > Would not [PCRE] need to be normative (if that functionality remains,
> > per the DISCUSS)?
> 
> Please see the discussion under the DISCUSS for this.
> 
> > Appendix A
> > 
> > [I did not verify the examples.]
> > 
> > ATTIC FIFTY STATERS).  (Note that all these single-character strings
> > could also be represented in native UTF-8 in diagnostic notation,
> > just not in an ASCII-only specification like the present one.)  In
> > 
> > The present specification is not ASCII-only...
> 
> Good catch. We will remove "like the present one".
> 
> > Appendix C
> > 
> >  return 0;                     // no break out
> > 
> > Should this be 'return mt'?  IIUC the return value is a message type
> > or -1 for the break code, and errors are indicated out of band via
> > fail().
> 
> This was caught earlier by Stuart Cheshire, and we have a more complete fix
> already in the repo (PR #203).
> 
> > void encode_sint(int64_t n) {
> >  uint64t ui = n >> 63;    // extend sign to whole length
> >  mt = ui & 0x20;          // extract major type
> > 
> > If this is supposed to be C, you probably want to declare mt.
> 
> It is labeled as pseudo-code (hard to compile with an ellipsis in there).
> But it doesn’t hurt to declare mt.
> 
> This is pseudocode, not actual C. However, we will declare mt in it to make it
> clearer.
> 
> > 
> > Appendix F
> > 
> > *  major type 7, additional information 24, value < 32 (incorrect or
> >   incorrectly encoded simple type)
> > 
> > I see "incorrectly encoded", but I'm not sure I understand what is meant
> > by "incorrect simple type".
> 
> We will change this to "(incorrect)".
> 
> > 
> > Appendix G.3
> > 
> > integers and floating point values.  Experience from implementation
> > and use now suggested that the separation between these two number
> > domains should be more clearly drawn in the document; language that
> > suggested an integer could seamlessly stand in for a floating point
> > value was removed.  Also, a suggestion (based on I-JSON [RFC7493])
> > 
> > So instead we have skew between the generic data model and the extended
> > model, where the generic model thinks some numers are different that the
> > extended model treats as the same.  Should we mention that here as well?
> 
> That is not really a change from RFC 7049; clearly identifying the data models
> is a new way to discuss CBOR that is used in many places in the current
> document.
> 

Okay.

Thanks for all the updates; I will go update my ballot position now.

-Ben