Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)
Benjamin Kaduk <kaduk@mit.edu> Mon, 28 September 2020 05:25 UTC
Return-Path: <kaduk@mit.edu>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A9B423A0E3C; Sun, 27 Sep 2020 22:25:53 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.796
X-Spam-Level:
X-Spam-Status: No, score=-1.796 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_HEX=0.1] autolearn=no autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id O9z1GUOJZ0Hi; Sun, 27 Sep 2020 22:25:51 -0700 (PDT)
Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 9DE343A0E31; Sun, 27 Sep 2020 22:25:50 -0700 (PDT)
Received: from kduck.mit.edu ([24.16.140.251]) (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU) by outgoing.mit.edu (8.14.7/8.12.4) with ESMTP id 08S5PQ63018035 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 28 Sep 2020 01:25:28 -0400
Date: Sun, 27 Sep 2020 22:25:25 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Paul Hoffman <paul.hoffman@icann.org>
Cc: The IESG <iesg@ietf.org>, "cbor@ietf.org" <cbor@ietf.org>
Message-ID: <20200928052525.GJ89563@kduck.mit.edu>
References: <159951641517.13535.7424396818917958932@ietfa.amsl.com> <D0F65E21-94A6-4987-ABCB-BE2B76E8BAAC@icann.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <D0F65E21-94A6-4987-ABCB-BE2B76E8BAAC@icann.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/kmCiVQhtMlC96Ne0K3gy_nikudY>
Subject: Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draft-ietf-cbor-7049bis-14: (with DISCUSS and COMMENT)
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Sep 2020 05:25:54 -0000
Hi Paul, Sorry for the week's delay -- you caught me as I was recovering from the tail end of some bug and pressed to meet the telechat deadline. On Mon, Sep 21, 2020 at 03:59:39PM +0000, Paul Hoffman wrote: > Here are our responses to the COMMENTs. This is a follow-up to our earlier > responses to the DISCUSSes. > > --Paul Hoffman > > > ---------------------------------------------------------------------- > > COMMENT: > > ---------------------------------------------------------------------- > > > > Is there a comprehensive list of things that generic (en/de)coders need > > to document their behavior for (e.g., how they handle duplicate map > > keys; whether/what validity checking is done, including which tag > > numbers are supported)? > > No. Such a document coming from the CBOR WG in the future could be useful. Definitely. (But it sounds like there isn't one in the works yet.) > > We use the expression "simple value" around 30 times, but "simple type > > value" only twice (and "simple type" a few other times); are we happy > > with the consistency of usage? > > We simplified to the common expression in the first case and reworded > the second; we used the opportunity to increase the consistency of the > terminology some more. Thanks! > > Please also note my comments on the IANA considerations in the > > per-section comments; at least the first couple are fairly > > consequential. > > Noted and commented on. > > > I'm pretty sympathetic to the secdir reviewer's desire for guidance on > > how to implement validity checking. I think it would be possible to > > slot this into the existing discussion of validity in §5.3/5.4, possibly as > > an additional subsection reiterating that it's required to check the > > bits in 5.3.1/5.3.2, and the expectation that such checks are likely to > > be incomplete in the face of new tag number allocations. > > Adding such guidance could indeed be useful, but not at this late stage in the > document. It seems like good work for the CBOR WG to adopt in the future. > > > Section 1.2 > > > > Where bit arithmetic or data types are explained, this document uses > > the notation familiar from the programming language C, except that > > > > In recent memory we've asked for some form of reference for "the > > programming language C" (even though the concepts we draw on are likely > > to remain invariant for anything called C). > > We will add references to C that match those from the recently-approved RFC 8746. > Done in <https://github.com/cbor-wg/CBORbis/pull/208>. > > > Section 2 > > > > In the basic (un-extended) generic data model, a data item is one of: > > [...] > > * a sequence of zero or more Unicode code points ("text string") > > > > Hmm, since we use "data item" for both the abstract idea and the > > representation-format version, this description is only precise for the > > abstract version (the representation is further constrained to UTF-8). > > I am not sure whether there is a concise way to accurately express this > > state, though. > > We have been using “encoded data item” for the latter; as we are talking about > the model level, it should be reasonably clear that we are talking about the > conceptual data item here. (There is no great word for that; model-level data > item is maybe another, as clumsy as conceptual.) Okay, let's leave this as-is for want of a better alternative. > > Section 3 > > > > The initial byte and any additional bytes consumed to construct the > > argument are collectively referred to as the "head" of the data item. > > > > side note: Interesting that we define "head" but do not use "tail" :) > > There was no need for us to define tail/body in this spec. > > > Section 3.3 > > > > meaning, as defined in Table 3. Like the major types for integers, > > items of this major type do not carry content data; all the > > information is in the initial bytes. > > > > (editorial) The "head", as it were, right? > > Yes, we will add that as a parenthesis. > > > Section 3.4 > > > > Conceptually, tags are interpreted in the generic data model, not at > > (de-)serialization time. A small number of tags (specifically, tag > > number 25 and tag number 29) have been registered with semantics that > > may require processing at (de-)serialization time: The decoder needs > > > > I suggest adding additional language to reiterate that this is a > > point-in-time statement (and thus that there may be other such tags in > > existence). > > We will replace "specifically" with "at this time" "at the time of this writing" might age better, though I expect the RPC staff to flag this, too. > > This means these tags cannot be implemented on top of every generic > > CBOR encoder/decoder (which might not reflect the serialization order > > for entries in a map at the data model level and vice versa); their > > implementation therefore typically needs to be integrated into the > > generic encoder/decoder. The definition of new tags with this > > property is NOT RECOMMENDED. > > > > So we should give guidance to the DEs for the registry in question to > > that effect? > > The DEs cannot enforce that — they would need a detailed understanding of the > specification (if that is required for the range), and it is only NOT > RECOMMENDED, so there may be a reason to deviate. Under the registration rules, > the DEs could discuss the supposed need to not follow the NOT RECOMMENDED with > the proposal authors. > > > IANA allocated tag numbers 65535, 4294967295, and > > 18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit). > > These can be used as a convenience for implementers that want a > > single integer to indicate either that a specific tag is present, or > > the absence of a tag. That allocation is described in Section 10 of > > > > (editorial) Chasing the reference, I suggest that it is a "single > > integer *data structure*" in the implementation's internal > > representation; just reading this text alone left me confused as to how > > this was intended to be used. > > We will make this "...implementers that want a single integer data structure to > indicate..." > > > > [I-D.bormann-cbor-notable-tags]. These tags are not intended to > > occur in actual CBOR data items; implementations may flag such an > > occurrence as an error. > > > > I could maybe see this as "MAY". > > Good point. We will change "may flag to "MAY flag". > > > Section 3.4.2 > > > > Thank you for mentioning leap seconds! > > > > Note that platform types for date/time may include null or undefined > > values, which may also be desirable at an application protocol level. > > While emitting tag number 1 values with non-finite tag content values > > (e.g., with NaN for undefined date/time values or with Infinite for > > an expiry date that is not set) may seem an obvious way to handle > > this, using untagged null or undefined is often a better solution. > > Application protocol designers are encouraged to consider these cases > > and include clear guidelines for handling them. > > > > It's rather unfortunate that the text here doesn't provide any > > justification for the claim of "better solution" (or reference to such > > justification). > > We will change: > using untagged null or undefined is often a better solution. > To: > using untagged null or undefined avoids the use of non-finites and results > in a shorter encoding. Thanks! > > Section 3.4.3 > > > > occurs in a bignum when using preferred serialization). Note that > > this means the non-preferred choice of a bignum representation > > instead of a basic integer for encoding a number is not intended to > > have application semantics (just as the choice of a longer basic > > integer representation than needed, such as 0x1800 for 0x00 does > > not). > > > > It may be "not intended to", but it does, if you're using a decoder in > > the generic data model. We should be sure to cover the security > > considerations of this disparity (and the corresponding need for an > > application using CBOR to specify the data model it uses). > > We will add text to the Security Considerations section about this disparity. > > > Section 3.4.5.3 > > > > Note that tag numbers 33 and 34 differ from 21 and 22 in that the > > data is transported in base-encoded form for the former and in raw > > byte string form for the latter. > > > > Do we want to mention tag 23 as well (as being the raw byte string)? > > Tag 23 does not have a direct equivalent in the way that tags 33 and 34 have in > tags 21 and 22. Mentioning the lack of a direct equivalent doesn't seem > necessary and may actually cause confusion. > > > Section 4.2.1 > > > > [I did not validate the hex-encoded IEEE754 against the decimal values.] > > > > * Indefinite-length items MUST NOT appear. They can be encoded as > > definite-length items instead. > > > > One could perhaps argue that a deterministic encoding procedure that > > uses indefinite-length items is possible, and even useful in some cases. > > This might argue for moving this requirement to Section 4.2.2's > > list of "additional considerations". That said, an application is not > > obligated to use these core rules and can define its own rules if > > needed, so I don't object to this requirement. > > We believe this requirement should stay with the core requirements. Okay. Thanks for thinking about it. > > Section 4.2.3 > > > > (Although [RFC7049] used the term "Canonical CBOR" for its form of > > requirements on deterministic encoding, this document avoids this > > term because "canonicalization" is often associated with specific > > uses of deterministic encoding only. The terms are essentially > > interchangeable, however, and the set of core requirements in this > > document could also be called "Canonical CBOR", while the length- > > first-ordered version of that could be called "Old Canonical CBOR".) > > > > If this document avoids the term, maybe the final sentence should not be > > present? > > It is not easy to manage the language that implementers will use. With documents > such as RFC 8785, people will continue to use “canonical”, and we want to avoid > a false alternative between “canonical” (old) and “deterministic” (new) because > it might make people choose “canonical” for no good reason. > > > Section 5 > > > > CBOR-based protocols MUST specify how their decoders handle invalid > > and other unexpected data. CBOR-based protocols MAY specify that > > they treat arbitrary valid data as unexpected. Encoders for CBOR- > > based protocols MUST produce only valid items, that is, the protocol > > cannot be designed to make use of invalid items. An encoder can be > > > > Just to check: my interpretation is that CBOR Sequences are compatible > > with this requirement, since they use valid data items and just encode > > them in sequence. Right? > > A CBOR sequence in general (except when it is exactly one data item) is not a > well-formed (and therefore not a valid) data item. However, a CBOR sequence may > be required by the application to be composed of valid data items. > > > Section 5.1 > > > > Other decoders can present partial information about a top-level data > > item to an application, such as the nested data items that could > > already be decoded, or even parts of a byte string that hasn't > > completely arrived yet. > > > > This has potential to make some security types antsy, if coupled with > > encryption mechanisms that release alleged plaintext prior to > > authenticity check. It's not immediately clear that this text needs to > > change, though if it's also not a key point, perhaps it is easier to > > just drop the mention rather than think about it more, though I'd also > > be happy to see discussion of issues with streaming decryption in the > > security considerations section. > > We will will add: > Such an application also MUST have matching streaming security mechanism, > where the desired protection is available for incremental data presented > to the application. (Not that I am complaining, but in some cases the "desired protection" may be nil, making the MUST perhaps debatable vs lowercase must.) > > > Section 5.3 > > > > A CBOR-based protocol MUST specify which of these options its > > decoders take, for each kind of invalid item they might encounter. > > > > Are the lists of types of validity error presented in the following > > subsections exhaustive for the respective data models? If so, it might > > be worth mentioning that explicitly. > > The WG worked hard on the lists of types of validity errors, and it is meant to > be complete, but it would be dangerous to declare they are exhaustive. I can understand that and sympathize! > > Section 5.4 > > > > * It can report an error (and not return data). Note that this > > error is not a validity error per se. This kind of error is more > > likely to be raised by a decoder that would be performing validity > > checking if this were a known case. > > > > (soapbox) Could we maybe be a little less encouraging of this behavior? > > I am remembering horror stories of TLS stacks that did this for > > extension types, which is an interoperability nightmare. I recognize > > that there are cases where it is the desired behavior, but in the > > general case tags are an extensibility point and we shouldn't encourage > > that joint to rust shut. > > We will add a sentence after the first sentence in the bullet point: > Note that treating this case as an error can cause ossification, and is > thus not encouraged. Thanks! > > Section 5.6.1 > > > > As discussed in Section 2.2, specific data models can make values > > equivalent for the purpose of comparing map keys that are distinct in > > the generic data model. Note that this implies that a generic > > decoder may deliver a decoded map to an application that needs to be > > checked for duplicate map keys by that application (alternatively, > > the decoder may provide a programming interface to perform this > > service for the application). Specific data models cannot > > distinguish values for map keys that are equal for this purpose at > > the generic data model level. > > > > This last bit seems like something that is forbidden by the protocol (vs > > "cannot"); I wonder if a slight rewording is in order. > > We will change the sentence to read "Specific data models are not able to > distinguish values..." > > > Section 6.2 > > > > * Numbers with fractional parts are represented as floating-point > > values, performing the decimal-to-binary conversion based on the > > precision provided by IEEE 754 binary64. Then, when encoding in > > > > I forget if this conversion requires round-to-nearest or if multiple > > rounding modes are available (the latter would of course be problematic > > if we proceed on to the "can be represented in smaller float without > > changing value" step). > > We will add "The mathematical value of the JSON number is converted to binary64 > using the roundTiesToEven procedure in Section 4.3.1 of [IEEE754] and update > the reference to point to the 2019 version. Great; thanks. > > Section 8 > > > > The notation borrows the JSON syntax for numbers (integer and > > floating-point), True (>true<), False (>false<), Null (>null<), UTF-8 > > > > (soapbox) Is literal '>' and '<' really the best quoting strategy here > > (and later on)? > > Yes. The actual best quoting would be using a font change, but that will be > confusing for readers of some formats of the eventual RFC. Using quotation marks > would definitely cause confusion in some readers that > and < do not. > > > Section 9.1, 9.2 > > > > What guidance can we give to the experts? > > We can't give any long-term guidance. Saying something like "it is good to > balance the need for conserving the extensibility by keeping code points > available and the need to react to requirements of applications" is possible, > but is not actually useful and would simply become an attractive target for > aggrieved applicants. > > > Section 9.3 > > > > Applications that use this media type: None yet, but it is expected > > that this format will be deployed in protocols and applications. > > > > I don't believe this to be currently accurate. > > Good catch. We will change this to "Many" > > > > Additional information: * Magic number(s): n/a > > > > I guess 0xd9d9f7 doesn't count, then? > > It is far from a universal magic number. > > > Section 9.4 > > > > The CoAP Content-Format for CBOR is defined in > > [IANA.core-parameters]: > > > > Is "defined in" the right way to word this? > > We will change "defined in" to "registered in". > > > Section 10 > > > > I guess the attack where you use indefinite-length encoding to achieve > > total 'n' greater than 2**64 is not really practical at present... > > Correct. > > > Please add a mention of the risks of mixing a constrained decoder with a > > variant (non-preferred-serialization) encoder, as alluded to in Section > > 4.1. > > We will add the following: > > Section 4.1 gives examples of limitations in interoperability when using a > constrained CBOR decoder with input from a CBOR encoder that uses a > non-preferred serialization. When a single data item is consumed both by such a > constrained decoder and a full decoder, it can lead to security issues that can > be exploited by an attacker who can inject or manipulate content. Excellent! > > I also mention this down in G.3, but there seem to be some relevant > > considerations regarding whether/when bignums and integers of the same > > value are considered to be equivalent, in particular that the situation > > is different depending on the data model in use. This could probably > > fit nicely into general discussion of handling the multiple possible > > serializations of various data items. > > We will add the following: > > As discussed throughout this document, there are many values that can be > considered "equivalent" in some circumstances and "not equivalent" in others. As > just one example, the numeric value for the number "one" might be expressed as > an integer or a bignum. A system interpreting CBOR input might accept either > form for the number "one", or might reject one (or both) forms. Such acceptance > or rejection can have security implications in the program that is using the > interpreted input. > > > > I would consider (but am not sure if I would end up adding) a mention > > that CBOR can convey time values, and thus that protocols using CBOR to > > convey time values are likely to rely on a source of accurate time. > > This would only come up when an implementation compared a time value that had > been encoded with CBOR with an actual clock. That situation would be identical > to the time value encoded in any encoding system, and is thus not special to any > encoding system. > > > I might incorporate by reference the RFC 4648 security considerations > > since we talk about base64 in several places. > > Good point; added. > > > Protocols using CBOR text strings will likely have internationalization > > considerations; whether CBOR itself should mention this is not entirely > > clear to me. > > CBOR does not add any security considerations other than those already in UTF8. > We will add a pointer to those. > > > The potential loss of (e.g., type) information when converting from CBOR > > to JSON is probably worth a mention, noting that applications performing > > such conversions should consider whether they are affected and/or it's > > desired to include specific type information in the generated JSON. > > There are multiple ways to do this, which is one thing we could point out. > > We will add: > It is common to convert CBOR data to other formats. In many cases, CBOR > has more expressive types than other formats; this is particularly > true for the common conversion to JSON. The loss of type information can > cause security issues for the systems that are processing the > less-expressive data. > > > numbers may exceed linear effort. Also, some hash-table > > implementations that are used by decoders to build in-memory > > representations of maps can be attacked to spend quadratic effort, > > unless a secret key (see Section 7 of [SIPHASH]) or some other > > mitigation is employed. Such superlinear efforts can be exploited by > > > > It seems likely that an alternate reference not behind a paywall would > > be usable to make this point. > > We will change to use the current reference and add a reference to one that is > apparently maintained by one of the authors: > https://131002.net/siphash/siphash.pdf Thanks; all these security considerations changes look really good in the -15. > > > Section 11.2 > > > > Would not [PCRE] need to be normative (if that functionality remains, > > per the DISCUSS)? > > Please see the discussion under the DISCUSS for this. > > > Appendix A > > > > [I did not verify the examples.] > > > > ATTIC FIFTY STATERS). (Note that all these single-character strings > > could also be represented in native UTF-8 in diagnostic notation, > > just not in an ASCII-only specification like the present one.) In > > > > The present specification is not ASCII-only... > > Good catch. We will remove "like the present one". > > > Appendix C > > > > return 0; // no break out > > > > Should this be 'return mt'? IIUC the return value is a message type > > or -1 for the break code, and errors are indicated out of band via > > fail(). > > This was caught earlier by Stuart Cheshire, and we have a more complete fix > already in the repo (PR #203). > > > void encode_sint(int64_t n) { > > uint64t ui = n >> 63; // extend sign to whole length > > mt = ui & 0x20; // extract major type > > > > If this is supposed to be C, you probably want to declare mt. > > It is labeled as pseudo-code (hard to compile with an ellipsis in there). > But it doesn’t hurt to declare mt. > > This is pseudocode, not actual C. However, we will declare mt in it to make it > clearer. > > > > > Appendix F > > > > * major type 7, additional information 24, value < 32 (incorrect or > > incorrectly encoded simple type) > > > > I see "incorrectly encoded", but I'm not sure I understand what is meant > > by "incorrect simple type". > > We will change this to "(incorrect)". > > > > > Appendix G.3 > > > > integers and floating point values. Experience from implementation > > and use now suggested that the separation between these two number > > domains should be more clearly drawn in the document; language that > > suggested an integer could seamlessly stand in for a floating point > > value was removed. Also, a suggestion (based on I-JSON [RFC7493]) > > > > So instead we have skew between the generic data model and the extended > > model, where the generic model thinks some numers are different that the > > extended model treats as the same. Should we mention that here as well? > > That is not really a change from RFC 7049; clearly identifying the data models > is a new way to discuss CBOR that is used in many places in the current > document. > Okay. Thanks for all the updates; I will go update my ballot position now. -Ben
- [Cbor] Benjamin Kaduk's Discuss on draft-ietf-cbo… Benjamin Kaduk via Datatracker
- Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf… Jim Schaad
- Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf… Benjamin Kaduk
- Re: [Cbor] Benjamin Kaduk's Discuss on draft-ietf… Jim Schaad
- Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draf… Paul Hoffman
- Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draf… Benjamin Kaduk
- Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draf… Paul Hoffman
- Re: [Cbor] [Ext] Benjamin Kaduk's Discuss on draf… Benjamin Kaduk