Re: [Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Carsten Bormann <cabo@tzi.org> Fri, 15 May 2020 23:28 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <2963.1589473899@localhost>
Date: Sat, 16 May 2020 01:27:50 +0200
Cc: cbor@ietf.org
Message-Id: <BC0EC9BE-4202-4EED-A619-CDEB9BF312CE@tzi.org>
References: <17300.1588779159@localhost> <38BB6FFF-737F-4C11-AD7A-DA3F28A9F570@tzi.org> <CANh-dXkdjMyO=WFUxrF06OfP+RE9v11unKJXL8P3UtEe+prV1w@mail.gmail.com> <13690.1588894939@localhost> <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com> <2963.1589473899@localhost>
To: Michael Richardson <mcr+ietf@sandelman.ca>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/le5d-Au4iysI7pr0-F1dEUVm00c>

Hi Michael,

thank you for taking the time to dive deep!

I’m going to send multiple versions of this reply, with more and more answers filled in; I hope this is not too jarring (you can simply ignore the initial versions if it is).  This is version 1 of my response.

> This document says it is going to Internet Standard, but *only* in the Contributing section.

Fellow https://datatracker.ietf.org/doc/draft-ietf-cose-rfc8152bis-struct/?include_text=1
does not say it at all…
(Well, datatracker knows it, but the document is silent.)

> Please also forgive me that I haven't re-read RFC7049 in a long while, so I
> might be complaining about things that are just matching 7049.

We want to improve on 7049, so if there is a reason to complain, please do!

> Jeffrey Yasskin <jyasskin@chromium.org> wrote:
>>> I am a user of parsers, I have occasionally had to write my own
>>> conversions, but mostly I would say that I am not up to speed on some
>>> of the details.
> 
>> That's all reasonable. :) Given the discussion, and Lawrence's quick
>> analysis of the existing tags, how do you currently feel about the
>> state of RFC7049bis's requirements around error handling?
> 
> Maybe time for a top-to-bottom read. Probably an appropriate thing to do during WGLC :-)
> 
> Intro does say: "It does not create a new version of the format."
> 
> I think that I'd like to amend this to say that:
> 
>   This document is a revised edition of [RFC7049], with editorial
>   improvements, added detail, and fixed errata.  This revision formally
>   obsoletes RFC 7049, while keeping full compatibility of the
>   interchange format from RFC 7049.  It does not create a new version
>   of the format.
> 
> ->
>   This document is a revised edition of [RFC7049], with editorial
>   improvements, added detail, and fixed errata.  In clarifying some
>   interpretations of [RFC7049] it may in some cases create situations where
>   an existing parser may no longer comply to this specification.
>   While this revision formally obsoletes RFC 7049, it does not obsolete
>   any valid encoders, and thus keeps full compatibility with the
>   interchange format from RFC 7049.  It does not create a new version
>   of the format.
> 
> 
> I note this text (section 2.1):
> 
>   While there is a strong expectation that generic encoders and
>   decoders can represent "false", "true", and "null" ("undefined" is
>   intentionally omitted) in the form appropriate for their programming
>   environment, implementation of the data model extensions created by
>   tags is truly optional and a matter of implementation quality.
> 
> This seems to have something to do with tags.
> 
> Also section 2.2 ends with:
> 
>   "0.0" as an integer (major type 0, Section 3.1).  However, if a
>   specific data model declares that floating-point and integer
>   representations of integral values are equivalent, using both map
>   keys "0" and "0.0" in a single map would be considered duplicates,
>   even while encoded as different major types, and so invalid; and an
>   encoder could encode integral-valued floats as integers or vice
>   versa, perhaps to save encoded bytes.
> 
> To me, this says that all generic encoders that intend to return data in the
> native form of their programming environment need to be configured as to the
> protocol.  This is supporting my suggestion that a well designed library
> would/could have to be configured for the specific data model when it comes
> to how unknown tags are treated.
> 
>> * I'm not opposed to advice for parsers to have an option to treat a
>> value tagged with an unknown tag as equivalent to the value itself.
> 
>> * I dislike the idea of any generic parser doing that by default, I think
>> based on reasoning like in
>> https://tools.ietf.org/html/draft-iab-protocol-maintenance-04.  * If a
>> parser passes unknown tags up to the application, a higher-level
>> protocol can ignore them itself, skip their data item, return an error,
>> or do something else appropriate to the context. So if the RFC should
>> recommend a default in generic parsers, I'd vote for that one.  * I
>> don't intend to draft this wording myself. :)
> 
> I think that I'm arguing for generic parsers to be in RFC7049 mode, which
> might mean ignoring unknown tags (if that's what they did before), passing
> the data up using whatever native interpretation there is, until they are
> configured otherwise.
> 
> That is, if there is a seconds-since-epoch tag (XXX) which a generic parser
> did not understand, followed by an integer, that it would return an integer
> if it did not have an interface that passed tags.
> 
>>  28, 29, 30:  These values are reserved for future additions to the
>>     CBOR format.  In the present version of CBOR, the encoded item is
>>     not well-formed.
> 
> I think that there is a bug here.
> What should a parser written today do when it encounters these values?
> (forward reference to section 7.2?)

Give up (not well-formed), as there is no way to know how big a head with these additional information (ai) values is.
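
A minimal sketch of that rule, assuming a hypothetical helper named read_head (not taken from any real CBOR library): additional information values 24 to 27 state how many bytes follow the initial byte, while 28 to 30 state nothing, so the decoder cannot even find the start of the next item and has to stop.

```python
def read_head(data: bytes, pos: int = 0):
    """Return (major_type, ai, argument, next_pos) for the head at pos.

    argument is None for ai 31 (indefinite length or break); this sketch
    does not check which major types actually permit ai 31.
    Raises ValueError when the head is not well-formed.
    """
    if pos >= len(data):
        raise ValueError("not well-formed: no initial byte")
    initial = data[pos]
    major, ai = initial >> 5, initial & 0x1F
    if ai < 24:
        # The argument is the additional information itself.
        return major, ai, ai, pos + 1
    if ai <= 27:
        nbytes = 1 << (ai - 24)  # 1, 2, 4, or 8 following bytes
        end = pos + 1 + nbytes
        if end > len(data):
            raise ValueError("not well-formed: truncated head")
        return major, ai, int.from_bytes(data[pos + 1:end], "big"), end
    if ai <= 30:
        # Reserved: the size of such a head is unknown, so decoding stops.
        raise ValueError(f"not well-formed: reserved additional information {ai}")
    return major, ai, None, pos + 1  # ai == 31
```

For instance, read_head(b"\x18\x64") yields the unsigned integer 100 and a next position of 2, while read_head(b"\x1c") (major type 0, ai 28) raises.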

> Getting this right is how we deal with future-proofing.

We do have extension points that have full compatibility; this potential one just doesn’t.

> It seems seeing such a thing means a current decoder has to abort/fail.
> What we write here has a profound implication, I think, on how easily we
> could act on the advice of section 7.2.  

That would be a CBOR 1.1 (or 2.0); it will not be easy on existing decoders!

> Section 10, first paragraph implies
> we should say something.
> 
> In general, I think that the details in this introductory encoding section
> are too detailed, particularly for 31.  I think that detail belongs later
> on. I got no value (I retained nothing) from having that level of detail there.
> 
> I wonder if section 3.1, under major type 0, should clarify that "0"
> is encoded as 0b000_00000. (That is, there is no negative 0.)
> 
>   "A string containing an invalid UTF-8 sequence is well-
>    formed but invalid."
> 
> I think that this might need clarification.
> 
> I guess that RFC8742 include sequences of 7049bis CBOR sequences.
> I wonder if Updates 8742 is appropriate.

You lost me here.

> 
>>  If the break stop code appears after a key in a map, in place of that
>>  key's value, the map is not well-formed.
> 
> Does this mean that the entire map is not well-formed, or just the
> key/value pair where this occurs?  I take the first meaning, but I want to be
> sure.

It says that the map is not well-formed, but of course the whole data item is dubious as it is not clear whether the map has ended or not.  So, again, give up.  Should be clear in Appendix C.
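
To illustrate the "give up" behaviour, here is a deliberately tiny sketch that decodes only indefinite-length maps whose keys and values are unsigned integers 0..23. The function names are hypothetical, and a real decoder would handle all item types; the point is only where the break stop code (0xff) is accepted and where it is an error.

```python
BREAK = 0xFF

def small_uint(byte: int) -> int:
    """Decode a one-byte unsigned integer 0..23 (all this sketch supports)."""
    if byte > 0x17:
        raise ValueError("item type outside the scope of this sketch")
    return byte

def decode_indefinite_map(data: bytes) -> dict:
    """Decode 0xbf <key> <value> ... 0xff with small unsigned ints only."""
    if not data or data[0] != 0xBF:
        raise ValueError("expected an indefinite-length map (0xbf)")
    result = {}
    pos = 1
    while True:
        if pos >= len(data):
            raise ValueError("not well-formed: missing break")
        if data[pos] == BREAK:
            return result  # break in key position: the map ends normally
        key = small_uint(data[pos])
        pos += 1
        if pos >= len(data):
            raise ValueError("not well-formed: missing value")
        if data[pos] == BREAK:
            # Break where a value belongs: the whole item is rejected.
            raise ValueError("not well-formed: break in place of a value")
        result[key] = small_uint(data[pos])
        pos += 1
```

With this sketch, bytes.fromhex("bf0102ff") decodes to {1: 2}, whereas bytes.fromhex("bf010203ff") is rejected as a whole rather than returning a partial map.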

> 3.2.3:
>   (Note that zero-length
>   chunks, while not particularly useful, are permitted.)
> 
> they might be useful in non-TCP/IP situations where it is useful to send a
> "keep-alive" on some channel.

We haven’t addressed general padding in CBOR (which would again require a 1.1 or 2.0 maybe), and I would hate to suggest this here as the only padding that CBOR already offers.

> I think that it might be cleaner to swap the order of sections 3.2 (infinite
> length things), and 3.3 (floating-point and stuff).  This just puts major
> type 7 more in context first.
> 
>>  As with all other major types, the 5-bit value 24 signifies a single-
>>  byte extension: it is followed by an additional byte to represent the
>>  simple value.  (To minimize confusion, only the values 32 to 255 are
>>  used.)  This maintains the structure of the initial bytes: as for the
> 
> Or, to put it another way, 5-bit Values 24->31  in table 3 are also "Simple
> Values".    

24, yes, 25 to 27 no (floating point), and 28 to 30 are reserved.
(31 is the break stop code and not a value at all.)

> Could future Simple Values (such as 0..19) have complex
> structure the way that values 24->27 do?

No, the general syntax of heads does apply to the unallocated code points as well.

> Or to put it another way, can a decoder depend upon unassigned simple values
> having the one-or-two byte structure presented and be able to skip unknown
> values?  

Yes.

> Or does a decoder that encounters undefined values here have to
> fail?

No:

5.2.: Generic encoders and decoders are
   expected to forward simple values and tags even if their specific
   codepoints are not registered at the time the encoder/decoder is
   written (Section 5.4).

>>  formed.  (This implies that an encoder cannot encode false, true,
>>  null, or undefined in two-byte sequences, only the one-byte variants
>>  of these are well-formed.)
> 
> I here suggest the text say:
> 
> +   formed.  (This implies that an encoder cannot encode false, true,
> +   null, floats, undefined-23, reserved-[28..31], or break in two-byte
> +   sequences, only the one-byte variants of these are well-formed.)

The implication we give doesn’t have to be exhaustive.  The sentence you propose is very confusing because it is at a different level: head syntax, but that is not the issue here.  The text here is about the choice that the general head syntax seems to leave for simple values 0..23, which we are saying it actually doesn’t.
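
As a rough illustration of that point (encode_simple and decode_simple are hypothetical names, not from a real library): simple values 0..23 have exactly one encoding, the one-byte form, and the two-byte form starting with 0xf8 is only well-formed for 32..255.

```python
def encode_simple(value: int) -> bytes:
    """Encode a simple value (major type 7) in its only well-formed form."""
    if 0 <= value <= 23:
        return bytes([0xE0 | value])   # one byte: 0b111_vvvvv
    if 32 <= value <= 255:
        return bytes([0xF8, value])    # two bytes: 0xf8 followed by the value
    raise ValueError("simple values 24..31 cannot be encoded")

def decode_simple(data: bytes) -> int:
    """Decode a simple value, rejecting the two-byte form for values < 32."""
    if not data or data[0] >> 5 != 7:
        raise ValueError("not a major type 7 item")
    ai = data[0] & 0x1F
    if ai < 24:
        return ai
    if ai == 24:
        if len(data) < 2:
            raise ValueError("not well-formed: truncated simple value")
        if data[1] < 32:
            raise ValueError("not well-formed: two-byte form of a simple value < 32")
        return data[1]
    raise ValueError("not a simple value (float, reserved, or break)")
```

So false (simple value 20) is always the single byte 0xf4; the sequence 0xf8 0x14 is rejected rather than being read as an alternative spelling of false.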

> While it's too late to change, was there a reason "True" didn't get 0b111_00001?

Yes.  An earlier version of CBOR used simple values 0..15 for some form of compression.
So we allocated the other simple values at the top, not at the bottom.

> Clearly False, and Null would then compete to be 0b111_00000, and maybe
> that's reason enough to not play such games.

Right.

> 
> Section 3.4 says:
> 
> }   Their primary purpose in this specification is to define common data
> }   types such as dates.  A secondary purpose is to provide conversion
> }   hints when it is foreseen that the CBOR data item needs to be
> }   translated into a different format, requiring hints about the content
> }   of items.
> 
> I don't think that the "primary purpose" is still just dates.
> The note about "hints" suggests that tags are always advisory, and I think
> that this thread has established that for some protocols, they really are
> not.
> 
> }   Understanding the semantics of tags is optional for a
> }   decoder; it can simply present both the tag number and the tag
> }   content to the application, without interpreting the additional
> }   semantics of the tag.
> 
> I wonder if this text should be stronger.
> Maybe:
> 
> +   Understanding the semantics of every tag is optional for a decoder;
> +   a decoder MAY simply present some or all tags to the
> +   application, without interpreting the additional
> +   semantics of the tag.
> 
> I would then go on:
> +   Decoders which translate CBOR values into language specific objects,
> +   (e.g., dates, bignum, example3, ...) MAY consume the tags along with
> +   the values, returning only the language defined object.
> 
> I note that AFAIK, we do not use tag#24 (Encoded CBOR data item) for the
> signed object, in COSE.  Should we?
> What's the difference between #24 and #55799.

55799 is a tag that can have any CBOR data item as tag content;
24 is a tag that can only be on byte strings.
The byte string then *encodes* another CBOR data item.
(The main use here is to keep the decoder from decoding, to provide easy skip-ability or because we need exact bytes as in COSE.)
As often with tags, there is no need for tag 24 on a byte string when it is clear from context that the byte string contains encoded CBOR; this is the case in COSE.
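
The difference shows up directly in the bytes. The following sketch builds both forms by hand for the array [1, 2, 3]; the helper byte_string and the constant names are made up for this illustration, and only byte strings up to 65535 bytes are handled.

```python
SELF_DESCRIBED = bytes.fromhex("d9d9f7")  # head of tag 55799
ENCODED_CBOR = bytes.fromhex("d818")      # head of tag 24

def byte_string(payload: bytes) -> bytes:
    """Encode a definite-length byte string (major type 2)."""
    n = len(payload)
    if n < 24:
        head = bytes([0x40 | n])
    elif n < 256:
        head = bytes([0x58, n])
    elif n < 65536:
        head = bytes([0x59]) + n.to_bytes(2, "big")
    else:
        raise ValueError("length outside the scope of this sketch")
    return head + payload

inner = bytes.fromhex("83010203")  # the array [1, 2, 3]

# Tag 55799: the tag content is the data item itself.
self_described = SELF_DESCRIBED + inner

# Tag 24: the tag content is a byte string that *contains* the encoded item.
embedded = ENCODED_CBOR + byte_string(inner)

print(self_described.hex())  # d9d9f783010203
print(embedded.hex())        # d8184483010203
```

A decoder that does not want to look inside the tag 24 item can skip it by reading the byte string length (4 here) without parsing the array.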

> I guess I will read onwards to find out... Got it.
> 
> BTW: Tag 25 and 29 are called out after Table 5, but are not listed *in* table 5.
> That whole paragraph could use some more periods, and maybe a blank line.
> I'm still at a loss as to why <untagged><null> is better than <epoch><null>.
> 
> Why can't we use decimal fractions, or bigfloats for time?

That may have been a mistake (which is one reason we have tag 1001 now).
The WG has generally taken a dim view of extending the domain (allowable syntax for tag content) for a tag, so we can’t “fix” that; note that for the date tag, we have taken the decision not to reuse one tag for two different tag content syntaxes either.

> I suppose float64 has enough precision for a millennium or so, even if one
> wants microseconds precision.

It has 53 bits of precision, so it gives ~ microseconds (~ 20 bits after the decimal point) for 2**33 seconds or 272 years from the epoch or until 2242.  Four microseconds until 3058.
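
The arithmetic can be checked with a few lines; this is just a back-of-the-envelope sketch, and the average Gregorian year length used for the conversion is an assumption of the example.

```python
SIGNIFICAND_BITS = 53                 # float64 precision, including the hidden bit
SECONDS_PER_YEAR = 365.2425 * 86400   # assumption: average Gregorian year

def horizon(fraction_bits: int):
    """Return (resolution in seconds, years covered, last full calendar year)."""
    integer_bits = SIGNIFICAND_BITS - fraction_bits
    seconds = 2 ** integer_bits        # range of whole seconds that still fits
    resolution = 2.0 ** -fraction_bits # size of the smallest step in that range
    years = seconds / SECONDS_PER_YEAR
    return resolution, years, 1970 + int(years)

for bits in (20, 18):
    resolution, years, until = horizon(bits)
    print(f"{bits} fraction bits: {resolution * 1e6:.2f} us for {years:.0f} years, until {until}")
```

With 20 fraction bits the step is just under one microsecond and the range ends in 2242; giving up two bits (steps of just under four microseconds) extends it to 3058.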

> I think that words "bytewise lexicographic order" used in 4.2.1 may not
> survive translations in a meaningful way.  The eight item example might be
> clearer if presented in a table so that the bytes can be lined up.
> If keys of a map have tags, I assume that the tags are to be included in the
> lexicographic order?  Maybe 4.2.3 could cover both ways with a forward reference?
> 
> I think that 4.2.2 gets into whether or not a tag is required, and I think
> that it might need to be considered more in context of when tags can be
> skipped.
> 
> I think that the Introduction should have a section 1.3 that addresses
> the concept of "Protocols" on top of CBOR, referencing section 5.
> I think that 4.2.2 should forward reference to 5, or maybe sections 4 and 5
> I suggest "protocol" be capitalized consistently as Protocol when it is used
> in this way.
> 
> I don't find that section 5.2 fits into section 5.
> I think we already covered this concept.
> 
>   "0x62c0ae" does not contain valid UTF-8 and so is not a valid CBOR
>   item.  ......
>   Generic encoders and decoders are
>   expected to forward simple values and tags even if their specific
>   codepoints are not registered at the time the encoder/decoder is
>   written (Section 5.4).
> 
>   Generic decoders provide ways to present well-formed CBOR values,
>   both valid and invalid, to an application.  The diagnostic notation
>   (Section 8) may be used to present well-formed CBOR values to humans.
> 
> I don't personally know enough UTF-8 to know why the above is invalid UTF-8.
> Maybe saying that it's not because c0ae is an unassigned code point but because...
> (I remember the valid MIME and valid UTF-8 debate we had last time, and I am
> not trying to re-open it.)
> I think that the second paragraph above should be swapped with the first one.
> 
>   1.  Replace the problematic item with an error marker and continue
>       with the next item, or
> 
> -> this might be a place where that desired "invalid tag" from last week's
> discussion fits in!!!
> 
> Having read through section 5, I believe even more than two weeks ago, that
> the "65535" tag should go into RFC7049bis, not a new document.

We didn’t want new stuff in 7049bis, so this is now in draft-bormann-cbor-notable-tags with a mention here; RFC7049bis combines with its registries to give the full semantics of CBOR, so this is OK.

> 5.3.1 "Duplicate keys in a map" seems to suggest that "Stream Encoder" will
> be specified/discussed in section 5.6, when in fact that section is about
> keys, including duplicate keys.  Maybe 5.3.1/Basic Validity could go later,
> or just not be said at all, since the entire section is about this topic?
> 
> I think that section 7.1 has a lot of aspirational language ("an attempt
> should..."), which might have been appropriate for the ID that led to 7049,
> but SHOULD now be definitive.  "An attempt was made to make..."

Section 7 may need another fine combing, indeed.

Grüße, Carsten
