Re: [Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Carsten Bormann <cabo@tzi.org> Wed, 20 May 2020 14:49 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <2963.1589473899@localhost>
Date: Wed, 20 May 2020 16:49:49 +0200
Cc: cbor@ietf.org
Message-Id: <377E8232-0638-419F-8D79-710F42C2B4E3@tzi.org>
References: <17300.1588779159@localhost> <38BB6FFF-737F-4C11-AD7A-DA3F28A9F570@tzi.org> <CANh-dXkdjMyO=WFUxrF06OfP+RE9v11unKJXL8P3UtEe+prV1w@mail.gmail.com> <13690.1588894939@localhost> <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com> <2963.1589473899@localhost>
To: Michael Richardson <mcr+ietf@sandelman.ca>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/chOdNxz3pZoZ5ArP7jE-mkTTjSY>

Hi Michael,

This is the second version of this reply.
Items that I think we have handled are removed.


> Jeffrey Yasskin <jyasskin@chromium.org> wrote:
>>> I am a user of parsers, I have occasionally had to write my own
>>> conversions, but mostly I would say that I am not up to speed on some
>>> of the details.
> 
>> That's all reasonable. :) Given the discussion, and Lawrence's quick
>> analysis of the existing tags, how do you currently feel about the
>> state of RFC7049bis's requirements around error handling?
> 
> Maybe time for a top-to-bottom read. Probably appropriate thing to do during WGLC :-)
> 
> Intro does say: "It does not create a new version of the format."
> 
> I think that I'd like to amend this to say that:
> 
>  This document is a revised edition of [RFC7049], with editorial
>  improvements, added detail, and fixed errata.  This revision formally
>  obsoletes RFC 7049, while keeping full compatibility of the
>  interchange format from RFC 7049.  It does not create a new version
>  of the format.
> 
> ->
>  This document is a revised edition of [RFC7049], with editorial
>  improvements, added detail, and fixed errata.  In clarifying some
>  interpretations of [RFC7049] it may in some cases create situations where
>  an existing parser may no longer comply to this specification.
>  While this revision formally obsoletes RFC 7049, it does not obsolete
>  any valid encoders, and thus keeps full compatibility with the
>  interchange format from RFC 7049.  It does not create a new version
>  of the format.

I don’t think we want to go to that level of detail right there in the intro.
I’m also not sure 7049 was formally defining “compliance” of decoders.

I think we could move some of this thinking into the “changes from 7049” appendix, the fate of which we haven’t really decided as a WG yet.

> I note this text (section 2.1):
> 
>  While there is a strong expectation that generic encoders and
>  decoders can represent "false", "true", and "null" ("undefined" is
>  intentionally omitted) in the form appropriate for their programming
>  environment, implementation of the data model extensions created by
>  tags is truly optional and a matter of implementation quality.
> 
> This seems to have something to do with tags.

Yes.
Implementation of specific tags was always optional with CBOR; there is no single “required tag”.

> Also section 2.2 ends with:
> 
>  "0.0" as an integer (major type 0, Section 3.1).  However, if a
>  specific data model declares that floating-point and integer
>  representations of integral values are equivalent, using both map
>  keys "0" and "0.0" in a single map would be considered duplicates,
>  even while encoded as different major types, and so invalid; and an
>  encoder could encode integral-valued floats as integers or vice
>  versa, perhaps to save encoded bytes.
> 
> To me, this says that all generic encoders that intend to return data in the
> native form of their programming environment need to be configured as to the
> protocol.  This is supporting my suggestion that a well designed library
> would/could have to be configured for the specific data model when it comes
> to how unknown tags are treated.

I’d say the implementation of the specific data model has to be configured for the library...

>> * I'm not opposed to advice for parsers to have an option to treat a
>> value tagged with an unknown tag as equivalent to the value itself.
> 
>> * I dislike the idea of any generic parser doing that by default, I think
>> based on reasoning like in
>> https://tools.ietf.org/html/draft-iab-protocol-maintenance-04.  * If a
>> parser passes unknown tags up to the application, a higher-level
>> protocol can ignore them itself, skip their data item, return an error,
>> or do something else appropriate to the context. So if the RFC should
>> recommend a default in generic parsers, I'd vote for that one.  * I
>> don't intend to draft this wording myself. :)
> 
> I think that I'm arguing for generic parsers to be in RFC7049 mode,

(I don’t think that is a thing.)

> which
> might mean ignoring unknown tags (if that's what they did before), passing
> the data up using whatever native interpretation there is, until they are
> configured otherwise.

Generic decoders that want to evolve to be usable with applications that need tag support will need to develop a transition strategy.  Isn’t that obvious to any library developer?

> That is, if there is a seconds-since-epoch tag (XXX) which a generic parser
> did not understand, followed by an integer, that it would return an integer
> if it did not have an interface that passed tags.

That generic decoder would not be useful for an application that needs to be able to handle both integers and time tags in the same position.  That doesn’t change.
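
To illustrate the point, here is a minimal sketch (Python; `Tag`, `read_head` and `decode` are made-up names, and only unsigned integers and tags are handled, with no truncation checks) of a generic decoder that presents unknown tags to the application instead of dropping them:

```python
from dataclasses import dataclass


@dataclass
class Tag:
    """A tag the decoder does not interpret: tag number plus tag content."""
    number: int
    content: object


def read_head(data: bytes, pos: int):
    """Return (major type, argument, position after the head)."""
    ib = data[pos]
    major, ai = ib >> 5, ib & 0x1F
    if ai < 24:
        return major, ai, pos + 1
    if ai <= 27:
        size = 1 << (ai - 24)  # 1, 2, 4 or 8 following bytes
        arg = int.from_bytes(data[pos + 1:pos + 1 + size], "big")
        return major, arg, pos + 1 + size
    raise ValueError("reserved or indefinite-length head not handled here")


def decode(data: bytes, pos: int = 0):
    """Decode one data item; return (value, position after the item)."""
    major, arg, pos = read_head(data, pos)
    if major == 0:                      # unsigned integer
        return arg, pos
    if major == 6:                      # tag: keep number and content together
        content, pos = decode(data, pos)
        return Tag(arg, content), pos
    raise NotImplementedError("sketch handles major types 0 and 6 only")


print(decode(bytes.fromhex("c11a514b67b0"))[0])  # tag 1 on an integer
print(decode(bytes.fromhex("1a514b67b0"))[0])    # the bare integer
```

With this shape of interface the application can tell the tagged and the untagged integer apart; a decoder that returned a plain integer in both cases could not.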

[Skipping the 28, 29, 30 part, because I think #186 addressed it.]

> In general, I think that the details in this introductory encoding section
> are too detailed, particularly for 31.  I think that detail belongs later
> on. I got no value (I retained nothing) from having that level of detail there.

I think there was another proposal to move around elements of Section 3.
Sometimes it is necessary to include some detail in an overview for completeness; we can’t really pretend ai=31 does not exist for a few sections and then do a surprise reveal, can we?

> I wonder if section 3.1, under major type 0, should clarify that "0"
> is encoded as 0b000_00000. (That is, there is no negative 0.)

Is this really more about major type 1, and the value -1?
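
A small sketch of the integer encoding may help here (Python; `encode_int` is a name made up for this illustration, and it always picks the shortest head). Major type 1 encodes the value -1 - argument, so argument 0 means -1 and there is no code point left for a negative zero:

```python
def encode_int(n: int) -> bytes:
    """Encode n as a CBOR integer (major type 0 or 1) with the shortest head."""
    if n >= 0:
        major, arg = 0, n
    else:
        major, arg = 1, -1 - n          # -1 maps to argument 0
    if arg < 24:
        return bytes([(major << 5) | arg])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + arg.to_bytes(size, "big")
    raise ValueError("out of range for major types 0 and 1")


print(encode_int(0).hex())   # 0b000_00000
print(encode_int(-1).hex())  # 0b001_00000
```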

>  "A string containing an invalid UTF-8 sequence is well-
>   formed but invalid."
> 
> I think that this might need clarification.

Not for a pedant like me who has already memorized the content of Section 1.2…

> I guess that RFC8742 include sequences of 7049bis CBOR sequences.
> I wonder if Updates 8742 is appropriate.

You still lost me here.
What are 7049 CBOR sequences?

Oh.  We are talking about “Data Streams”.  This probably should mention RFC 8742 (which only happens in 3.1)!

> 
>> If the break stop code appears after a key in a map, in place of that
>> key's value, the map is not well-formed.
> 
> Does this mean that the entire map is not well-formed, or just the
> key/value pair where this occurs?  I take the first meaning, but I want to be
> sure.

It says that the map is not well-formed, but of course the whole data item is dubious as it is not clear whether the map has ended or not.  So, again, give up.  Should be clear in Appendix C.
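
As a rough sketch of "give up" (Python; `check_indefinite_map` is a made-up name, and to stay short it only accepts keys and values that are one-byte integers), a well-formedness check can count the items seen before the break and reject the whole map when the count is odd:

```python
def check_indefinite_map(data: bytes) -> int:
    """Return the number of key/value pairs, or raise if not well-formed."""
    if not data or data[0] != 0xBF:
        raise ValueError("not an indefinite-length map")
    pos, items = 1, 0
    while True:
        if pos >= len(data):
            raise ValueError("not well-formed: missing break")
        ib = data[pos]
        if ib == 0xFF:                   # break stop code
            if items % 2:
                raise ValueError("not well-formed: break in place of a value")
            return items // 2
        if ib >> 5 > 1 or ib & 0x1F >= 24:
            raise NotImplementedError("sketch handles one-byte integers only")
        pos += 1
        items += 1


print(check_indefinite_map(bytes.fromhex("bf0102ff")))  # {_ 1: 2}
```

Feeding it `bf01ff` (a key followed directly by the break) raises instead of returning a partial map.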

> 3.2.3:
>  (Note that zero-length
>  chunks, while not particularly useful, are permitted.)
> 
> they might be useful in non-TCP/IP situations where it is useful to send a
> "keep-alive" on some channel.

We haven’t addressed general padding in CBOR (which would again require a 1.1 or 2.0 maybe), and I would hate to suggest this here as the only padding that CBOR already offers.

> I think that it might be cleaner to swap the order of sections 3.2 (infinite
> length things), and 3.3 (floating-point and stuff).  This just puts major
> type 7 more in context first.
> 
>> As with all other major types, the 5-bit value 24 signifies a single-
>> byte extension: it is followed by an additional byte to represent the
>> simple value.  (To minimize confusion, only the values 32 to 255 are
>> used.)  This maintains the structure of the initial bytes: as for the
> 
> Or, to put it another way, 5-bit Values 24->31  in table 3 are also "Simple
> Values".    

24, yes, 25 to 27 no (floating point), and 28 to 30 are reserved.
(31 is the break stop code and not a value at all.)
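
Spelled out as a sketch (Python; `classify_mt7` is a made-up name), the additional information values of major type 7 fall into these groups:

```python
def classify_mt7(initial_byte: int) -> str:
    """Classify the additional information of a major type 7 initial byte."""
    if initial_byte >> 5 != 7:
        raise ValueError("not major type 7")
    ai = initial_byte & 0x1F
    if ai < 24:
        return "simple value, carried in the initial byte"
    if ai == 24:
        return "simple value, carried in the following byte"
    if ai <= 27:
        return "floating point"
    if ai <= 30:
        return "reserved"
    return "break stop code"


for ib in (0xF5, 0xF8, 0xF9, 0xFB, 0xFC, 0xFF):
    print(hex(ib), classify_mt7(ib))
```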

> Could future Simple Values (such as 0..19) have complex
> structure the way that values 24->27 do?

No, the general syntax of heads does apply to the unallocated code points as well.

> Or to put it another way, can a decoder depend upon unassigned simple values
> having the one-or-two byte structure presented and be able to skip unknown
> values?  

Yes.

> Or does a decoder that encounters undefined values here have to
> fail?

No:

5.2.: Generic encoders and decoders are
  expected to forward simple values and tags even if their specific
  codepoints are not registered at the time the encoder/decoder is
  written (Section 5.4).
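
A sketch of why skipping works (Python; `mt7_length` and `split_items` are made-up names): the length of a major type 7 item follows from the additional information alone, whether or not the simple value is registered.

```python
# Total item length in bytes for additional information 24..27.
ITEM_LENGTH = {24: 2, 25: 3, 26: 5, 27: 9}


def mt7_length(initial_byte: int) -> int:
    """Length of a major type 7 data item, derived from its head alone."""
    if initial_byte >> 5 != 7:
        raise ValueError("not major type 7")
    ai = initial_byte & 0x1F
    if ai < 24:
        return 1
    if ai in ITEM_LENGTH:
        return ITEM_LENGTH[ai]
    raise ValueError("reserved (28..30) or break (31): not a data item")


def split_items(data: bytes) -> list:
    """Split a run of major type 7 items without interpreting them."""
    items, pos = [], 0
    while pos < len(data):
        n = mt7_length(data[pos])
        if pos + n > len(data):
            raise ValueError("truncated item")
        items.append(data[pos:pos + n])
        pos += n
    return items


# simple(0) and simple(200) are unassigned; true and float16 1.0 are not.
print([i.hex() for i in split_items(bytes.fromhex("e0f8c8f5f93c00"))])
```

This sketch only measures lengths; it does not check whether a two-byte simple value is well-formed.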

>> formed.  (This implies that an encoder cannot encode false, true,
>> null, or undefined in two-byte sequences, only the one-byte variants
>> of these are well-formed.)
> 
> I here suggest the text say:
> 
> +   formed.  (This implies that an encoder cannot encode false, true,
> +   null, floats, undefined-23, reserved-[28..31], or break in two-byte
> +   sequences, only the one-byte variants of these are well-formed.)

The implication we give doesn’t have to be exhaustive.  The sentence you propose is very confusing because it is at a different level: head syntax, but that is not the issue here.  The text here is about the choice that the general head syntax seems to leave for simple values 0..23, which we are saying it actually doesn’t.
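
In code, the choice that is being ruled out looks roughly like this (a Python sketch; `decode_simple` is a made-up name):

```python
def decode_simple(data: bytes) -> int:
    """Return the simple value number, rejecting the two-byte form below 32."""
    ib = data[0]
    if ib >> 5 != 7:
        raise ValueError("not major type 7")
    ai = ib & 0x1F
    if ai < 24:
        return ai                        # one-byte form: simple values 0..23
    if ai == 24:
        if len(data) < 2:
            raise ValueError("truncated")
        value = data[1]
        if value < 32:
            raise ValueError("not well-formed: two-byte form of a value below 32")
        return value
    raise ValueError("not a simple value")


print(decode_simple(bytes.fromhex("f5")))    # true, one-byte form
print(decode_simple(bytes.fromhex("f8ff")))  # simple(255)
```

Here `f815` (simple value 21, i.e. true, in the two-byte form) raises.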

> While it's too late to change, was there a reason "True" didn't get 0b111_00001?

Yes.  An earlier version of CBOR used simple values 0..15 for some form of compression.
So we allocated the other simple values at the top, not at the bottom.

> Clearly False, and Null would then compete to be 0b111_00000, and maybe
> that's reason enough to not play such games.

Right.

> 
> Section 3.4 says:
> 
> }   Their primary purpose in this specification is to define common data
> }   types such as dates.  A secondary purpose is to provide conversion
> }   hints when it is foreseen that the CBOR data item needs to be
> }   translated into a different format, requiring hints about the content
> }   of items.
> 
> I don't think that the "primary purpose" is still just dates.

Dates were an example here (and we might soon have a date tag :-).

> The note about "hints" suggests that tags are always advisory,

In the second case — does it really suggest that for the data types?

> and I think
> that this thread has established that for some protocols, they really are
> not.
> 
> }   Understanding the semantics of tags is optional for a
> }   decoder; it can simply present both the tag number and the tag
> }   content to the application, without interpreting the additional
> }   semantics of the tag.
> 
> I wonder if this text should be stronger.
> Maybe:
> 
> +   Understanding the semantics of every tag is optional for a decoder;

each, actually (I prefer not to spell that out).

> +   a decoder MAY simply present some or all tags to the
> +   application, without interpreting the additional
> +   semantics of the tag.

I don’t like interoperability keywords in such places.

> I would then go on:
> +   Decoders which translate CBOR values into language specific objects,
> +   (e.g., dates, bignum, example3, ...) MAY consume the tags along with
> +   the values, returning only the language defined object.
> 
> I note that AFAIK, we do not use tag#24 (Encoded CBOR data item) for the
> signed object, in COSE.  Should we?
> What's the difference between #24 and #55799.

55799 is a tag that can have any CBOR data item as tag content.
24 is a tag that can only be on byte strings.
The byte string then *encodes* another CBOR data item.
(The main use here is to keep the decoder from decoding, to provide easy skip-ability or because we need exact bytes as in COSE.)
As often with tags, there is no need for tag 24 on a byte string when it is clear from context that the byte string contains encoded CBOR; this is the case in COSE.
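
On the wire the difference looks like this (a Python sketch; `head`, `tag_55799` and `tag_24` are made-up names, and the input is assumed to be an already encoded data item):

```python
def head(major: int, arg: int) -> bytes:
    """Shortest CBOR head for the given major type and argument."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + arg.to_bytes(size, "big")
    raise ValueError("argument too large")


def tag_55799(encoded_item: bytes) -> bytes:
    """Tag 55799: the tag content is the data item itself."""
    return head(6, 55799) + encoded_item


def tag_24(encoded_item: bytes) -> bytes:
    """Tag 24: the tag content is a byte string holding the encoded item."""
    return head(6, 24) + head(2, len(encoded_item)) + encoded_item


item = bytes.fromhex("83010203")      # the array [1, 2, 3]
print(tag_55799(item).hex())          # tag head, then the array
print(tag_24(item).hex())             # tag head, byte string head, then bytes
```

A decoder can skip the tag 24 content by its byte string length without decoding what is inside.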

> I guess I will read onwards to find out... Got it.
> 
> BTW: Tag 25 and 29 are called out after Table 5, but are not listed *in* table 5.
> That whole paragraph could use some more periods, and maybe a blank line.
> I'm still at a loss as to why <untagged><null> is better than <epoch><null>.
> 
> Why can't we use decimal fractions, or bigfloats for time?

That may have been a mistake (which is one reason we have tag 1001 now).
The WG has generally taken a dim view on extending the domain (allowable syntax for tag content) for a tag, so we can’t “fix” that.  Note that for the date tag, we have also taken the decision not to reuse one tag for two different tag content syntaxes.

> I suppose float64 has enough precision for a millennium or so, even if one
> wants microseconds precision.

It has 53 bits of precision, so it gives ~ microseconds (~ 20 bits after the decimal point) for 2**33 seconds or 272 years from the epoch or until 2242.  Four microseconds until 3058.
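
The arithmetic, as a sketch (Python; `horizon` is a made-up name, and the year length of 365.2425 days and the 1970 epoch are my assumptions for the conversion):

```python
SECONDS_PER_YEAR = 365.2425 * 86400    # mean Gregorian year
SIGNIFICAND_BITS = 53                  # float64 precision


def horizon(fraction_bits: int):
    """Resolution (s), span (years) and last year for a given fraction size."""
    integer_bits = SIGNIFICAND_BITS - fraction_bits
    resolution = 2.0 ** -fraction_bits
    years = 2 ** integer_bits / SECONDS_PER_YEAR
    return resolution, years, 1970 + int(years)


for bits in (20, 18):
    resolution, years, last_year = horizon(bits)
    print(f"{resolution * 1e6:.2f} us for {years:.0f} years, until {last_year}")
```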

> I think that words "bytewise lexicographic order" used in 4.2.1 may not
> survive translations in a meaningful way.

Deepl turns this into "byteweise lexikographische Ordnung”, na ja.
The next alternative "byteweise lexikographische Reihenfolge” is very close.
No idea about "ordre lexicographique par octet”.
Or "пошаговый лексикографический порядок”, for that matter.
But "字节词序” looks really good :-) (and is completely wrong).

>  The eight item example might be
> clearer if presented in a table so that the bytes can be lined up.
> If keys of a map have tags, I assume that the tags are to be included in the
> lexicographic order?  Maybe 4.2.3 could cover both ways with a forward reference?

Well, 4.2.3 is the old (7049) ordering.
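
For comparison, a sketch of the two orderings applied to encoded keys (Python; the eight keys 10, 100, -1, "z", "aa", [100], [-1] and false are given as hex of their encodings, and the variable names are made up). Tags on a key are simply part of the encoded bytes being compared:

```python
encoded_keys = [bytes.fromhex(h) for h in (
    "f4",      # false
    "626161",  # "aa"
    "8120",    # [-1]
    "0a",      # 10
    "617a",    # "z"
    "811864",  # [100]
    "20",      # -1
    "1864",    # 100
)]

# 4.2.1: plain bytewise lexicographic order of the encoded keys.
bytewise = sorted(encoded_keys)

# 4.2.3 (the 7049 ordering): shorter encodings first, then bytewise.
length_first = sorted(encoded_keys, key=lambda k: (len(k), k))

print([k.hex() for k in bytewise])
print([k.hex() for k in length_first])
```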

> I think that 4.2.2 gets into whether or not a tag is required, and I think
> that it might need to be considered more in context of when tags can be
> skipped.

But it is about deterministic encoding, not about that other discussion…

> I think that the Introduction should have a section 1.3 that addresses
> the concept of "Protocols" on top of CBOR, referencing section 5.
> I think that 4.2.2 should forward reference to 5, or maybe sections 4 and 5.
> I suggest "protocol" be capitalized consistently as Protocol when it is used
> in this way.
> 
> I don't find that section 5.2 fits into section 5.
> I think we already covered this concept.

This section reaffirms the concept developed above.  It also discusses the application interface; have we really covered that here?

> 
>  "0x62c0ae" does not contain valid UTF-8 and so is not a valid CBOR
>  item.  ......
>  Generic encoders and decoders are
>  expected to forward simple values and tags even if their specific
>  codepoints are not registered at the time the encoder/decoder is
>  written (Section 5.4).
> 
>  Generic decoders provide ways to present well-formed CBOR values,
>  both valid and invalid, to an application.  The diagnostic notation
>  (Section 8) may be used to present well-formed CBOR values to humans.
> 
> I don't personally know enough UTF-8 to know why the above is invalid UTF-8.

Because it uses a long form where there is a shorter form - UTF-8 uses deterministic (shortest) encoding.
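
A quick way to see this (a Python sketch; `is_valid_utf8` and `two_byte_code_point` are made-up names, relying on Python's strict built-in UTF-8 decoder): the two content bytes c0 ae, read naively as a two-byte sequence, denote U+002E, which already has the one-byte encoding 2e.

```python
def is_valid_utf8(payload: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def two_byte_code_point(payload: bytes) -> int:
    """Naively read a two-byte sequence (110xxxxx 10xxxxxx) as a code point."""
    return ((payload[0] & 0x1F) << 6) | (payload[1] & 0x3F)


content = bytes.fromhex("c0ae")                 # content of the string 0x62c0ae
print(hex(two_byte_code_point(content)))        # the code point it would denote
print(is_valid_utf8(content), is_valid_utf8(b"\x2e"))
```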

> Maybe saying that it's not because c0ae is an unsigned code point but because...

This is a mild reminder that you have to read up on UTF-8 to understand CBOR.

> (I remember the valid MIME and valid UTF-8 debate we had last time, and I am
> not trying to re-open it.)
> I think that the second paragraph above should be swapped with the first one.

Well, that would expose that the paragraphs are partially redundant …

Now https://github.com/cbor-wg/CBORbis/issues/189

>  1.  Replace the problematic item with an error marker and continue
>      with the next item, or
> 
> -> this might be a place where that desired "invalid tag" from last week's
> discussion fits in!!!

Well, hmm.  The tag is invalid in interchange to allow the implementation to do with it what it wants, so, yes, you are right, but that wasn’t the intention for its use.

> Having read through section 5, I believe even more than two weeks ago, that
> the "65535" tag should go into RFC7049bis, not a new document.

We didn’t want new stuff in 7049bis, so this is now in draft-bormann-cbor-notable-tags with a mention here; RFC7049bis combines with its registries to give the full semantics of CBOR, so this is OK.

> 5.3.1 "Duplicate keys in a map" seems to suggest that "Stream Encoder" will
> be specified/discussed in section 5.6, when in fact that section is about
> keys, including duplicate keys.  Maybe 5.3.1/Basic Validity could go later,
> or just not be said at all, since the entire section is about this topic?

It sure needs to be said (and the term introduced).
The section 5.6 reference needs to be pulled out of the last sentence.

Now https://github.com/cbor-wg/CBORbis/pull/190

> 
> I think that section 7.1 has a lot of aspirational language ("an attempt
> should..."), which might have been appropriate for the ID that led to 7049,
> but SHOULD now be definitive.  "An attempt was made to make..."

Section 7 may need another fine combing, indeed (and now has some comb marks in #187).

Regards, Carsten