[Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Michael Richardson <mcr+ietf@sandelman.ca> Thu, 14 May 2020 16:31 UTC

From: Michael Richardson <mcr+ietf@sandelman.ca>
To: cbor@ietf.org
In-Reply-To: <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com>
References: <17300.1588779159@localhost> <38BB6FFF-737F-4C11-AD7A-DA3F28A9F570@tzi.org> <CANh-dXkdjMyO=WFUxrF06OfP+RE9v11unKJXL8P3UtEe+prV1w@mail.gmail.com> <13690.1588894939@localhost> <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com>
Date: Thu, 14 May 2020 12:31:39 -0400
Message-ID: <2963.1589473899@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/QOk_hrJoF8NcuiorkeXex9mUH4w>
Subject: [Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Carsten and Paul, my apologies for the length of this email.
I've been working on this an hour or so a day for a week :-(
I will probably reply and emphasize my important points.

This document says it is going to Internet Standard, but *only* in the Contributing section.
Please also forgive me that I haven't re-read RFC7049 in a long while, so I
might be complaining about things that are just matching 7049.

Jeffrey Yasskin <jyasskin@chromium.org> wrote:
    >> I am a user of parsers, I have occasionally had to write my own
    >> conversions, but mostly I would say that I am not up to speed on some
    >> of the details.

    > That's all reasonable. :) Given the discussion, and Lawrence's quick
    > analysis of the existing tags, how do you currently feel about the
    > state of RFC7049bis's requirements around error handling?

Maybe time for a top-to-bottom read. Probably the appropriate thing to do during WGLC :-)

Intro does say: "It does not create a new version of the format."

I think that I'd like to amend this to say that:

   This document is a revised edition of [RFC7049], with editorial
   improvements, added detail, and fixed errata.  This revision formally
   obsoletes RFC 7049, while keeping full compatibility of the
   interchange format from RFC 7049.  It does not create a new version
   of the format.

->
   This document is a revised edition of [RFC7049], with editorial
   improvements, added detail, and fixed errata.  In clarifying some
   interpretations of [RFC7049] it may in some cases create situations where
   an existing parser may no longer comply with this specification.
   While this revision formally obsoletes RFC 7049, it does not obsolete
   any valid encoders, and thus keeps full compatibility with the
   interchange format from RFC 7049.  It does not create a new version
   of the format.


I note this text (section 2.1):

   While there is a strong expectation that generic encoders and
   decoders can represent "false", "true", and "null" ("undefined" is
   intentionally omitted) in the form appropriate for their programming
   environment, implementation of the data model extensions created by
   tags is truly optional and a matter of implementation quality.

This seems to have something to do with tags.

Also section 2.2 ends with:

   "0.0" as an integer (major type 0, Section 3.1).  However, if a
   specific data model declares that floating-point and integer
   representations of integral values are equivalent, using both map
   keys "0" and "0.0" in a single map would be considered duplicates,
   even while encoded as different major types, and so invalid; and an
   encoder could encode integral-valued floats as integers or vice
   versa, perhaps to save encoded bytes.

To me, this says that all generic encoders that intend to return data in the
native form of their programming environment need to be configured as to the
protocol.  This is supporting my suggestion that a well designed library
would/could have to be configured for the specific data model when it comes
to how unknown tags are treated.
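
A minimal sketch of why the data model has to be configured, assuming a hypothetical per-protocol switch named `ints_equal_floats` (not taken from the draft or from any library). With the switch on, keys 0 and 0.0 collide; with it off, they are distinct keys.

```python
def find_duplicate_keys(keys, ints_equal_floats):
    """Return the keys that collide under the chosen data model.

    ints_equal_floats is a hypothetical per-protocol switch: when True,
    0 and 0.0 count as the same map key; when False, they are distinct.
    """
    seen = set()
    duplicates = []
    for key in keys:
        if ints_equal_floats:
            # Python already treats 0 and 0.0 as equal, with equal hashes.
            canonical = key
        else:
            # Keep the type in the lookup key so int and float stay apart.
            canonical = (type(key).__name__, key)
        if canonical in seen:
            duplicates.append(key)
        seen.add(canonical)
    return duplicates
```

A decoder that hands maps to the application as native dictionaries makes this choice implicitly, which is the configuration issue raised above.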

    > * I'm not opposed to advice for parsers to have an option to treat a
    > value tagged with an unknown tag as equivalent to the value itself.

    > * I dislike the idea of any generic parser doing that by default, I think
    > based on reasoning like in
    > https://tools.ietf.org/html/draft-iab-protocol-maintenance-04.  * If a
    > parser passes unknown tags up to the application, a higher-level
    > protocol can ignore them itself, skip their data item, return an error,
    > or do something else appropriate to the context. So if the RFC should
    > recommend a default in generic parsers, I'd vote for that one.  * I
    > don't intend to draft this wording myself. :)

I think that I'm arguing for generic parsers to be in RFC7049 mode, which
might mean ignoring unknown tags (if that's what they did before), passing
the data up using whatever native interpretation there is, until they are
configured otherwise.

That is, if there is a seconds-since-epoch tag (XXX) which a generic parser
did not understand, followed by an integer, it would return an integer
if it did not have an interface that passed tags.
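
The behaviour described above can be sketched as follows. This is an illustration only: it handles unsigned integers and tags, uses tag 1 (epoch-based date/time) as the example, and the names `Tagged`, `pass_unknown_tags` and `known_tags` are assumptions rather than part of any existing library.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    tag: int
    value: object


def read_head(data, pos):
    """Decode an initial byte and its argument; return (major, argument, next_pos)."""
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    if info < 24:
        return major, info, pos + 1
    if info > 27:
        raise ValueError(f"additional information {info} not handled in this sketch")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 argument bytes
    end = pos + 1 + size
    if end > len(data):
        raise ValueError("truncated input")
    return major, int.from_bytes(data[pos + 1:end], "big"), end


def decode(data, pos=0, pass_unknown_tags=True, known_tags=frozenset()):
    """Decode major types 0 (unsigned int) and 6 (tag) only."""
    major, arg, pos = read_head(data, pos)
    if major == 0:
        return arg, pos
    if major == 6:
        value, pos = decode(data, pos, pass_unknown_tags, known_tags)
        if arg in known_tags or pass_unknown_tags:
            return Tagged(arg, value), pos
        return value, pos  # unknown tag dropped, bare content returned
    raise ValueError(f"major type {major} not handled in this sketch")
```

For the bytes `c1 1a 51 4b 67 b0`, `decode(data)` yields `Tagged(1, 1363896240)`, while `decode(data, pass_unknown_tags=False)` yields the bare integer.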

>   28, 29, 30:  These values are reserved for future additions to the
>      CBOR format.  In the present version of CBOR, the encoded item is
>      not well-formed.

I think that there is a bug here.
What should a parser written today do when it encounters these values?
(forward reference to section 7.2?)
Getting this right is how we deal with future-proofing.
It seems seeing such a thing means a current decoder has to abort/fail.
What we write here has a profound implication, I think, on how easily we
could act on the advice of section 7.2.  Section 10, first paragraph implies
we should say something.
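
A rough sketch of why a decoder written today has little choice but to fail: the number of argument bytes is derived from the additional information, and for 28 to 30 no length is defined, so the item cannot even be skipped. The function and exception names are illustrative assumptions.

```python
class NotWellFormed(ValueError):
    """Raised when the input is not well-formed CBOR."""


def argument_length(initial_byte):
    """Return how many bytes follow the initial byte to hold its argument.

    Returns None for additional information 31 (indefinite length or break).
    Whether 31 is allowed for a given major type is not checked here.
    """
    info = initial_byte & 0x1F
    if info < 24:
        return 0  # the argument is the additional information itself
    if info <= 27:
        return 1 << (info - 24)  # 1, 2, 4 or 8 bytes
    if info <= 30:
        # Reserved: the decoder cannot know how much input to consume.
        raise NotWellFormed(f"reserved additional information {info}")
    return None
```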

In general, I think that this introductory encoding section is too
detailed, particularly for 31.  I think that detail belongs later
on. I got no value (I retained nothing) from having that level of detail there.

I wonder if section 3.1, under major type 0, should clarify that "0"
is encoded as 0b000_00000. (That is, there is no negative 0.)
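
A small sketch of the integer encoding makes the point concrete: major type 1 carries -1 - n, so argument 0 in major type 1 means -1, and zero has exactly one encoding. The function name `encode_int` is an illustrative choice.

```python
def encode_int(n):
    """Encode a Python int as a CBOR integer (major type 0 or 1), shortest form."""
    if n >= 0:
        major, arg = 0, n
    else:
        major, arg = 1, -1 - n  # -1 maps to argument 0, -2 to argument 1, ...
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < (1 << (8 * size)):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("integer out of range for major types 0 and 1")
```

Here `encode_int(0)` gives the single byte 0x00 and `encode_int(-1)` gives 0x20.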

   "A string containing an invalid UTF-8 sequence is well-
    formed but invalid."

I think that this might need clarification.

I guess that RFC8742 includes sequences of 7049bis CBOR data items.
I wonder if Updates 8742 is appropriate.

>   If the break stop code appears after a key in a map, in place of that
>   key's value, the map is not well-formed.

Does this mean that the entire map is not well-formed, or just the
key/value pair where this occurs?  I take the first meaning, but I want to be
sure.
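
Under the first reading (the whole map is not well-formed), a check might look like the sketch below. It is deliberately restricted to indefinite-length maps whose keys and values are one-byte unsigned integers (0..23); the function name is an assumption.

```python
def check_indefinite_map(data):
    """Return True if data is a well-formed indefinite-length map.

    Only one-byte unsigned integers (0x00..0x17) are accepted as keys and values.
    """
    if not data or data[0] != 0xBF:
        raise ValueError("not an indefinite-length map")
    pos, items = 1, 0
    while pos < len(data):
        byte = data[pos]
        if byte == 0xFF:
            # The break is only acceptable where a key would start,
            # that is, after an even number of items.
            return items % 2 == 0
        if byte > 0x17:
            raise ValueError("item not handled in this sketch")
        items += 1
        pos += 1
    return False  # input ended before the break
```

With this reading, `bf 01 02 ff` is accepted and `bf 01 ff` is rejected as a whole.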

3.2.3:
   (Note that zero-length
   chunks, while not particularly useful, are permitted.)

They might be useful in non-TCP/IP situations where it is useful to send a
"keep-alive" on some channel.

I think that it might be cleaner to swap the order of sections 3.2 (infinite
length things), and 3.3 (floating-point and stuff).  This just puts major
type 7 more in context first.

>   As with all other major types, the 5-bit value 24 signifies a single-
>   byte extension: it is followed by an additional byte to represent the
>   simple value.  (To minimize confusion, only the values 32 to 255 are
>   used.)  This maintains the structure of the initial bytes: as for the

Or, to put it another way, 5-bit values 24->31 in table 3 are also "Simple
Values".  Could future Simple Values (such as 0..19) have complex
structure the way that values 24->27 do?

Or to put it another way, can a decoder depend upon unassigned simple values
having the one-or-two byte structure presented and be able to skip unknown
values?  Or does a decoder that encounters undefined values here have to
fail?

>   formed.  (This implies that an encoder cannot encode false, true,
>   null, or undefined in two-byte sequences, only the one-byte variants
>   of these are well-formed.)

I here suggest the text say:

+   formed.  (This implies that an encoder cannot encode false, true,
+   null, floats, undefined-23, reserved-[28..31], or break in two-byte
+   sequences, only the one-byte variants of these are well-formed.)
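
A sketch of how a decoder could size major type 7 items from the first byte alone, assuming the one-or-two-byte structure for simple values holds for unassigned values too (which is exactly the question raised above). The function name and the returned labels are illustrative.

```python
def classify_major7(data):
    """Classify a major type 7 item; return (kind, total_length_in_bytes)."""
    initial = data[0]
    if initial >> 5 != 7:
        raise ValueError("not major type 7")
    info = initial & 0x1F
    if info < 24:
        return f"simple({info})", 1  # includes false, true, null, undefined
    if info == 24:
        if len(data) < 2:
            raise ValueError("truncated input")
        value = data[1]
        if value < 32:
            # Two-byte forms of 0..31 are not well-formed.
            raise ValueError("two-byte simple value below 32 is not well-formed")
        return f"simple({value})", 2
    if info <= 27:
        return "float", 1 + (2 << (info - 25))  # 3, 5 or 9 bytes in total
    if info == 31:
        return "break", 1
    raise ValueError(f"reserved additional information {info}")
```

With this structure an unassigned value such as simple(19) can be skipped without being understood, whereas `f8 14` (false in two bytes) is rejected.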

While it's too late to change, was there a reason "True" didn't get 0b111_00001?
Clearly False, and Null would then compete to be 0b111_00000, and maybe
that's reason enough to not play such games.

Section 3.4 says:

}   Their primary purpose in this specification is to define common data
}   types such as dates.  A secondary purpose is to provide conversion
}   hints when it is foreseen that the CBOR data item needs to be
}   translated into a different format, requiring hints about the content
}   of items.

I don't think that the "primary purpose" is still just dates.
The note about "hints" suggests that tags are always advisory, and I think
that this thread has established that for some protocols, they really are
not.

}   Understanding the semantics of tags is optional for a
}   decoder; it can simply present both the tag number and the tag
}   content to the application, without interpreting the additional
}   semantics of the tag.

I wonder if this text should be stronger.
Maybe:

+   Understanding the semantics of every tag is optional for a decoder;
+   a decoder MAY simply present some or all tags to the
+   application, without interpreting the additional
+   semantics of the tag.

I would then go on:
+   Decoders which translate CBOR values into language specific objects,
+   (e.g., dates, bignum, example3, ...) MAY consume the tags along with
+   the values, returning only the language defined object.
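
The two proposed paragraphs could be sketched as a converter registry: known tags are consumed and turned into native objects, unknown ones are presented as tag number plus content. `TAG_CONVERTERS` and `present` are hypothetical names, and only tag 1 (epoch-based date/time) is registered.

```python
from datetime import datetime, timezone

# Hypothetical registry: tag number -> converter from tag content to a native object.
TAG_CONVERTERS = {
    1: lambda seconds: datetime.fromtimestamp(seconds, tz=timezone.utc),
}


def present(tag, content, converters=TAG_CONVERTERS):
    """Consume a known tag and return a native object; pass unknown tags through."""
    convert = converters.get(tag)
    if convert is None:
        # Not interpreted: the application sees both tag number and content.
        return (tag, content)
    return convert(content)
```

For example, `present(1, 1363896240)` returns the UTC datetime 2013-03-21 20:04:00, and `present(65000, 5)` returns `(65000, 5)` unchanged.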

I note that AFAIK, we do not use tag#24 (Encoded CBOR data item) for the
signed object, in COSE.  Should we?
What's the difference between #24 and #55799?
I guess I will read onwards to find out... Got it.

BTW: Tag 25 and 29 are called out after Table 5, but are not listed *in* table 5.
That whole paragraph could use some more periods, and maybe a blank line.
I'm still at a loss as to why <untagged><null> is better than <epoch><null>.

Why can't we use decimal fractions, or bigfloats for time?
I suppose float64 has enough precision for a millennium or so, even if one
wants microsecond precision.
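
A rough check of that estimate, using `math.ulp` (Python 3.9 or later) to measure the spacing between adjacent float64 values. The helper name is illustrative. The spacing stays below one microsecond only for timestamps under 2^33 seconds, which is roughly 272 years from the epoch rather than a millennium.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86400


def float64_step(seconds_since_epoch):
    """Spacing between adjacent float64 values near the given timestamp, in seconds."""
    return math.ulp(float(seconds_since_epoch))


# Below 2**33 seconds the spacing is at most 2**-20 s (just under 1 microsecond);
# from 2**33 seconds on it doubles to 2**-19 s (about 1.9 microseconds).
limit_seconds = 2.0 ** 33
limit_years = limit_seconds / SECONDS_PER_YEAR
```

Near the date of this message (about 1.59e9 seconds) the spacing is 2^-22 seconds, a little under a quarter of a microsecond.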

I think that the words "bytewise lexicographic order" used in 4.2.1 may not
survive translations in a meaningful way.  The eight item example might be
clearer if presented in a table so that the bytes can be lined up.
If keys of a map have tags, I assume that the tags are to be included in the
lexicographic ordering?  Maybe 4.2.3 could cover both ways with a forward reference?
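
A short sketch of what the ordering amounts to, with the bytes lined up one pair per column. The eight keys and their encodings follow the standard CBOR encoding rules; the helper names are illustrative.

```python
# (label, hex encoding) pairs, deliberately listed out of order.
KEYS = [
    ("false", "f4"),
    ('"aa"', "626161"),
    ("-1", "20"),
    ("[100]", "811864"),
    ("10", "0a"),
    ('"z"', "617a"),
    ("100", "1864"),
    ("[-1]", "8120"),
]


def sort_bytewise(keys):
    """Sort by comparing the encoded bytes left to right, as unsigned values."""
    return sorted(keys, key=lambda item: bytes.fromhex(item[1]))


def format_row(label, encoded):
    """Render one table row with the encoded bytes separated for alignment."""
    pairs = " ".join(encoded[i:i + 2] for i in range(0, len(encoded), 2))
    return f"{label:<6} {pairs}"


for label, encoded in sort_bytewise(KEYS):
    print(format_row(label, encoded))
```

The resulting order is 10, 100, -1, "z", "aa", [100], [-1], false. Because the comparison runs over the whole encoded key, the bytes of a tag on a key would take part in it, under the assumption stated above.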

I think that 4.2.2 gets into whether or not a tag is required, and I think
that it might need to be considered more in context of when tags can be
skipped.

I think that the Introduction should have a section 1.3 that addresses
the concept of "Protocols" on top of CBOR, referencing section 5.
I think that 4.2.2 should forward reference to 5, or maybe sections 4 and 5
I suggest "protocol" be capitalized consistently as Protocol when it is used
in this way.

I don't find that section 5.2 fits into section 5.
I think we already covered this concept.

   "0x62c0ae" does not contain valid UTF-8 and so is not a valid CBOR
   item.  ......
   Generic encoders and decoders are
   expected to forward simple values and tags even if their specific
   codepoints are not registered at the time the encoder/decoder is
   written (Section 5.4).

   Generic decoders provide ways to present well-formed CBOR values,
   both valid and invalid, to an application.  The diagnostic notation
   (Section 8) may be used to present well-formed CBOR values to humans.

I don't personally know enough UTF-8 to know why the above is invalid UTF-8.
Maybe saying that it's not because c0ae is an unassigned code point but because...
(I remember the valid MIME and valid UTF-8 debate we had last time, and I am
not trying to re-open it.)
I think that the second paragraph above should be swapped with the first one.
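
For reference, 0x62 is the head of a two-byte text string, and the payload c0 ae is an overlong two-byte encoding of U+002E ("."), which UTF-8 forbids; the byte 0xC0 never appears in valid UTF-8. A minimal sketch that shows this with Python's strict decoder, using illustrative helper names:

```python
def text_string_payload(item):
    """Return the payload of a short definite-length text string (length below 24)."""
    initial = item[0]
    if initial >> 5 != 3 or (initial & 0x1F) >= 24:
        raise ValueError("not a short text string")
    length = initial & 0x1F
    payload = item[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated input")
    return payload


def is_valid_utf8(payload):
    """Report whether the bytes decode as strict UTF-8."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The item 0x62c0ae is therefore well-formed (the length matches) but invalid (the payload is not UTF-8).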

   1.  Replace the problematic item with an error marker and continue
       with the next item, or

-> this might be a place where that desired "invalid tag" from last week's
discussion fits in!!!

Having read through section 5, I believe even more than two weeks ago, that
the "65535" tag should go into RFC7049bis, not a new document.

5.3.1 "Duplicate keys in a map" seems to suggest that "Stream Encoder" will
be specified/discussed in section 5.6, when in fact that section is about
keys, including duplicate keys.  Maybe 5.3.1/Basic Validity could go later,
or just not be said at all, since the entire section is about this topic?

I think that section 7.1 has a lot of aspirational language ("an attempt
should..."), which might have been appropriate for the ID that led to 7049,
but SHOULD now be definitive.  "An attempt was made to make..."




--
Michael Richardson <mcr+IETF@sandelman.ca>, Sandelman Software Works
 -= IPv6 IoT consulting =-