Re: [Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Michael Richardson <mcr+ietf@sandelman.ca> Thu, 21 May 2020 23:10 UTC

From: Michael Richardson <mcr+ietf@sandelman.ca>
To: Carsten Bormann <cabo@tzi.org>
cc: cbor@ietf.org
In-reply-to: <377E8232-0638-419F-8D79-710F42C2B4E3@tzi.org>
References: <17300.1588779159@localhost> <38BB6FFF-737F-4C11-AD7A-DA3F28A9F570@tzi.org> <CANh-dXkdjMyO=WFUxrF06OfP+RE9v11unKJXL8P3UtEe+prV1w@mail.gmail.com> <13690.1588894939@localhost> <CANh-dXmjD=RCwh7ExjSvFx+5ciew+eqHoVS88OommQ2xVnX5=Q@mail.gmail.com> <2963.1589473899@localhost> <377E8232-0638-419F-8D79-710F42C2B4E3@tzi.org>
Comments: In-reply-to Carsten Bormann <cabo@tzi.org> message dated "Wed, 20 May 2020 16:49:49 +0200."
X-Mailer: MH-E 8.6; nmh 1.7+dev; GNU Emacs 25.2.1
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg="pgp-sha512"; protocol="application/pgp-signature"
Date: Thu, 21 May 2020 19:10:44 -0400
Message-ID: <4347.1590102644@dooku>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/ygLIwBnsuy6wvkZxE9TENYtrECE>
Subject: Re: [Cbor] my (WGLC re-)views on error processing in RFC7049bis and future-proofing

Carsten Bormann <cabo@tzi.org> wrote:
    >> This document is a revised edition of [RFC7049], with editorial
    >> improvements, added detail, and fixed errata.  In clarifying some
    >> interpretations of [RFC7049] it may in some cases create situations
    >> where an existing parser may no longer comply to this specification.
    >> While this revision formally obsoletes RFC 7049, it does not obsolete
    >> any valid encoders, and thus keeps full compatibility with the
    >> interchange format from RFC 7049.  It does not create a new version of
    >> the format.

    > I don’t think we want to go to that level of detail right there in the
    > intro.  I’m also not sure 7049 was formally defining “compliance” of
    > decoders.

    > I think we could move some of this thinking into the “changes from
    > 7049” appendix, the fate of which we haven’t really decided as a WG
    > yet.

okay, I can go for that.

    >> This seems to have something to do with tags.

    > Yes.  Implementation of specific tags was always optional with CBOR;
    > there is no single “required tag”.

yes, okay, but some Protocols might require them.
So maybe we need some additional BCP14-like terms. Hmm.

    >> which might mean ignoring unknown tags (if that's what they did
    >> before), passing the data up using whatever native interpretation
    >> there is, until they are configured otherwise.

    > Generic decoders that want to evolve to be usable with applications
    > that need tag support will need to develop a transition strategy.
    > Isn’t that obvious to any library developer?

Yes, but sometimes application developers need a hammer.
This happens far less often in open source situations, but in other
situations, the hammer is sometimes necessary to make a supplier spend the
effort.

    >> In general, I think that this introductory encoding section is too
    >> detailed, particularly for 31.  I think that detail belongs later
    >> on. I got no value (I retained nothing) from having that level of
    >> detail there.

    > I think there was another proposal to move around elements of Section
    > 3.  Sometimes it is necessary to include some detail in an overview for
    > completeness; we can’t really pretend ai=31 does not exist for a few
    > sections and then do a surprise reveal, can we?

We can just say less in section 3.  ai=31 "Stop-Code, see section FOO"
That's okay.  It's some of the other detail that seems to have accumulated
there which distracted me.

    >> I wonder if section 3.1, under major type 0, should clarify that
    >> "0" is encoded as 0b000_00000. (That is, there is no negative 0.)

    > Is this really more about major type 1, and the value -1?

maybe.
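
To make the point concrete, here is a minimal sketch of the single-byte
integer encodings (the function name is mine, not from the draft, and it
only covers -24..23).  Major type 1 carries -1 - n, so its smallest argument
already means -1 and there is no bit pattern left over for a "negative 0":

```python
def encode_small_int(n: int) -> bytes:
    """Encode an integer in -24..23 as a one-byte CBOR item (sketch only)."""
    if 0 <= n <= 23:
        # Major type 0 (0b000) in the top 3 bits, the value in the low 5 bits.
        return bytes([(0 << 5) | n])
    if -24 <= n <= -1:
        # Major type 1 (0b001) in the top 3 bits; the argument is -1 - n.
        return bytes([(1 << 5) | (-1 - n)])
    raise ValueError("this sketch only covers -24..23")


print(encode_small_int(0).hex())   # 00  (0b000_00000)
print(encode_small_int(-1).hex())  # 20  (0b001_00000)
```

So 0b000_00000 is 0 and 0b001_00000 is -1; the two major types do not
overlap at zero.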

    >> I guess that RFC8742 includes sequences of 7049bis CBOR sequences.  I
    >> wonder if Updates 8742 is appropriate.

    > You still lost me here.  What are 7049 CBOR sequences?

    > Oh.  We are talking about “Data Streams”.  This probably should mention
    > RFC 8742 (which only happens in 3.1)!

Yes.

    >>> If the break stop code appears after a key in a map, in place of that
    >>> key's value, the map is not well-formed.
    >>
    >> Does this mean that the entire map is not well-formed, or just the
    >> key/value pair where this occurs?  I take the first meaning, but I
    >> want to be sure.

    > It says that the map is not well-formed, but of course the whole data
    > item is dubious as it is not clear whether the map has ended or not.
    > So, again, give up.  Should be clear in Appendix C.

I am concerned that different kinds of parsers will wind up raising the error
at different points, so that some content gets examined while other content
does not.   This is akin to the duplicate key challenge.
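
As a rough illustration of the "give up on the whole item" reading, here is
a sketch of a checker that validates everything before anything is handed to
the application.  It is deliberately tiny: it only understands unsigned
integers 0..23 and indefinite-length maps, and all names are hypothetical.

```python
BREAK = 0xFF          # "break" stop code
INDEF_MAP = 0xBF      # indefinite-length map


class NotWellFormed(Exception):
    pass


def skip_item(data: bytes, pos: int) -> int:
    """Return the position just past one item, or raise NotWellFormed."""
    if pos >= len(data):
        raise NotWellFormed("truncated")
    b = data[pos]
    if b <= 0x17:                       # major type 0, value 0..23
        return pos + 1
    if b == INDEF_MAP:
        pos += 1
        while True:
            if pos >= len(data):
                raise NotWellFormed("truncated map")
            if data[pos] == BREAK:      # break where a key would start: map ends
                return pos + 1
            pos = skip_item(data, pos)  # key
            if pos < len(data) and data[pos] == BREAK:
                raise NotWellFormed("break in place of a value")
            pos = skip_item(data, pos)  # value
    raise NotWellFormed(f"unsupported or unexpected byte 0x{b:02x}")


def is_well_formed(data: bytes) -> bool:
    """True only if data is exactly one well-formed item (in this subset)."""
    try:
        return skip_item(data, 0) == len(data)
    except NotWellFormed:
        return False


print(is_well_formed(bytes.fromhex("bf0102ff")))    # True:  {_ 1: 2}
print(is_well_formed(bytes.fromhex("bf01020304ff")))  # True:  {_ 1: 2, 3: 4}
print(is_well_formed(bytes.fromhex("bf010203ff")))  # False: break after key 3
```

In the last input the pair 1: 2 is fine on its own.  A parser that delivers
pairs as it goes would have passed it up before hitting the error, while a
check-first parser like this one delivers nothing, which is exactly the
difference in behaviour I am worried about.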

    >> Could future Simple Values (such as 0..19) have complex structure
    >> the way that values 24->27 do?

    > No, the general syntax of heads does apply to the unallocated code
    > points as well.

I'd like the document to say this.

    >> Or to put it another way, can a decoder depend upon unassigned simple
    >> values having the one-or-two byte structure presented and be able to
    >> skip unknown values?

    > Yes.
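
For what it's worth, this is the kind of skipping code I have in mind.  It
is only a sketch under the assumption stated above (every simple value,
assigned or not, is either one byte or 0xf8 plus one byte); the function
name is made up and it does no validity checking beyond the length.

```python
def simple_value_length(data: bytes, pos: int = 0) -> int:
    """Return how many bytes a major-type-7 simple value occupies at pos."""
    initial = data[pos]
    major, ai = initial >> 5, initial & 0x1F
    if major != 7:
        raise ValueError("not major type 7")
    if ai < 24:
        return 1                      # value is in the initial byte itself
    if ai == 24:
        if pos + 1 >= len(data):
            raise ValueError("truncated")
        return 2                      # value is in the following byte
    # ai 25..27 are floats and ai 31 is the break code: not simple values.
    raise ValueError("not a simple value")


print(simple_value_length(b"\xf0"))      # 1  (simple(16), unassigned)
print(simple_value_length(b"\xf8\xff"))  # 2  (simple(255), unassigned)
```

A decoder written this way never needs to know whether a given simple value
has been registered in order to step over it.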

    >> I guess I will read onwards to find out... Got it.
    >>
    >> BTW: Tag 25 and 29 are called out after Table 5, but are not listed
    >> *in* table 5.  That whole paragraph could use some more periods, and
    >> maybe a blank line.  I'm still loss as to why <untagged><null> is
    >> better than <epoch><null>.
    >>
    >> Why can't we use decimal fractions, or bigfloats for time?

    > That may have been a mistake (which is one reason we have tag 1001
    > now).  The WG has generally taken a dim view on extending the domain
    > (allowable syntax for tag content) for a tag, so we can’t “fix” that —
    > note that for the date tag, we have taken the decision not to reuse one
    > tag for two different tag content syntaxes either.

okay.

    >> I think that words "bytewise lexicographic order" used in 4.2.1 may
    >> not survive translations in a meaningful way.

    > Deepl turns this into "byteweise lexikographische Ordnung”, na ja.  The
    > next alternative "byteweise lexikographische Reihenfolge” is very
    > close.  No idea about "ordre lexicographique par octet”.  Or "пошаговый
    > лексикографический порядок”, for that matter.  But "字节词序” looks
    > really good :-) (and is completely wrong).

I'm less concerned about mechanical translations of the draft than about
translations by a human who didn't understand the original mathematical
meaning, whether that person is a translator or a non-native
english/germanic/latin speaker writing code.  I'm just asking if you could
use a less technical term.  HIT ME ON THE HEAD.

    >> I think that the Introduction should have a section 1.3 that
    >> addresses the concept of "Protocols" on top of CBOR, referencing
    >> section 5.  I think that 4.2.2 should forward reference to 5, or maybe
    >> sections 4 and 5 I suggest "protocol" be capitalized consistently as
    >> Protocol when it is used in this way.
    >>
    >> I don't find that section 5.2 fits into section 5.  I think we already
    >> covered this concept.

    > This section reaffirms the concept developed above.  It also discusses
    > the application interface — have we really covered that here?

I don't understand your comments here.

    >> "0x62c0ae" does not contain valid UTF-8 and so is not a valid CBOR
    >> item.  ......  Generic encoders and decoders are expected to forward
    >> simple values and tags even if their specific codepoints are not
    >> registered at the time the encoder/decoder is written (Section 5.4).
    >>
    >> Generic decoders provide ways to present well-formed CBOR values, both
    >> valid and invalid, to an application.  The diagnostic notation
    >> (Section 8) may be used to present well-formed CBOR values to humans.
    >>
    >> I don't personally know enough UTF-8 to know why the above is invalid
    >> UTF-8.

    > Because it uses a long form where there is a shorter form - UTF-8 uses
    > deterministic (shortest) encoding.

Can you please say that in the document.

} For instance, "0x62c0ae" uses a long form where there is a shorter form.
} UTF-8 mandates deterministic (shortest) encoding, so this is not a valid
} CBOR item.

    >> Maybe saying that it's not because c0ae is an unassigned code point but
    >> because...

    > This is a mild reminder that you have to read up on UTF-8 to understand
    > CBOR.

Yes, but many will just plop the text into a UTF-8 parser and hope.
We can't all be experts at everything :-)
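
Here is roughly what "plop it into a UTF-8 parser" looks like, as a sketch
that only handles short text strings (length 0..23 in the initial byte); the
helper name is mine.  0x62 is the head for a 2-byte text string and c0 ae is
the two-byte long form of "." (U+002E, which should be the single byte 2e).
Python's decoder is strict by default and rejects it:

```python
def short_text_is_valid_utf8(item: bytes) -> bool:
    """Check the payload of a CBOR text string whose length is 0..23."""
    head = item[0]
    major, ai = head >> 5, head & 0x1F
    if major != 3 or ai > 23:
        raise ValueError("this sketch only handles short text strings")
    payload = item[1:1 + ai]
    if len(payload) != ai:
        raise ValueError("truncated")
    try:
        payload.decode("utf-8")       # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(short_text_is_valid_utf8(bytes.fromhex("62c0ae")))  # False
print(short_text_is_valid_utf8(bytes.fromhex("612e")))    # True
```

That works when the parser is strict.  My worry is the implementer whose
UTF-8 library is lenient and who has no idea from our text what to test for.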

    >> Having read through section 5, I believe even more than two weeks ago,
    >> that the "65535" tag should go into RFC7049bis, not a new document.

    > We didn’t want new stuff in 7049bis, so this is now in
    > draft-bormann-cbor-notable-tags with a mention here; RFC7049bis
    > combines with its registries to give the full semantics of CBOR, so
    > this is OK.

I feel that addressing a shortcoming in an existing PS is a reasonable thing
to do, but I won't fall on this sword.

    > Grüße, Carsten

Thank you for all the work.

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        | network architect  [
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [



--
Michael Richardson <mcr+IETF@sandelman.ca>, Sandelman Software Works
 -= IPv6 IoT consulting =-