[Cbor] Re: draft-ietf-cbor-edn-literals-10 implementation notes

Carsten Bormann <cabo@tzi.org> Sat, 17 August 2024 11:40 UTC

Return-Path: <cabo@tzi.org>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A120AC14F6AA for <cbor@ietfa.amsl.com>; Sat, 17 Aug 2024 04:40:39 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.898
X-Spam-Level:
X-Spam-Status: No, score=-1.898 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01, T_SPF_TEMPERROR=0.01] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id TSseJEGIpqUd for <cbor@ietfa.amsl.com>; Sat, 17 Aug 2024 04:40:33 -0700 (PDT)
Received: from smtp.zfn.uni-bremen.de (smtp.zfn.uni-bremen.de [IPv6:2001:638:708:32::21]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-256) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 1653FC14F60B for <cbor@ietf.org>; Sat, 17 Aug 2024 04:40:31 -0700 (PDT)
Received: from clients-pool1-0074.vpn.uni-bremen.de (clients-pool1-0074.vpn.uni-bremen.de [134.102.107.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.zfn.uni-bremen.de (Postfix) with ESMTPSA id 4WmH3b68r8zDCbN; Sat, 17 Aug 2024 13:40:27 +0200 (CEST)
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <FFD7DB6F-2313-4A64-B434-110B1C524D0C@cursive.net>
Date: Sat, 17 Aug 2024 13:40:27 +0200
X-Mao-Original-Outgoing-Id: 745587627.408378-0c645ed824d8cb6db44442482ddb0c39
Content-Transfer-Encoding: quoted-printable
Message-Id: <4DDB6F09-D038-4A13-80F3-A4C13F3EF529@tzi.org>
References: <FFD7DB6F-2313-4A64-B434-110B1C524D0C@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Message-ID-Hash: KYZSHVKXEXZPDZCOZ6PWDW6VOEB7STPW
X-Message-ID-Hash: KYZSHVKXEXZPDZCOZ6PWDW6VOEB7STPW
X-MailFrom: cabo@tzi.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-cbor.ietf.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: CBOR <cbor@ietf.org>
X-Mailman-Version: 3.3.9rc4
Precedence: list
Subject: [Cbor] Re: draft-ietf-cbor-edn-literals-10 implementation notes
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/b9zuxpGT67NwJ6tOcWQ3AQzIh_Q>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Owner: <mailto:cbor-owner@ietf.org>
List-Post: <mailto:cbor@ietf.org>
List-Subscribe: <mailto:cbor-join@ietf.org>
List-Unsubscribe: <mailto:cbor-leave@ietf.org>

Hi Joe,

thank you for this extensive feedback.

I have structured my response into several sections

> - Overall, this spec is in pretty good shape.  Aside from the app-strings, I was able to implement directly from the ABNF using my standard tooling.  Reading the actual text of the spec answered most of my questions.

Good.  Of course, I’d like to know the complement of the “most” — I assume there will be something below on some of these items.

## Proposals for technical changes

> - I wish that floating point encodings had a separate spec syntax from integers, rather than relying on getting a decimal point or "e" into the output somehow (e.g. JS doesn't have good float formatting built in).  For example, 2_f1 could mean 0xf94000, while 2_1 would mean 0x190002.

We have used diagnostic notation in a different way for a long time.
So I’d consider this a technical change.

The specific change suggested here violates an invariant of encoding indicators: You can leave them off and will still get an equivalent data item from a data model point of view.  

(It did take us some time to learn enough from implementations to evolve the CBOR data model from a muddy boundary between integer and float in RFC 7049 to a much more well-defined boundary that we now have in RFC 8949.)

I would probably solve the specific use case by two application-oriented extension literals:

f'2' is equivalent to 2.0, not 2
i'1e3' is equivalent to 1000, not 1000.0

> - s4.1 It should be an error to mix app-strings and strings, or app-strings of multiple types.  The text at the end of the section starting with "Some of the strings may be app-strings..." is not proscriptive enough.

This is a reasonable conclusion when you just look at the set of application-extensions that are in the document, but it breaks down when constructing strings from components that can be described by app-strings (something like a JWT that mixes base64-encoded strings with delimiters such as “.” and “~”).
(The example would need to build the base64-encoded parts using application-oriented literals, for which we don’t have an application-oriented literal — b64 goes the other way around.  I find it quite likely that we’ll have application-oriented literals for JWT and CWT over time.)

> - It's not clear to me what happens if there are multiple items in the sequence inside <<>>.  I assume they are concatenated together, even though that's a little odd unless you are generating cbor streams.  I would have expected the production to be:
> 
>    embedded = "<<" one-item ">>"

(Current ABNF is:

    embedded        = "<<" seq ">>"

where seq is a CBOR sequence.)

This syntax for Embedded CBOR sequences is quite explicitly part of Appendix G.3 of RFC 8610.
Embedded CBOR sequences are a natural way to employ the fact that the byte string encoding already provides an outer delineation, so an array head is superfluous.  I’ve seen them used in various contexts but don’t have an example handy off the top of my head.

> You would still get concatenation with << 1 >> + << 2 >> in the unlikely event you need it.

I would consider needing to write this (instead of << 1, 2 >>) noise.

## Bugs

> - s3.2, "Herewith I buy" /.../ "gned: Alice & Bob" doesn't match the grammar.  I think it needs a +.


Good point.
(Mental note: I need to run the examples through the newest edn-abnf…)

I propose to simply remove the “contract” member of the example, as it is no longer really a compelling way of using comments.

Now:
https://github.com/cbor-wg/edn-literal/pull/58

## Exposition

### ABNF

> - Like others, I don't like the two-level ABNF.  I understand that you're going for extensibility, but you can still leave an extension point in place while having a single grammar.

As an implementers’ note for an implementation that just implements the extensions and constructs that already are in the document, I can follow that view (even if it works out differently in not only my implementation).
Extensibility, however, is important, and I think I have made my point why a little more work for some initial implementations will be worth it for extensibility.
Maybe an implementers’ note always will take the perspective of an implementer, not one of caring about protocol evolution; but some implementers have found that the two-level approach is also easier to implement (and more robust, to boot).

> - (nit) I'd prefer basenumber to be split into 3 or 4 rules, one for each base, since each needs special processing.

I personally like the more compact way, but I’ll put up a change proposal in a PR.

### Document structure

> - The doc structure is quite odd in that it presents the extension points before the main format.  I understand that's an outgrowth of how the doc grew over time, but it needs a small refactor before publishing.  I'm willing to provide more suggestions or help, if the authors want.

One comment on another recent document essentially was that we should do a complete roll-up of the documents that constitute our ecosystem — we already did a roll-up for CBOR (RFC 8949) but could may revise that with a target publication date of 2027 (another seven years).  
I’d rather do the editorial work then.  
Right now the document relies on the reader having read Section 8 of 8949 and Appendix G of 8610; so focusing on the additions first sounds acceptable to me.

> - s3 could be moved after the ABNF section and the (possibly new) app-string section and make more sense.

I need to re-read that to have an opinion.
(Moving the ABNF from an appendix to a main section strikes back.)

It seems to me a forward reference from Section 3 to the rule “app-string” in Section 4.1 would also do it.

> - s4.2 could be outdented to s5, containing both ABNF and the descriptions of the app-string formats from s2.  Having to go back and forth made reading more difficult than it needed to be.

Makes sense on a napkin.  I’ll try to do a PR.

> - s4.2.1, h'/head/ 63 /contents/ 66 6f 6f' should become << "cfoo" >>, not << "foo" >> if I'm understanding correctly.

“foo” is a text string of length 3, so it begins with 0x63, followed by 66 6f 6f for the f o o characters.
“cfoo” is a text string of length 4, so it begins with 0x64, followed by 63 66 6f 6f.

$ echo "h'/head/ 63 /contents/ 66 6f 6f'" | edn-abnf -tdiag - | diag2pretty.rb
44          # bytes(4)
   63666f6f # "cfoo"
$ echo '<< "foo" >>' | edn-abnf -tdiag - | diag2pretty.rb
44          # bytes(4)
   63666f6f # "cfoo"
$ echo '<< "cfoo" >>' | edn-abnf -tdiag - | diag2pretty.rb
45            # bytes(5)
   6463666f6f # “dcfoo"
$ echo "h'/head/ 63 /contents/ 66 6f 6f'" | edn-abnf -tdiag - | diag2diag.rb -e
<< "foo" >>
$ echo '<< "foo" >>' | edn-abnf -tdiag - | diag2diag.rb -e
<< "foo" >>
$ echo '<< "cfoo" >>' | edn-abnf -tdiag - | diag2diag.rb -e
<< "cfoo" >>
(Embedded CBOR can be confusing because of the multiple layers, why it was a great addition to get syntax for that in Appendix G.3 of RFC 8610.)

> - s6 Security Considerations seems like it could use some more text about how this format isn't intended for interchange.

Some text (or just key words) would be appreciated.

> I'm building up a large-ish set of test vectors.  I'm willing to put those into a separate repo for sharing if anyone is interested in collaborating.

Absolutely!
I have a very lazy set of tests (specifically testing out recent changes) as a CSV (first column = for equivalent and - for not equivalent, second column EDN, third column also EDN but different features):

https://github.com/cabo/edn-abnf/blob/explicit-concatenation/tests/basic.csv

(This CSV contains carriage returns in some of the CSV-quoted strings; this abuse of CSV might not survive a Windows environment, but the tests will magically still pass :-).

(This could have been EDN plus hex encoded CBOR data items, but it is so much easier to write the latter in EDN, too…)

> I'm not quite done with the implementation, so there are likely to be a few more comments as I continue to dig.

Looking forward to them!

Grüße, Carsten