[Cbor] Re: draft-ietf-cbor-edn-literals-10 implementation notes

Carsten Bormann <cabo@tzi.org> Sat, 17 August 2024 16:58 UTC

Return-Path: <cabo@tzi.org>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2749EC14F683 for <cbor@ietfa.amsl.com>; Sat, 17 Aug 2024 09:58:24 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.909
X-Spam-Level:
X-Spam-Status: No, score=-6.909 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 6cFTOua-WXdO for <cbor@ietfa.amsl.com>; Sat, 17 Aug 2024 09:58:21 -0700 (PDT)
Received: from smtp.zfn.uni-bremen.de (smtp.zfn.uni-bremen.de [IPv6:2001:638:708:32::21]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-256) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id D2E3BC14F681 for <cbor@ietf.org>; Sat, 17 Aug 2024 09:58:19 -0700 (PDT)
Received: from clients-pool1-0074.vpn.uni-bremen.de (clients-pool1-0074.vpn.uni-bremen.de [134.102.107.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.zfn.uni-bremen.de (Postfix) with ESMTPSA id 4WmQ6K1F99zDCd1; Sat, 17 Aug 2024 18:58:17 +0200 (CEST)
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.7\))
From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <2E46E3B5-6F88-49D0-AD75-10CF611BBFB5@cursive.net>
Date: Sat, 17 Aug 2024 18:58:15 +0200
X-Mao-Original-Outgoing-Id: 745606695.033637-77ee74875f142b948c6d55a04ef25725
Content-Transfer-Encoding: quoted-printable
Message-Id: <B6B33BBA-2844-4FBF-B93E-9BC63A76CD47@tzi.org>
References: <FFD7DB6F-2313-4A64-B434-110B1C524D0C@cursive.net> <4DDB6F09-D038-4A13-80F3-A4C13F3EF529@tzi.org> <2E46E3B5-6F88-49D0-AD75-10CF611BBFB5@cursive.net>
To: Joe Hildebrand <hildjj@cursive.net>
X-Mailer: Apple Mail (2.3608.120.23.2.7)
Message-ID-Hash: 4DWEWEMQ7KSHV4GI2JOGEPWG42AQGG3E
X-Message-ID-Hash: 4DWEWEMQ7KSHV4GI2JOGEPWG42AQGG3E
X-MailFrom: cabo@tzi.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-cbor.ietf.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: CBOR <cbor@ietf.org>
X-Mailman-Version: 3.3.9rc4
Precedence: list
Subject: [Cbor] Re: draft-ietf-cbor-edn-literals-10 implementation notes
List-Id: "Concise Binary Object Representation (CBOR)" <cbor.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/MNSM92CohuL03kX1-hD32nZFOYQ>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Owner: <mailto:cbor-owner@ietf.org>
List-Post: <mailto:cbor@ietf.org>
List-Subscribe: <mailto:cbor-join@ietf.org>
List-Unsubscribe: <mailto:cbor-leave@ietf.org>

Hi Joe,

thank for the quick turnaround.

>> f'2' is equivalent to 2.0, not 2
>> i'1e3' is equivalent to 1000, not 1000.0
> 
> Nod.  After thinking about it some more, I don't think it's important to add.  Let me try a couple of things I thought of overnight before we add this complexity.

Thanks.

> 
>> 
>>> - s4.1 It should be an error to mix app-strings and strings, or app-strings of multiple types.  The text at the end of the section starting with "Some of the strings may be app-strings..." is not proscriptive enough.
>> 
>> This is a reasonable conclusion when you just look at the set of application-extensions that are in the document, but it breaks down when constructing strings from components that can be described by app-strings (something like a JWT that mixes base64-encoded strings with delimiters such as “.” and “~”).
>> (The example would need to build the base64-encoded parts using application-oriented literals, for which we don’t have an application-oriented literal — b64 goes the other way around.  I find it quite likely that we’ll have application-oriented literals for JWT and CWT over time.)
> 
> Can we at least say that the output type of the app-strings need to be compatible enough to concatenate?

The last bullet in 4.1 tries to do that.

> Take your 'f' approach from above; what would f'2' + f'3' mean?

"	• Some of the strings may be app-strings. If the type of the app-string is an actual string, joining of chunked strings occurs as with directly notated strings; otherwise the occurrence of more than one app-string or an app-string together with a directly notated string cannot be processed.”

I read the answer as “cannot be processed”, which was a way to say “not allowed”.  Wording fixes appreciated.

>>> - It's not clear to me what happens if there are multiple items in the sequence inside <<>>.  I assume they are concatenated together, even though that's a little odd unless you are generating cbor streams.  I would have expected the production to be:
>>> 
>>>  embedded = "<<" one-item ">>"
>> 
>> (Current ABNF is:
>> 
>>   embedded        = "<<" seq ">>"
>> 
>> where seq is a CBOR sequence.)
>> 
>> This syntax for Embedded CBOR sequences is quite explicitly part of Appendix G.3 of RFC 8610.
>> Embedded CBOR sequences are a natural way to employ the fact that the byte string encoding already provides an outer delineation, so an array head is superfluous.  I’ve seen them used in various contexts but don’t have an example handy off the top of my head.
> 
> OK.  It would be nice to have just a few words in this spec about the concatenation, then.  Particularly because the commas are now optional unlike in 8610.

I would like to avoid the term concatenation, as we are not concatenating data-model level strings here, but we are assembling CBOR sequences (sequences of encoded data items).  This is briefly described in Section 4.2 of RFC 8742; this doesn’t outright extend EDN, but is the basis for allowing seq at the top level of the current ABNF.

>> https://github.com/cbor-wg/edn-literal/pull/58
> 
> +1.

OK.  I’ll leave the PR out there for a little more time and merge it then.

>> ## Exposition
>> 
>> ### ABNF
>> 
>>> - Like others, I don't like the two-level ABNF.  I understand that you're going for extensibility, but you can still leave an extension point in place while having a single grammar.
>> 
>> As an implementers’ note for an implementation that just implements the extensions and constructs that already are in the document, I can follow that view (even if it works out differently in not only my implementation).
>> Extensibility, however, is important, and I think I have made my point why a little more work for some initial implementations will be worth it for extensibility.
>> Maybe an implementers’ note always will take the perspective of an implementer, not one of caring about protocol evolution; but some implementers have found that the two-level approach is also easier to implement (and more robust, to boot).
> 
> I really don't want to crank up an entire other parser instance for each app-
> string.  

I actually do that in my code, but I think I wouldn’t have to, as the parser instances can be re-used.

> There's not a *lot* of overhead, but there's enough that it will become noticeable, I think.  Therefore, I'm going to interpolate the app-string grammars into my larger grammar.  I hope I do it in a way that is similar enough to everyone else that it will interoperate and still allow extensibility correctly.  Maybe I'll get it right.

I think that is the point of the discussion about keeping some of PR # 49 around, either as an alternative (e.g., in an appendix) or as some wiki.

>>> - (nit) I'd prefer basenumber to be split into 3 or 4 rules, one for each base, since each needs special processing.
>> 
>> I personally like the more compact way, but I’ll put up a change proposal in a PR.
> 
> It's not worth the change if others had tools available to them to process 0x 0o and 0b strings (including hex floats) all in one place.  Maybe I could have gotten all of them except hexfloats if I'd been willing to use eval().  If we are going to make a change for this, I'd suggest hexfloats be separated from hexints as well, since they are more likely to need custom code.

Done in an update to #58.
As I wrote, my code is now trivial, if a bit more repetitive.

>> ### Document structure
>> 
>>> - The doc structure is quite odd in that it presents the extension points before the main format.  I understand that's an outgrowth of how the doc grew over time, but it needs a small refactor before publishing.  I'm willing to provide more suggestions or help, if the authors want.
>> 
>> One comment on another recent document essentially was that we should do a complete roll-up of the documents that constitute our ecosystem — we already did a roll-up for CBOR (RFC 8949) but could may revise that with a target publication date of 2027 (another seven years).  
>> I’d rather do the editorial work then.  
>> Right now the document relies on the reader having read Section 8 of 8949 and Appendix G of 8610; so focusing on the additions first sounds acceptable to me.
> 
> I *didn't* read 8610 as I sat down to implement this, and didn't understand that it was important to do so from the text.  I started work on a CDDL implementation months ago, but never got very far.  Nevertheless, I expected CDDL to be completely unrelated to implementing diagnostic notation,

It is.  It just happened to a document that we were completing at the time we decided to write up what became the E in EDN.
I’ll look for opportunities to clarify this.

> and assumed I didn't need that doc.  In particular, section 1 says:
> 
> "This document sets forth a further step of evolution of EDN, and it is intended to serve as a single reference target in specifications that use EDN."

A single reference target doesn’t imply that this needs to be without references…

>>> - s3 could be moved after the ABNF section and the (possibly new) app-string section and make more sense.
>> 
>> I need to re-read that to have an opinion.
>> (Moving the ABNF from an appendix to a main section strikes back.)
>> 
>> It seems to me a forward reference from Section 3 to the rule “app-string” in Section 4.1 would also do it.
> 
> Let's agree on the previous point before tackling this.
> 
>> (Embedded CBOR can be confusing because of the multiple layers, why it was a great addition to get syntax for that in Appendix G.3 of RFC 8610.)
> 
> If the comment was a little more than /head/ maybe it would have been more clear?  Perhaps /string, length 3/?

Good idea.  Will make PR.

>>> - s6 Security Considerations seems like it could use some more text about how this format isn't intended for interchange.
>> 
>> Some text (or just key words) would be appreciated.
> 
> This probably needs to be a separate thread.  What I'm most worried about is that we haven't thought of the ways that you could mangle input to a parser constructed from the ABNF to get surprising results.  JSON is small enough that we think we've worked through those boundaries (although we did find the issue with U+2028 and U+2029 relatively late).
> 
> Maybe we should at least do some fuzzing of the ABNF.

Doing an abnfgen also is quite illuminating — haven’t done that in a while...
> 
>>> I'm building up a large-ish set of test vectors.  I'm willing to put those into a separate repo for sharing if anyone is interested in collaborating.
>> 
>> Absolutely!
>> I have a very lazy set of tests (specifically testing out recent changes) as a CSV (first column = for equivalent and - for not equivalent, second column EDN, third column also EDN but different features):
>> 
>> https://github.com/cabo/edn-abnf/blob/explicit-concatenation/tests/basic.csv
>> 
>> (This CSV contains carriage returns in some of the CSV-quoted strings; this abuse of CSV might not survive a Windows environment, but the tests will magically still pass :-).
>> 
>> (This could have been EDN plus hex encoded CBOR data items, but it is so much easier to write the latter in EDN, too…)
> 
> Thanks, I'll take a look.  I have a bunch of strings that are supposed to cause failures.  

The current CSV represents a failure as 

-,mybadedn,

(i.e., “-“ for not OK, and leave the third column empty).

> Those probably should be in a separate file.  

I like them together with the related succeeding cases.

> I'll send PRs.

Looking forward!

Grüße, Carsten