Re: [Cbor] CDDL parsing questions

Carsten Bormann <cabo@tzi.org> Fri, 19 August 2022 08:02 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Fri, 19 Aug 2022 10:02:04 +0200
Cc: Derek Atkins <derek@ihtfp.com>, cbor@ietf.org
To: Toerless Eckert <tte@cs.fau.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/faWtXAxT6e25EWz5vIkmcwzr8xI>

Hi Toerless,

You are trying to create a processing model for CDDL on the fly.
That requires a bit more thinking and careful design.

>> I prefer to reserve the term “parsing” to text-based protocols that are best handled with parser generators.
> 
> Except for "text-based protocols", that was exactly what i was thinking of - if
> i correctly understand you:
> 
> "CDDL parser:"

A program that parses CDDL.  Not something that happens on a constrained device at run time.  Example: [1]

[1]: https://www.ietf.org/archive/id/draft-bormann-cbor-cddl-freezer-09.html#name-alternative-representations

> A program that creates from CDDL input a program which takes a CBOR input
> and spits out a structure (tree?) of CDDL names, each pointing to the "parsed"
> CDDL structures that it represents.

That would be a “validator generator”, with some concept of PSVI (post-schema-validation instance) that includes annotation by rule names.
That is a very specific processing model, and one size may well not fit all.
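
To make the "validator generator" idea concrete, here is a minimal Python sketch. It is not a CDDL implementation: the rule table is a hypothetical stand-in (rule name mapped to a fixed array of literal values or Python types), and the names `generate_validator` and `element_matches` are invented for illustration. The result it returns is a toy version of a PSVI: the input value annotated with the name of the rule it matched.

```python
from typing import Any, Callable, Optional

# Hypothetical miniature "schema": rule name -> fixed array of literals or types.
# This is only a stand-in for CDDL rules, chosen to keep the sketch short.
Rules = dict[str, list[Any]]


def element_matches(pattern: Any, value: Any) -> bool:
    """A type pattern matches by isinstance; anything else matches by equality."""
    if isinstance(pattern, type):
        # bool is a subclass of int in Python; keep the two apart.
        if pattern is int and isinstance(value, bool):
            return False
        return isinstance(value, pattern)
    return type(pattern) is type(value) and pattern == value


def generate_validator(rules: Rules) -> Callable[[Any], Optional[dict]]:
    """Return a validator that annotates a matching item with its rule name."""

    def validate(item: Any) -> Optional[dict]:
        if not isinstance(item, list):
            return None
        for name, pattern in rules.items():
            if len(pattern) == len(item) and all(
                element_matches(p, v) for p, v in zip(pattern, item)
            ):
                # Toy "PSVI": the instance plus the rule name it matched.
                return {"rule": name, "value": item}
        return None

    return validate


validate = generate_validator(
    {"Message1": [str, int, 1], "Message2": [str, int, 2]}
)
```

An implementation is free to never build such an annotated structure at all, which is why requiring it in a protocol specification would be overspecification.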

> CVE ?

https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures

>> What you seem to be alluding to, is the ingestion of a CBOR data item in CBOR generic data model into the semantic categories that application wants.  CDDL can describe some, but not all of this process.
> 
> My main point is not even about implementation, but see above.
> 
> My main point is that if we use CDDL to specify protocol structures with CDDL
> names that we then need an agreement about what it means for protocol input/output
> to comply with that CDDL specification or not.

That depends on what you said in the specification.
Typically, the specification would require certain messages to match certain CDDL rules.
(Either by making the CDDL normative or by saying there is a normative English language specification that is then in practice made comprehensible and disambiguated by the CDDL, just like we have been using ABNF in many cases.)

> To me, that is the case if i
> could have the above "CDDL parser" and it would take my CBOR protocol structure input
> and attach the CDDL name to it that i think that CBOR protocol structure represents.

This presumes a processing model that may not be common to all implementations of a protocol; i.e., this would be overspecification.

> Not really any different whether i specify in CDDL or in ASCII-art, only
> that i think we never philosophized about the process of determining whether or
> not a protocol structure is compliant with the specification - because we
> intuitively/from-experience always choose to define protocol structures
> simple enough that we didn't have much to discuss.

We didn’t have to, because “Protocol message matches CDDL rule” is a very straightforward rule.

>> CDDL can be used to write complex grammars that require more look-ahead than one would like to have, e.g.
>> 
>> Message = Message1 / Message2
>> 
>> Message1 = [foo, bar, 1]
>> 
>> Message2 = [foo, bar, 2]
>> 
>> Don’t do that.
> 
> Exactly. This is what i think our "CDDL protocol" in question does, or
> at least would do if we went down that path, and hence this mailing list thread.

But this is a quality-of-specification discussion, not a what-does-the-specification-mean question.

>> (A tool that flags excessive look-ahead requirements would be useful.
>> In this case, putting the discriminator up front is helpful:
>> Message1 = [1, foo, bar]
>> Message2 = [2, foo, bar]
> 
> Exactly. But IMHO that is ONLY necessary/beneficial if we do have a good
> definition as to what "CDDL protocols" can and which ones can't afford this lookahead
> 
>  good-protocol = [This, is, a, lovely, protocol, ",", dear]
>  bad-protocol  = [This, is, a, lovely, protocol, ",", idiot]

We don’t have to (and, really, can’t) define a quality threshold.
Having a spec that also explains quality aspects such as avoiding unnecessary look-ahead would be useful.
This could be done in conjunction with a more detailed examination of processing models, which I would expect to be part of the CDDL 2.0 effort.
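
As a rough illustration of what a look-ahead-flagging tool might measure, the Python sketch below counts how many array elements have to be seen before only one alternative remains. Everything in it is an assumption made for the example: the `ANY` wildcard stands in for `foo` and `bar`, the two tables mirror the Message1/Message2 examples quoted above, and `elements_needed` is a hypothetical helper. It only narrows down the candidate; it does not validate the rest of the item.

```python
from typing import Any, Iterable, Optional

ANY = object()  # hypothetical wildcard standing in for "foo" and "bar"


def elements_needed(
    alternatives: dict[str, list[Any]], item: Iterable[Any]
) -> tuple[Optional[str], int]:
    """Consume elements until at most one alternative is left.

    Returns (remaining rule name or None, number of elements consumed).
    """
    candidates = dict(alternatives)
    consumed = 0
    for index, value in enumerate(item):
        consumed = index + 1
        candidates = {
            name: pattern
            for name, pattern in candidates.items()
            if index < len(pattern)
            and (pattern[index] is ANY or pattern[index] == value)
        }
        if len(candidates) <= 1:
            break
    chosen = next(iter(candidates)) if len(candidates) == 1 else None
    return chosen, consumed


# Discriminator at the end: every element must be read first.
TRAILING = {"Message1": [ANY, ANY, 1], "Message2": [ANY, ANY, 2]}
# Discriminator up front: the first element decides.
LEADING = {"Message1": [1, ANY, ANY], "Message2": [2, ANY, ANY]}

trailing_result = elements_needed(TRAILING, ["x", "y", 2])
leading_result = elements_needed(LEADING, [2, "x", "y"])
```

With the trailing discriminator the decision needs all three elements; with the leading one it needs a single element. A number like this is something a spec author can look at, without anyone having to define a threshold.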

> For on-the-wire-protocols i don't think i have ever seen this "lookahead",

I’m pretty sure I’ve had to look into parameters inside TLVs to categorize the things that had those parameters; this is a form of look-ahead.

> but in programming and human language parsers it is of course common.
> 
> So now i fundamentally start to wonder if we're not missing out on wonderful
> world of richer, and for some reason better syntax in on-the-wire protocols
> solely because previously we designed on-the-wire protocols primarily
> so that "hand-written-parsers" could be easy, whereas those human/computer-language
> parsers already went way beyond that layer of the problem and had "automated"
> the parsing, hence achieving far more flexible syntax.

CBOR was designed to enable a very, very automatic lexical level (generic decoder), and to allow the whole gamut of approaches to the syntactic (structural) level, including hand-coding, code generation, and generation of parameters for a more generic (“table-driven”) syntactic decoder.  Also, the two levels can be merged if that helps (it usually does with memory allocation).
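
A small Python sketch of the two levels, under stated assumptions: `decode_item` is a deliberately incomplete generic decoder (integers, byte and text strings, and definite-length arrays only; no maps, tags, floats, indefinite lengths, or truncation checks), and `DISPATCH`/`classify` are a hypothetical table-driven structural step keyed on a leading discriminator. None of these names come from an existing library.

```python
from typing import Any, Optional


def decode_item(data: bytes, pos: int = 0) -> tuple[Any, int]:
    """Lexical level: decode one data item, return (value, next position)."""
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        arg = info  # argument is carried in the initial byte
    elif info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4 or 8 following bytes
        arg = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite/reserved encodings not handled in this sketch")
    if major == 0:  # unsigned integer
        return arg, pos
    if major == 1:  # negative integer
        return -1 - arg, pos
    if major == 2:  # byte string
        return data[pos:pos + arg], pos + arg
    if major == 3:  # text string
        return data[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:  # definite-length array
        items = []
        for _ in range(arg):
            value, pos = decode_item(data, pos)
            items.append(value)
        return items, pos
    raise ValueError(f"major type {major} not handled in this sketch")


# Syntactic level, table-driven: leading discriminator -> rule name.
DISPATCH = {1: "Message1", 2: "Message2"}


def classify(item: Any) -> Optional[str]:
    if isinstance(item, list) and item and type(item[0]) is int:
        return DISPATCH.get(item[0])
    return None


# 0x83 = array of 3; 0x02 = 2; 0x61 0x61 = "a"; 0x61 0x62 = "b"
decoded, end = decode_item(bytes.fromhex("830261616162"))
```

Here the generic decoder knows nothing about the protocol, and the table knows nothing about the encoding; a hand-coded or generated decoder could equally well fold the two steps into one pass.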

> But then of course i go back and ask: what is the most simple _good_
> example why we would want to do lookahead. Right now our protocol in question
> answer to me is a bit "we forgot to avoid lookahead in our original design,
> and when we now want to extend the protocol with maximum backward
> compatibility, we create lookahead". But i am not persuaded that this is
> a good-enough reason.

I don’t think this can be answered in a general way.

Grüße, Carsten