Re: [Cbor] packed review (was: Re: Reminder and call for agenda: CBOR WG Virtual Meeting on 2022-06-01)

Christian Amsüss <christian@amsuess.com> Wed, 29 June 2022 07:55 UTC

Return-Path: <christian@amsuess.com>
X-Original-To: cbor@ietfa.amsl.com
Delivered-To: cbor@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6BF16C13A236 for <cbor@ietfa.amsl.com>; Wed, 29 Jun 2022 00:55:36 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.909
X-Spam-Level:
X-Spam-Status: No, score=-1.909 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Kj9dkaYy9LTT for <cbor@ietfa.amsl.com>; Wed, 29 Jun 2022 00:55:34 -0700 (PDT)
Received: from smtp.akis.at (smtp.akis.at [IPv6:2a02:b18:500:a515::f455]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id C2B94C13A228 for <cbor@ietf.org>; Wed, 29 Jun 2022 00:55:30 -0700 (PDT)
Received: from poseidon-mailhub.amsuess.com (095129206250.cust.akis.net [95.129.206.250]) by smtp.akis.at (8.17.1/8.17.1) with ESMTPS id 25T7tPPP038020 (version=TLSv1.2 cipher=ECDHE-ECDSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 29 Jun 2022 09:55:26 +0200 (CEST) (envelope-from christian@amsuess.com)
X-Authentication-Warning: smtp.akis.at: Host 095129206250.cust.akis.net [95.129.206.250] claimed to be poseidon-mailhub.amsuess.com
Received: from poseidon-mailbox.amsuess.com (poseidon-mailbox.amsuess.com [IPv6:2a02:b18:c13b:8010:a800:ff:fede:b1bf]) by poseidon-mailhub.amsuess.com (Postfix) with ESMTP id 991AD951F; Wed, 29 Jun 2022 09:55:23 +0200 (CEST)
Received: from hephaistos.amsuess.com (unknown [IPv6:2a02:b18:c13b:8010:297c:eccd:d672:21c1]) by poseidon-mailbox.amsuess.com (Postfix) with ESMTPSA id 45539D113; Wed, 29 Jun 2022 09:55:23 +0200 (CEST)
Received: (nullmailer pid 66302 invoked by uid 1000); Wed, 29 Jun 2022 07:55:22 -0000
Date: Wed, 29 Jun 2022 09:55:22 +0200
From: Christian Amsüss <christian@amsuess.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: cbor@ietf.org
Message-ID: <YrwFarI3RecBziqP@hephaistos.amsuess.com>
References: <CALaySJLPtUjdfVss17noK=18RyczpcCGNu=im8CBpiQz=WiLWA@mail.gmail.com> <CALaySJKUNh-AkJa87sCDpzf9OHV8H367VQyzyozXCCXxphUARw@mail.gmail.com> <CALaySJ+P2sP7BU7bNSxRJBByyp04rzVZuukq_e+9wbb5WPRSFQ@mail.gmail.com> <CALaySJKxht1gd1+3mNiAH-kLUAxjdPPk3doK50C_xS74LG+YTQ@mail.gmail.com> <CALaySJJjSHT2q_wpZQ9QFhLSxGuhffWwb=9P1XDUFTsheOvPZA@mail.gmail.com> <5A9B396E-1D9F-455C-949F-9B4C89AA510C@tzi.org> <CALaySJ+Sp=hmc4-kp1UrYPf0BxMtQy4aS+LiCfkREYqmip1Q6w@mail.gmail.com> <B9E21E1E-164D-4306-88D7-A88DC76080A9@tzi.org> <YrmJV7OwrbOI/zKe@hephaistos.amsuess.com> <861D3104-8C33-4819-9488-1885F94C973F@tzi.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg="pgp-sha256"; protocol="application/pgp-signature"; boundary="NeWG5yrkwqU58SDq"
Content-Disposition: inline
In-Reply-To: <861D3104-8C33-4819-9488-1885F94C973F@tzi.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/jRXtciYssm5wNf0AsswHeVO0Cok>
Subject: Re: [Cbor] packed review (was: Re: Reminder and call for agenda: CBOR WG Virtual Meeting on 2022-06-01)
X-BeenThere: cbor@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: "Concise Binary Object Representation \(CBOR\)" <cbor.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/cbor>, <mailto:cbor-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cbor/>
List-Post: <mailto:cbor@ietf.org>
List-Help: <mailto:cbor-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/cbor>, <mailto:cbor-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Jun 2022 07:55:36 -0000

Hello Carsten,

On Tue, Jun 28, 2022 at 10:44:50PM +0200, Carsten Bormann wrote:
> >> When reconstructing the original data item, such a reference is
> >> replaced by a data item constructed from the argument data item found
> >> in the table (argument, which might need to be recursively unpacked
> >> first) and the rump data item (rump, again possibly recursively
> >> unpacked).
> > 
> > This reads a bit like also the rump would be recursively unpacked first.
> > Is that the intention? (I think not -- with my understanding of
> > unpacking so far, the rump would be unpacked at the time it is
> > encountered in the reconstructed item, allowing the argument to alter
> > the dictionary).
> 
> You lost me at “alter the dictionary”.
> 
> The argument *is* in the dictionary.
> The rump needs to be “recursively” (we need to fix this term) unpacked because any parameters to the function tag (or to the affixing process) need to be unpacked (conceptually!) before they are applied.
> In a real implementation, the unpacking could be done by a visiting accessor, so there is no actual recursion.

If the rump is conceptionally unpacked before whatever is in the
dictionary is applied to it, then constructions in which the argument
data item would alter the dictionary are ruled out. That might be OK in
the interest of simplicity or even mental sanity, but it means that any
tags that alter the dictionary always need to be spelled out at some
cost.

In particular, this means that when tree-shaped dictionaries[1] are
used, a packed document that utilizes a particular branch would always
look like 5151([6(30), n1, n2, rump]) and not like 254([n1, n2, rump])
-- because while in the former the rump could use the tags set up, the
dictionary alteration expanded from 254 would have no effect.

(Where 254 would be a type-0 tag, defined in that context to to prepend
[6(30)] to the rump array and wrap the resulting array in 5151, and 5151
is some dictionary setup described in [1]).

[1]: https://github.com/core-wg/coral/pull/28

> >  * giving names to the two arguments the of the function (arg1, arg2?
> >    A, B? tagged, other?)
> 
> Left hand side, right hand side.

Sounds good.

> > but I'm unconvinced that that symmetry is worth the
> > cognitive / implementation load of doing it that way,
> 
> It makes sense to point out on which end the function tag is (we found
> good examples for both ends).  Of course, we could just take the end
> which has a tag, but that makes things a bit less deterministic (and
> becomes complicated if both ends have one).  I’d rather be explicit on
> where the function tag is, and the same bit that decides prefix vs.
> suffix can also be used to decide argument-tag vs. rump-tag.  The
> “type-0”/“type-1” wording is scary for about 20 minutes :-)

Being explicit is good, I just have a hard time seeing use cases for the
dominating tag being around the rump.

(Coming back to this more precisely in the next paragraph)

> > * Use a bit more separate tags for where there is real need for having
> >  both directions. (No type-0/type-1 distinction, and having the
> >  tag-that-indicates-function always in the dictionary. For string/array
> >  postfixes that'd need a tag already at dictionary setup).
>
> I need an example.
> 
> "having the tag-that-indicates-function always in the dictionary.” Has
> the disadvantage that you cannot use the same table entry string for
> midfix-from-rump and affix, as in my example.

I don't think that that is a common enough case to warrant the resulting
complexity -- especially when not embracing general function evaluation
(my proposal would limit these tags to dictionary positions, which I
think also simplifies implementations[^1]).

In an example like the "packed.example" URIs you give, we'd like to use
a value both as suffix and midfix-from-rump, there can just be two
dictionary entries for the two cases. This is compact in the setup and
in use:

113([[], ["packed.example", 1109(216("") / "packed.example" /)],
  [
  217(["https://", "/foo.html"]),
  217(["coap://", "/bar.cbor"])),
  216("mailto:support@"),
  ]])

What it does cost us -- that's the "use a bit more separate tags" part
-- is that we'd now need one tag 109 that expands to
Midfix(head_and_tail=arg, middle=rump) and one more 1109 that expands to
Midfix(head_and_tail=rump, middle=arg). These would be the
"argument-swapped" versions of each other, and we'd only need them
pairwise for functions in which both directions make sense.

In return, we get to use the space of reference tags flatly,
pay-as-you-use style. (If you use 32 arguments, thereof 8 with two
different tags, you get precisely the same 40 2-byte references as there
are with the type system. But if you use 40 arguments all with different
(or no) tags, you get to use 40 of them, and if you use 20 arguments
each with 2 tags, you can also fit them in the 2-byte reference space).

[^1]: Thinking of implementations that only support application-provided
dictionaries that are known at compile time. Having the dominating tag
always in the dictionary means that they don't even have to be
implemented in terms of dominating tags, but the argument table in the
implemenmtation can really be a pair of (function pointer, data) items
that correspond to the tag and tagged-value, roughly. The implementer
can then provide a large number of tag functions, but will only link in
those actually used in the dictionary, reducing the need for application
profiling.

> >  In packing, they'd be used as
> > 
> >  ```
> >  113([[],
> >    [109(["packed.example", ARG])],
> >    [6(["https://", "/foo.html"]),
> >     ...
> >    ]
> >  ])
> >  ```
> 
> What is ARG?

(see two paragaphs below, also in the original mail)

Note that we're here entering the realm of a second stage proposal (I
should have named them); I think I can make a good point for the above
proposal, but am unsure myself on what is below.

> I’m not yet quite ready to embrace general function evaluation as a
> part of CBOR (outside unpacking, that is).

A portion of tags we have already is (especially if one is willing to
admit abstract numerics in the model of the expanded values): bignums,
floats, decimal fractions and rationals are just functions of the shape
m*b^e/d (with varying elements referenced or constant), half of the
encoding changing tags can be viewed that way (depending whether you see
the encoded or the decoded form as the one that's in the data model),
and the arrays clearly are.

(And thanks for naming it -- this makes it easier to phrase more sharply
what follows.)

> >  (The first example here does beg the question the 'put the argument in
> >  here' placeholder ARG is best phrased; I don't have any definite
> >  answer here yet, but maybe it makes sense to shift up the index space
> >  inside the argument-item space by 1 and make 6(0) the argument --
> >  although that might be similarly hard to teach as type-0/type-1.)
> 
> The current design clearly identifies function tags as such by using
> them as lhs (from a type-0 reference) or rhs (from a type-1 reference)
> of an argument reference.

When going with this stage of proposal here, the generic function
evaluation and the expansion of argument references would be
independent.

If we were to use dictionary entries in that template style (with ARG =
6(0)), then *every* entry in the argument dictionary takes an ARG
(otherwise it'd be a shared item).

> — A function tag might be used in its meaning as a mere tag, i.e., not
> automatically evaluating.

That is and would really stay the case for any tag used here. Whether
TBDcat([[1, 2], [3, 4]]) is expanded to [1, 2, 3, 4] or handed to the
application as a tag with nested arrays is as much up to the application
as it is whether 64(h'01020304') is expanded to that or not.

The only difference is that in the context of packed, some function tags
will start to occur that were not efficient before.

BR
c

-- 
To use raw power is to make yourself infinitely vulnerable to greater powers.
  -- Bene Gesserit axiom