Re: [Ppm] Batch selection and use cases for DAP

Christopher Patton <cpatton@cloudflare.com> Fri, 15 July 2022 00:49 UTC

References: <CAG2Zi212sWmk3Piuu4Q0YE+wcqhgObx9F7r=SJV5d3Xqy8tFkQ@mail.gmail.com> <CAGkoAS2ZH1s_E=B745-HrvYSkaXuO6xx_t+J57A8sJ23a36w8A@mail.gmail.com> <CABN231ohNh90_XBwEZsi_tdTo8Xc7_xyS1t98B2BZn4mCw7ufg@mail.gmail.com>
In-Reply-To: <CABN231ohNh90_XBwEZsi_tdTo8Xc7_xyS1t98B2BZn4mCw7ufg@mail.gmail.com>
From: Christopher Patton <cpatton@cloudflare.com>
Date: Thu, 14 Jul 2022 17:49:32 -0700
Message-ID: <CAG2Zi21TXZc+QwYtpm=5cqkW=5C3XPmWy+E3xXiBL2rAqgaDvQ@mail.gmail.com>
To: Tim Geoghegan <timg=40letsencrypt.org@dmarc.ietf.org>
Cc: Simon Friedberger <simon@mozilla.com>, Christopher Patton <cpatton=40cloudflare.com@dmarc.ietf.org>, ppm <ppm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ppm/FoqT2_iwKDL_G0_bImeg9kLWiqg>
Subject: Re: [Ppm] Batch selection and use cases for DAP

Tim,

> I have one hair to split with Chris' summary of the current state of play:
>
> > Upon receiving this request from the Collector, the Leader picks a batch
> of reports with timestamps that fall in the batch interval.
>
> The important nuance is that the leader and helper independently figure
> out what set of reports fall within the collector's chosen batch interval.
> Obviously, the helper's view of what reports exist is determined by what
> reports the leader forwards to it (that is, until we move to a split upload
> model[1]), but the leader doesn't send the helper a list of report IDs (I
> believe this is an important property because that list would be quite
> large). I bring this up so that everyone remembers that the protocol must
> guarantee that leader and helper agree on the set of reports included in an
> aggregation.
>

Yeah, this is definitely worth calling out. To put it more precisely: only
reports that are deemed valid are included in the batch, and there are many
reasons why the Helper might choose to reject a report.
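
To make the point concrete, here is a minimal Python sketch. Everything in
it is an assumption for illustration: the `Report` type, `select_batch`, and
the validity checks are hypothetical, and the set intersection is a
simplification (in the real protocol the Leader learns about rejections
during aggregation; it does not compare sets of report IDs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: str
    timestamp: int  # seconds since the epoch


def in_interval(report, start, duration):
    """True if the report's timestamp falls in [start, start + duration)."""
    return start <= report.timestamp < start + duration


def select_batch(reports, start, duration, is_valid):
    """Run independently by each Aggregator on its own view of the reports."""
    return {
        r.report_id
        for r in reports
        if in_interval(r, start, duration) and is_valid(r)
    }


reports = [Report("a", 100), Report("b", 150), Report("c", 250)]

# Hypothetical validity checks: the Helper rejects report "b".
leader_batch = select_batch(reports, 100, 100, lambda r: True)
helper_batch = select_batch(reports, 100, 100, lambda r: r.report_id != "b")

# Only reports accepted by both Aggregators contribute to the aggregate.
agreed = leader_batch & helper_batch
print(sorted(agreed))
```

Report "c" is outside the interval for both Aggregators, and "b" drops out
because one of them rejected it.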


> With that in mind, I wonder if the existing
> `AggregateShareReq.batch_interval` sent from leader to helper suffices for
> #3 (this is an alternative to Shan Wang's idea of surfacing the aggregation
> job ID <=> batch ID mapping in the protocol messages). We can impose a
> total ordering on reports, so once the leader has identified a set of
> reports that satisfy the desired batch size, it should be able to devise an
> interval that captures at least (or I think exactly) those reports and send
> that to the helper. My thinking is that we can minimize the protocol text
> and code needed for the helper to support these different modes of
> aggregations.
>

We previously rejected the idea of imposing a total ordering on reports.
You might recall the reason: it amounts to a distributed computing problem
that is a pain in the neck for the Leader to solve. (Imagine a Leader
composed of a number of distributed worker nodes that have to coordinate
to order the reports.)
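
A rough Python sketch of where the coordination comes from. The shard
layout and the `total_order` helper are hypothetical, and a real deployment
would not hold everything in one process; the point is only that a global
order cannot be produced without looking at every worker's reports.

```python
import heapq

# Hypothetical: each worker holds its own shard of (timestamp, report_id)
# pairs, sorted locally.
worker_shards = [
    [(100, "a"), (160, "d")],
    [(120, "b"), (170, "e")],
    [(140, "c")],
]


def total_order(shards):
    """Merge every worker's shard into one globally ordered list.

    The merge has to compare the head of every shard, so all workers must
    take part before the first batch can be cut.
    """
    return list(heapq.merge(*shards))


ordered = total_order(worker_shards)
batch = ordered[:3]  # the first three reports in the global order
print([report_id for _, report_id in batch])
```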



> I have two thoughts on use case #2: First, if we want the leader to be
> able to select reports based on metadata provided by the collector, then I
> think the `Report`[2] message needs to include some metadata that can be
> matched against `CollectReq.interval-metadata.metadata`. Would we then also
> need DAP to define some kind of query language, so that a collector could
> express something like "aggregate over the reports where `report.foo = bar
> && 0 <= report.qux <= 100`"?
>

Yeah, something like that would be needed.
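
For illustration only, the example predicate from the quoted text could be
evaluated like this in Python. The `metadata` field on `Report` and the
`matches` function are hypothetical; DAP defines neither, and I am assuming
`bar` is a string value.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    report_id: str
    metadata: dict = field(default_factory=dict)  # hypothetical field


def matches(report):
    """Hypothetical predicate: report.foo = bar && 0 <= report.qux <= 100."""
    qux = report.metadata.get("qux")
    return (
        report.metadata.get("foo") == "bar"
        and qux is not None
        and 0 <= qux <= 100
    )


reports = [
    Report("r1", {"foo": "bar", "qux": 42}),
    Report("r2", {"foo": "baz", "qux": 42}),   # wrong foo
    Report("r3", {"foo": "bar", "qux": 101}),  # qux out of range
    Report("r4", {"foo": "bar"}),              # qux missing
]

selected = [r.report_id for r in reports if matches(r)]
print(selected)
```

A real query language would also have to say what happens when a field is
missing; this sketch simply treats a missing field as a non-match.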


> Second, I worry that this means aggregators can't begin accumulating
> prepared output shares into aggregates until they get the
> `CollectReq.interval-metadata.metadata` value. If a deployment knows it
> wants to break out aggregations by something like the client's country,
> could it not define distinct tasks for each value? I'm perfectly willing to
> be convinced that this is not practical, but I feel it's important to
> consider doing nothing, and also I think it'd be nice to have potential
> deployers of DAP spell out more explicitly what kinds of aggregation use
> cases they have.
>

I agree, it would be useful if we didn't have to slice data arbitrarily so
that output shares can be "pre-merged", like we can do for batch windows
today.
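
Here is a small Python sketch of what I mean by "pre-merging" for batch
windows. The window length, the `accumulate` function, and the use of plain
integer lists as output shares are all assumptions for illustration (real
output shares are vectors of field elements).

```python
from collections import defaultdict

WINDOW = 3600  # hypothetical batch window length, in seconds


def window_start(timestamp):
    """Start of the batch window that contains the timestamp."""
    return timestamp - (timestamp % WINDOW)


def merge(a, b):
    """Element-wise addition of two output shares (plain ints here)."""
    return [x + y for x, y in zip(a, b)]


# One running aggregate per batch window; individual shares are not kept.
aggregates = defaultdict(lambda: [0, 0])


def accumulate(timestamp, output_share):
    key = window_start(timestamp)
    aggregates[key] = merge(aggregates[key], output_share)


accumulate(10, [1, 0])
accumulate(20, [0, 1])
accumulate(3700, [1, 1])
print(dict(aggregates))
```

This works because the window is known from the report's timestamp alone.
If the grouping depended on metadata the Collector only supplies at collect
time, the Aggregator would have to hold on to each output share until then.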

Chris P.
