Re: [Ppm] Batch selection and use cases for DAP

Shan Wang <shan_wang@apple.com> Mon, 04 July 2022 18:17 UTC

From: Shan Wang <shan_wang@apple.com>
Date: Mon, 04 Jul 2022 19:17:23 +0100
References: <mailman.88.1656615603.44393.ppm@ietf.org>
To: ppm@ietf.org
In-reply-to: <mailman.88.1656615603.44393.ppm@ietf.org>
Message-id: <4FAC07F7-85DA-498C-8A4A-880A08BB499C@apple.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ppm/fyOr87xSabAhGxcjAvmNCmjm0e0>
Subject: Re: [Ppm] Batch selection and use cases for DAP
Precedence: list

I definitely support adding first-class support for different batch selectors in the protocol, especially the size-based selector. I can see it benefiting the following use cases:

1. Tasks that need an aggregate result as soon as the privacy guarantee is met, rather than waiting for a fixed interval to elapse.
2. Tasks that need similarly sized batches (and therefore a similar signal-to-noise ratio) in order to compare results across batches.
3. Tasks that need a strong privacy guarantee based on batch size, e.g. central differential privacy.

With batch-size-based collection, aggregators just need to make sure that each client report falls in only one batch, and only emit a batch once its size has reached `min_batch_size`. This selection is easier to understand and arguably simpler to implement. It is also not subject to the privacy concerns associated with interval slicing, as mentioned in issue195 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195).
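
As a rough illustration of the two rules above (each report lands in exactly one batch, and a batch is emitted only once it reaches `min_batch_size`), here is a minimal Python sketch. The class `SizeBasedBatcher`, its fields, and the sequential batch numbering are hypothetical names chosen for the example; they are not taken from the draft.

```python
from dataclasses import dataclass, field


@dataclass
class SizeBasedBatcher:
    """Hypothetical aggregator-side helper; not part of the DAP draft."""

    min_batch_size: int
    next_batch_id: int = 0
    open_batch: list = field(default_factory=list)
    seen_report_ids: set = field(default_factory=set)

    def add_report(self, report_id: bytes):
        """Add a report; return (batch_id, reports) once the batch is full, else None."""
        if report_id in self.seen_report_ids:
            # A report may fall in only one batch, so replays are ignored.
            return None
        self.seen_report_ids.add(report_id)
        self.open_batch.append(report_id)
        if len(self.open_batch) < self.min_batch_size:
            # Not enough reports yet: nothing is emitted.
            return None
        ready = (self.next_batch_id, self.open_batch)
        self.next_batch_id += 1
        self.open_batch = []
        return ready
```

With `min_batch_size=3`, the first two distinct reports return `None` and the third returns batch 0 containing all three; the next report starts a fresh batch.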

I think Chris's straw man design is a good starting point. For the `BatchSelector.fixed` type, the collector creates a `batch_id` to identify a collection, so multiple requests for the same `batch_id` return the same result (if `max_batch_lifetime` > 1).

Upon receiving a `CollectReq` of the `fixed` type, the leader performs {{batch-parameter-validation}} but skips any interval-related checks; instead, it checks whether the same `batch_id` has been collected before. The leader then begins working with the helpers to prepare shares for this `batch_id` (or continues this process, depending on the VDAF).
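
The "has this `batch_id` been collected before" check could look roughly like the sketch below. It assumes `max_batch_lifetime` bounds how many times the same batch may be collected; the class `FixedCollectValidator` and its in-memory counter are hypothetical, and a real leader would need persistent storage.

```python
class FixedCollectValidator:
    """Hypothetical leader-side check for `fixed` collect requests."""

    def __init__(self, max_batch_lifetime: int):
        self.max_batch_lifetime = max_batch_lifetime
        self.collect_counts = {}  # batch_id -> number of accepted collections

    def validate(self, batch_id: int) -> bool:
        """Accept the request unless batch_id has used up its lifetime."""
        count = self.collect_counts.get(batch_id, 0)
        if count >= self.max_batch_lifetime:
            return False
        self.collect_counts[batch_id] = count + 1
        return True
```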

In {{aggregate-flow}}, instead of sorting each report share into interval-based buckets, the leader and helper can save them to an `aggregate_share` identified by `aggregate_job_id`. The leader can send `batch_id` along with `aggregate_job_id` to the helpers in `AggregateReq`, so that a helper can associate a `batch_id` with multiple `aggregate_job_ids`. The same `batch_id` can then be used to collect the prepared `aggregate_shares` during the {{collect-aggregate}} flow.

Alternatively, the leader can associate `batch_id` with `aggregate_job_ids` internally, then send the helpers the list of `aggregate_job_ids` for a `batch_id` during the {{collect-aggregate}} flow. This way `AggregateReq` can remain unchanged.
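
Either option needs a mapping from `batch_id` to the aggregation jobs that fed it; the only difference is whether the helper builds it from `AggregateReq` or the leader keeps it and sends the list at collect time. A minimal sketch of such a mapping follows, with `BatchJobIndex` and its method names being hypothetical:

```python
from collections import defaultdict


class BatchJobIndex:
    """Hypothetical index from batch_id to its aggregate_job_ids."""

    def __init__(self):
        self._jobs = defaultdict(list)

    def record_job(self, batch_id: int, aggregate_job_id: bytes) -> None:
        """Associate an aggregation job with a batch (idempotent)."""
        jobs = self._jobs[batch_id]
        if aggregate_job_id not in jobs:
            jobs.append(aggregate_job_id)

    def jobs_for_batch(self, batch_id: int) -> list:
        """Return the job IDs to combine when batch_id is collected."""
        return list(self._jobs.get(batch_id, []))
```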

Also, `batch_interval` is used as AAD in Aggregate Share Encryption (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/blob/main/draft-ietf-ppm-dap.md#aggregate-share-encryption-aggregate-share-encrypt). For the `fixed` selector, `batch_id` or a hash of the `aggregate_job_ids` list should be used instead.
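
The two AAD candidates can be sketched as follows. Everything beyond "use `batch_id` or a hash of the job ID list" is an assumption made for the example: the `task_id` prefix, the big-endian `uint64` encoding, SHA-256 as the hash, and sorting plus length-prefixing the job IDs so that both aggregators derive the same bytes. The function names are hypothetical.

```python
import hashlib
import struct


def aad_from_batch_id(task_id: bytes, batch_id: int) -> bytes:
    """AAD candidate 1: task_id followed by batch_id as a big-endian uint64."""
    return task_id + struct.pack(">Q", batch_id)


def aad_from_job_ids(task_id: bytes, aggregate_job_ids: list) -> bytes:
    """AAD candidate 2: task_id followed by a SHA-256 hash of the job ID list."""
    h = hashlib.sha256()
    # Sort so the result does not depend on the order jobs were recorded in.
    for job_id in sorted(aggregate_job_ids):
        # Length prefix keeps [b"ab"] distinct from [b"a", b"b"].
        h.update(struct.pack(">H", len(job_id)))
        h.update(job_id)
    return task_id + h.digest()
```

The hash variant fits the alternative where `AggregateReq` stays unchanged, since the helper then learns the job ID list rather than a `batch_id` at collect time.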

Regards,
Shan

> From: Christopher Patton <cpatton@cloudflare.com>
> Subject: [Ppm] Batch selection and use cases for DAP
> Date: 30 June 2022 at 00:21:47 BST
> To: ppm <ppm@ietf.org>
> 
> 
> Hi all,
> 
> The current version of DAP prescribes a particular method of sorting reports into batches for aggregation. There are a couple of GitHub issues that describe use cases for which this method is not well-suited. First-class support for these use cases would require protocol changes. While considering if/how to change it, I think it would be helpful to take a step back and ask ourselves if there are any additional use cases to consider.
> 
> First, a quick recap of how batches are currently defined:
> * Reports are generated by Clients and uploaded to the Leader. Each `Report` has a timestamp.
> * In its collect request (i.e., HTTP request with a `CollectReq`), the Collector specifies a "batch interval", which defines the start and end time for reports that will be aggregated.
> * Upon receiving this request from the Collector, the Leader picks a batch of reports with timestamps that fall in the batch interval.
> * The Leader and Helper aggregate the batch. (Aggregation manifests as a sequence of `AggregateInitializeReq` / `AggregateContinueReq` flows, followed by a single `AggregateShareReq` in order to get the Helper's aggregate share for the entire batch.)
> 
> Observe that the batch itself is chosen by the Leader; the Collector merely specifies criteria for reports that are "valid" for that batch -- namely, that the report timestamp falls in the batch interval. Other criteria for selecting batches are possible, as I'll explain below.
> 
> Use case #1
> The current "batch selector" is well-suited for telemetry use cases where DAP is used to aggregate long-running time-series data. (For those familiar with Prometheus (https://prometheus.io), think of a dashboard you would build to monitor how long it takes browsers to download and render a page of your website.) However, it has some limitations that make other use cases much more difficult.
> 
> Use case #2
> As EKR points out in issue183 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/183), it would also be useful to be able to "filter" aggregate results based on metadata associated with reports. For example, one might need to break down results by user agent, geographical region, software version, etc. Supporting this functionality requires a different "batch selector", one that also accounts for additional dimensions along which batches can be sliced.
> 
> Use case #3
> In issue273 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/273), Shan Wang points to an altogether different (and, arguably, much simpler) batch selection criterion: Instead of sorting reports into batch intervals, we may simply want to ensure that batches are pairwise disjoint. Moreover, our application might require that each batch is exactly the same size (or at least within some small threshold).
> 
> Today, DAP only has first-class support for #1. Use cases #2 and #3 can kind of be implemented, but it would be painful. For my part, I would be in favor of adding protocol mechanisms in order to provide first-class support for additional use cases that are likely to be common. As a straw man, consider the following revised `CollectReq`:
> 
> ```
> + enum {
> +   reserved(0),
> +   interval(1), // For use case #1
> +   interval-metadata(2), // Use case #2
> +   fixed(3), // Use case #3
> + } BatchSelector;
> 
> struct {
>   TaskID task_id;
> - Interval batch_interval;
> + BatchSelector batch_selector;
> + select (batch_selector) {
> +   case interval: 
> +      Interval batch_interval;
> +   case interval-metadata:
> +      Interval batch_interval;
> +      opaque metadata<0..2^8-1>; // "User-Agent", etc.
> +   case fixed:
> +      uint64 batch_id;
> + };
>   opaque agg_param<0..2^16-1>; // VDAF aggregation parameter
> } CollectReq;
> ```
> 
> What this expresses is that the "batch interval" has been replaced by one of several "batch selectors", each designed to support a different (set of) use case(s). Each has some associated parameters used by the Leader to guide report selection. For example, the `fixed` selector encodes the "batch ID", as defined in issue273. It seems to me that something like this could work. Things to consider:
> * Both Aggregators need to be able to enforce that the batch meets the criteria specified by the Collector.
> * There are implications for storage requirements for the Aggregators.
> * There is also issue195 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195), in which Chris Wood points out some privacy implications regarding the flexibility afforded to the Collector in choosing the batch selection criteria. (This issue needs to be addressed in any case.)
> 
> Anyway ... thoughts? Specifically:
> (a) Is there a use case we're missing here?
> (b) What do you think of making changes to the protocol in order to support additional use cases?
> 
> Cheers,
> Chris P.
> 
>
