Re: [Ppm] Batch selection and use cases for DAP

Shan Wang <shan_wang@apple.com> Mon, 04 July 2022 18:17 UTC

From: Shan Wang <shan_wang@apple.com>
Date: Mon, 04 Jul 2022 19:17:23 +0100
References: <mailman.88.1656615603.44393.ppm@ietf.org>
To: ppm@ietf.org
In-reply-to: <mailman.88.1656615603.44393.ppm@ietf.org>
Message-id: <4FAC07F7-85DA-498C-8A4A-880A08BB499C@apple.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ppm/fyOr87xSabAhGxcjAvmNCmjm0e0>
Subject: Re: [Ppm] Batch selection and use cases for DAP

I definitely support adding first-class support for different batch selectors in the protocol, especially a size-based selector. I can see it benefiting the following use cases:

1) Tasks that need an aggregate result as soon as the privacy guarantee is met, rather than waiting for a fixed interval to elapse.
2) Tasks that need similarly sized batches (and therefore a similar signal-to-noise ratio) in order to compare results across batches.
3) Tasks that need a strong privacy guarantee based on batch size, e.g., central differential privacy.

With batch-size-based collection, aggregators just need to make sure that each client report falls into exactly one batch, and that a batch is emitted only once its size has reached `min_batch_size`. This selection is easier to understand and arguably simpler to implement. It is also not subject to the privacy concerns associated with interval slicing, as mentioned in issue195 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195).
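
To make the rule above concrete, here is a minimal Python sketch, assuming a single in-memory aggregator. The names `FixedSizeBatcher`, `add_report`, and `ready`, and the use of a counter for batch IDs, are my own illustration and do not come from the draft:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FixedSizeBatcher:
    """Hypothetical helper: each report lands in exactly one batch."""

    min_batch_size: int
    _next_batch_id: int = 0
    _open_batch: list[bytes] = field(default_factory=list)
    _seen_report_ids: set[bytes] = field(default_factory=set)
    # Batches that have reached min_batch_size, keyed by batch ID.
    ready: dict[int, list[bytes]] = field(default_factory=dict)

    def add_report(self, report_id: bytes) -> bool:
        """Return True if the report was accepted, False if it is a replay."""
        if report_id in self._seen_report_ids:
            return False  # a report may contribute to only one batch
        self._seen_report_ids.add(report_id)
        self._open_batch.append(report_id)
        if len(self._open_batch) >= self.min_batch_size:
            # The size threshold is met, so the batch can be emitted.
            self.ready[self._next_batch_id] = self._open_batch
            self._next_batch_id += 1
            self._open_batch = []
        return True


batcher = FixedSizeBatcher(min_batch_size=3)
for i in range(7):
    batcher.add_report(bytes([i]))
```

With seven reports and `min_batch_size=3`, two batches become ready and the seventh report waits in the open batch until two more arrive.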

I think Chris's straw man design is a good starting point. For the `BatchSelector.fixed` type, the collector creates a `batch_id` to identify a collection, so multiple requests for the same `batch_id` return the same result (if `max_batch_lifetime` > 1).

Upon receiving a `CollectReq` of the fixed type, the leader performs {{batch-parameter-validation}} but skips any interval-related checks; instead, it checks whether the same `batch_id` has been collected before. The leader then begins working with the helpers to prepare shares for this `batch_id` (or continues this process, depending on the VDAF).
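
As a rough sketch of the leader-side check described above (not a proposal for spec text): the class `FixedCollectTracker`, the `compute_result` callback, and the choice of `PermissionError` are hypothetical, and the sketch ignores `agg_param` and all other validation steps.

```python
from __future__ import annotations

from typing import Callable


class FixedCollectTracker:
    """Hypothetical leader state: how often each batch_id has been collected."""

    def __init__(self, max_batch_lifetime: int) -> None:
        self.max_batch_lifetime = max_batch_lifetime
        self._collect_counts: dict[int, int] = {}
        self._results: dict[int, int] = {}

    def handle_collect(self, batch_id: int,
                       compute_result: Callable[[int], int]) -> int:
        count = self._collect_counts.get(batch_id, 0)
        if count >= self.max_batch_lifetime:
            # Replaces the interval checks: reject over-collected batches.
            raise PermissionError(f"batch {batch_id} exceeded max_batch_lifetime")
        if batch_id not in self._results:
            # First collection of this batch_id: run the aggregation.
            self._results[batch_id] = compute_result(batch_id)
        self._collect_counts[batch_id] = count + 1
        # Repeated requests for the same batch_id see the same result.
        return self._results[batch_id]


tracker = FixedCollectTracker(max_batch_lifetime=2)
first = tracker.handle_collect(7, lambda batch_id: 42)
second = tracker.handle_collect(7, lambda batch_id: 99)  # served from cache
```

Here `second` equals `first` because the result is cached per `batch_id`; a third request for batch 7 would be rejected.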

In {{aggregate-flow}}, instead of sorting each report share into interval-based buckets, the leader and helper can save them to an `aggregate_share` identified by `aggregate_job_id`. The leader can send the `batch_id` along with the `aggregate_job_id` to helpers in `AggregateReq`, so that a helper can associate a `batch_id` with multiple `aggregate_job_ids`. The same `batch_id` can then be used to collect the prepared `aggregate_shares` during the {{collect-aggregate}} flow.

Alternatively, the leader can associate a `batch_id` with `aggregate_job_ids` internally, then send helpers the list of `aggregate_job_ids` for that `batch_id` during the {{collect-aggregate}} flow. This way `AggregateReq` can remain unchanged.
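
A small Python sketch of this second option, assuming aggregate shares can be modelled as single integers modulo a small prime (real shares are vectors of field elements). `LeaderJobIndex`, `helper_aggregate_share`, and the job ID values are hypothetical names for illustration only:

```python
from __future__ import annotations


class LeaderJobIndex:
    """Hypothetical leader-side map from batch_id to its aggregate_job_ids."""

    def __init__(self) -> None:
        self._jobs_by_batch: dict[int, list[bytes]] = {}

    def record_job(self, batch_id: int, aggregate_job_id: bytes) -> None:
        self._jobs_by_batch.setdefault(batch_id, []).append(aggregate_job_id)

    def jobs_for(self, batch_id: int) -> list[bytes]:
        # The list the leader would send to helpers at collect time.
        return list(self._jobs_by_batch.get(batch_id, []))


def helper_aggregate_share(shares_by_job: dict[bytes, int],
                           aggregate_job_ids: list[bytes],
                           modulus: int) -> int:
    """Helper combines its per-job shares for exactly the listed jobs."""
    return sum(shares_by_job[job_id] for job_id in aggregate_job_ids) % modulus


index = LeaderJobIndex()
index.record_job(1, b"job-a")
index.record_job(1, b"job-b")
index.record_job(2, b"job-c")

helper_shares = {b"job-a": 5, b"job-b": 9, b"job-c": 4}
share = helper_aggregate_share(helper_shares, index.jobs_for(1), modulus=11)
```

The helper never needs to learn the `batch_id` in this variant; it only sees the list of job IDs to combine.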

Also, `batch_interval` is used as AAD in Aggregate Share Encryption (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/blob/main/draft-ietf-ppm-dap.md#aggregate-share-encryption-aggregate-share-encrypt). For the `fixed` selector, `batch_id` or a hash of the `aggregate_job_ids` list should be used instead.
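
To illustrate the two AAD options, a hedged Python sketch follows. The exact AAD layout is an assumption on my part (a task ID followed by the batch-identifying bytes), as are the function names, SHA-256, and the length-prefixed encoding of job IDs; both aggregators would also need to agree on the order of the list.

```python
from __future__ import annotations

import hashlib
import struct


def aad_from_batch_id(task_id: bytes, batch_id: int) -> bytes:
    """Option 1: bind the ciphertext to the batch_id (encoded as a uint64)."""
    return task_id + struct.pack(">Q", batch_id)


def aad_from_job_ids(task_id: bytes, aggregate_job_ids: list[bytes]) -> bytes:
    """Option 2: bind the ciphertext to a hash of the aggregate_job_ids list."""
    digest = hashlib.sha256()
    for job_id in aggregate_job_ids:
        # Length-prefix each ID so that different lists cannot collide
        # by shifting bytes between adjacent entries.
        digest.update(struct.pack(">H", len(job_id)))
        digest.update(job_id)
    return task_id + digest.digest()


task_id = bytes(32)  # placeholder task ID for the example
aad_a = aad_from_batch_id(task_id, 1)
aad_b = aad_from_job_ids(task_id, [b"job-a", b"job-b"])
```

Either value would then be passed as the AAD where `batch_interval` is used today.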

Regards,
Shan

> From: Christopher Patton <cpatton@cloudflare.com>
> Subject: [Ppm] Batch selection and use cases for DAP
> Date: 30 June 2022 at 00:21:47 BST
> To: ppm <ppm@ietf.org>
> 
> 
> Hi all,
> 
> The current version of DAP prescribes a particular method of sorting reports into batches for aggregation. There are a couple of GitHub issues that describe use cases for which this method is not well-suited. First-class support for these use cases would require protocol changes. While considering if/how to change it, I think it would be helpful to take a step back and ask ourselves if there are any additional use cases to consider.
> 
> First, a quick recap of how batches are currently defined:
> * Reports are generated by Clients and uploaded to the Leader. Each `Report` has a timestamp.
> * In its collect request (i.e., HTTP request with a `CollectReq`), the Collector specifies a "batch interval", which defines the start and end time for reports that will be aggregated.
> * Upon receiving this request from the Collector, the Leader picks a batch of reports with timestamps that fall in the batch interval.
> * The Leader and Helper aggregate the batch. (Aggregation manifests as a sequence of `AggregateInitializeReq` / `AggregateContinueReq` flows, followed by a single `AggregateShareReq` in order to get the Helper's aggregate share for the entire batch.)
> 
> Observe that the batch itself is chosen by the Leader; the Collector merely specifies criteria for reports that are "valid" for that batch -- namely, that the report timestamp falls in the batch interval. Other criteria for selecting batches are possible, as I'll explain below.
> 
> Use case #1
> The current "batch selector" is well-suited for telemetry use cases where DAP is used to aggregate long-running time-series data. (For those familiar with Prometheus (https://prometheus.io), think of a dashboard you would build to monitor how long it takes browsers to download and render a page of your website.) However, it has some limitations that make other use cases much more difficult.
> 
> Use case #2
> As EKR points out in issue183 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/183), it would also be useful to be able to "filter" aggregate results based on metadata associated with reports. For example, one might need to break down results by user agent, geographical region, software version, etc. Supporting this functionality requires a different "batch selector", one that also accounts for additional dimensions along which batches can be sliced.
> 
> Use case #3
> In issue273 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/273), Shan Wang points to an altogether different (and, arguably, much simpler) batch selection criterion: Instead of sorting reports into batch intervals, we may simply want to ensure that batches are pairwise disjoint. Moreover, our application might require that each batch is exactly the same size (or at least within some small threshold).
> 
> Today, DAP only has first-class support for #1. Use cases #2 and #3 can kind of be implemented, but it would be painful. For my part, I would be in favor of adding protocol mechanisms in order to provide first-class support for additional use cases that are likely to be common. As a straw man, consider the following revised `CollectReq`:
> 
> ```
> + enum {
> +   reserved(0),
> +   interval(1), // For use case #1
> +   interval-metadata(2), // Use case #2
> +   fixed(3), // Use case #3
> + } BatchSelector;
> 
> struct {
>   TaskID task_id;
> - Interval batch_interval;
> + BatchSelector batch_selector;
> + select (batch_selector) {
> +   case interval: 
> +      Interval batch_interval;
> +   case interval-metadata:
> +      Interval batch_interval;
> +      opaque metadata<0..2^8-1>; // "User-Agent", etc.
> +   case fixed:
> +      uint64 batch_id;
> + };
>   opaque agg_param<0..2^16-1>; // VDAF aggregation parameter
> } CollectReq;
> ```
> 
> What this expresses is that the "batch interval" has been replaced by one of several "batch selectors", each designed to support a different (set of) use case(s). Each has some associated parameters used by the Leader to guide report selection. For example, the `fixed` selector encodes the "batch ID", as defined in issue273. It seems to me that something like this could work. Things to consider:
> * Both Aggregators need to be able to enforce that the batch meets the criteria specified by the Collector.
> * There are implications for storage requirements for the Aggregators.
> * There is also issue195 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195), in which Chris Wood points out some privacy implications regarding the flexibility afforded to the Collector in choosing the batch selection criteria. (This issue needs to be addressed in any case.)
> 
> Anyway ... thoughts? Specifically:
> (a) Is there a use case we're missing here?
> (b) What do you think of making changes to the protocol in order to support additional use cases?
> 
> Cheers,
> Chris P.
> 
>