Re: [Ppm] Batch selection and use cases for DAP

Tim Geoghegan <timg@letsencrypt.org> Tue, 05 July 2022 16:32 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/ppm/pvcoPBm20n9k133Fq6DgmwDomfo>

I have one hair to split with Chris' summary of the current state of play:

> Upon receiving this request from the Collector, the Leader picks a batch
> of reports with timestamps that fall in the batch interval.

The important nuance is that the leader and helper independently figure out
which reports fall within the collector's chosen batch interval.
Obviously, the helper's view of which reports exist is determined by which
reports the leader forwards to it (that is, until we move to a split upload
model[1]), but the leader doesn't send the helper a list of report IDs (I
believe this is an important property because that list would be quite
large). I bring this up so that everyone remembers that the protocol must
guarantee that leader and helper agree on the set of reports included in an
aggregation.
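
To make that nuance concrete, here is a minimal Python sketch of two
aggregators independently selecting reports for the same batch interval and
then comparing a count and a checksum. The `Report` type, the half-open
interval, and the XOR-of-SHA-256 checksum are all assumptions made for
illustration, not something taken from the draft text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: bytes  # hypothetical unique identifier for the report
    timestamp: int    # seconds since the epoch


def select(reports, start, duration):
    """Return the reports whose timestamp lies in [start, start + duration)."""
    return [r for r in reports if start <= r.timestamp < start + duration]


def summarize(batch):
    """Return (count, checksum); the checksum XORs SHA-256 digests, so it
    does not depend on the order in which reports are visited."""
    checksum = bytes(32)
    for r in batch:
        digest = hashlib.sha256(r.report_id).digest()
        checksum = bytes(a ^ b for a, b in zip(checksum, digest))
    return len(batch), checksum


leader_store = [Report(b"a", 100), Report(b"b", 150), Report(b"c", 250)]
helper_store = list(reversed(leader_store))  # same reports, different order

# Each side evaluates the same interval against its own store.
leader_summary = summarize(select(leader_store, 100, 100))
helper_summary = summarize(select(helper_store, 100, 100))
```

If the two summaries match, both sides selected the same reports without
the leader ever sending a list of report IDs.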

With that in mind, I wonder if the existing
`AggregateShareReq.batch_interval` sent from leader to helper suffices for
use case #3 (this is an alternative to Shan Wang's idea of surfacing the
aggregation job ID <=> batch ID mapping in the protocol messages). We can
impose a total ordering on reports, so once the leader has identified a set
of reports that satisfies the desired batch size, it should be able to
devise an interval that captures at least (or I think exactly) those reports
and send that to the helper. My thinking is that we can minimize the
protocol text and code needed for the helper to support these different
modes of aggregation.
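
A rough Python sketch of that idea follows. The ordering key (timestamp,
then a hypothetical `report_id`), the half-open interval, and the
`pick_interval` helper are assumptions for illustration; the sketch also
ignores any alignment constraints the protocol places on intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: bytes  # hypothetical unique identifier, used as a tie-breaker
    timestamp: int    # seconds since the epoch


def order_key(report):
    """Total ordering on reports: by timestamp, then by report ID."""
    return (report.timestamp, report.report_id)


def pick_interval(reports, batch_size):
    """Take the batch_size earliest reports and return
    (start, duration, captured), where captured is every report that the
    resulting interval [start, start + duration) actually covers."""
    ordered = sorted(reports, key=order_key)
    if len(ordered) < batch_size:
        return None  # not enough reports yet
    chosen = ordered[:batch_size]
    start = chosen[0].timestamp
    end = chosen[-1].timestamp + 1  # half-open upper bound
    captured = [r for r in ordered if start <= r.timestamp < end]
    return start, end - start, captured


reports = [
    Report(b"a", 10),
    Report(b"b", 20),
    Report(b"c", 20),
    Report(b"d", 30),
]
start, duration, captured = pick_interval(reports, 2)
```

Here the two earliest reports have timestamps 10 and 20, so the interval is
[10, 21), but it captures three reports because two of them share the
timestamp 20. That is the gap between "at least" and "exactly": an interval
alone only pins down exactly the chosen reports when the cut does not fall
between reports with equal timestamps.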

I have two thoughts on use case #2: First, if we want the leader to be able
to select reports based on metadata provided by the collector, then I think
the `Report`[2] message needs to include some metadata that can be matched
against `CollectReq.interval-metadata.metadata`. Would we then also need
DAP to define some kind of query language, so that a collector could
express something like "aggregate over the reports where `report.foo = bar
&& 0 <= report.qux <= 100`"?
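
For illustration only, this is what evaluating that example predicate could
look like on the leader, in Python. The `metadata` dictionary on `Report`
and the field names `foo` and `qux` are hypothetical; nothing like this
exists in the `Report` message today.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    timestamp: int
    metadata: dict = field(default_factory=dict)  # hypothetical report metadata


def matches(report):
    """Evaluate `report.foo = bar && 0 <= report.qux <= 100`.
    A report that lacks either field does not match."""
    md = report.metadata
    return md.get("foo") == "bar" and "qux" in md and 0 <= md["qux"] <= 100


reports = [
    Report(1, {"foo": "bar", "qux": 50}),
    Report(2, {"foo": "bar", "qux": 101}),
    Report(3, {"foo": "baz", "qux": 5}),
    Report(4, {"foo": "bar"}),
]
selected = [r for r in reports if matches(r)]
```

Even this small example has to decide what a missing field means, which is
the kind of detail a query language in DAP would need to specify.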

Second, I worry that this means aggregators can't begin accumulating
prepared output shares into aggregates until they get the
`CollectReq.interval-metadata.metadata` value. If a deployment knows it
wants to break out aggregations by something like the client's country,
could it not define distinct tasks for each value? I'm perfectly willing to
be convinced that this is not practical, but I feel it's important to
consider doing nothing. I also think it'd be nice to have potential
deployers of DAP spell out more explicitly what kinds of aggregation use
cases they have.
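
As a sketch of the "distinct tasks" alternative, in Python: if each country
maps to its own task, an aggregator can fold output shares into a running
aggregate as soon as they are prepared, without waiting for a collect
request. The task IDs and the use of plain integers as stand-ins for output
shares are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical mapping: one task per country, fixed at deployment time.
TASK_FOR_COUNTRY = {"US": "task-us", "DE": "task-de"}

# task ID -> running aggregate (an integer stand-in for an aggregate share)
aggregates = defaultdict(int)


def accumulate(task_id, output_share):
    """Fold a prepared output share into its task's running aggregate."""
    aggregates[task_id] += output_share


# Output shares are accumulated as they arrive, keyed only by task.
for country, share in [("US", 3), ("DE", 5), ("US", 4)]:
    accumulate(TASK_FOR_COUNTRY[country], share)
```

The cost is that the set of breakdowns has to be known up front, which is
why I'd like to hear from deployers whether that is realistic.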

Thanks,
Tim

[1] https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/130
[2] https://datatracker.ietf.org/doc/html/draft-ietf-ppm-dap#section-4.2.2

On Tue, Jul 5, 2022 at 1:42 AM Simon Friedberger <simon@mozilla.com> wrote:

> Not the answer to your questions (a) or (b) but another question:
>
> We are assuming that correctness of results requires
> all-honest-aggregators but privacy is protected by a
> single-honest-aggregator.
> So, could report selection be simplified by reducing the scope to the
> interaction between Collector and Leader?
> I.e., the leader would select reports based on what the collector wants to
> achieve and the aggregators only have to check the privacy criteria, so
> currently, min_batch_size and max_batch_lifetime.
>
> Best,
> Simon
>
> On Thu, Jun 30, 2022 at 1:22 AM Christopher Patton
> <cpatton=40cloudflare.com@dmarc.ietf.org> wrote:
>
>> Hi all,
>>
>> The current version of DAP prescribes a particular method of sorting
>> reports into batches for aggregation. There are a couple of GitHub issues
>> that describe use cases for which this method is not well-suited.
>> First-class support for these use cases would require protocol changes.
>> While considering if/how to change it, I think it would be helpful to take
>> a step back and ask ourselves if there are any additional use cases to
>> consider.
>>
>> First, a quick recap of how batches are currently defined:
>> * Reports are generated by Clients and uploaded to the Leader. Each
>> `Report` has a timestamp.
>> * In its collect request (i.e., an HTTP request with a `CollectReq`), the
>> Collector specifies a "batch interval", which defines the start and end
>> time for reports that will be aggregated.
>> * Upon receiving this request from the Collector, the Leader picks a
>> batch of reports with timestamps that fall in the batch interval.
>> * The Leader and Helper aggregate the batch. (Aggregation manifests as a
>> sequence of `AggregateInitializeReq` / `AggregateContinueReq` flows,
>> followed by a single `AggregateShareReq` in order to get the Helper's
>> aggregate share for the entire batch.)
>>
>> Observe that the batch itself is chosen by the Leader; the Collector
>> merely specifies criteria for reports that are "valid" for that batch --
>> namely, that the report timestamp falls in the batch interval. Other
>> criteria for selecting batches are possible, as I'll explain below.
>>
>> Use case #1
>> The current "batch selector" is well-suited for telemetry use cases where
>> DAP is used to aggregate long-running time-series data. (For those familiar
>> with Prometheus (https://prometheus.io), think of a dashboard you would
>> build to monitor how long it takes browsers to download and render a page
>> of your website.) However, it has some limitations that make other use cases
>> much more difficult.
>>
>> Use case #2
>> As EKR points out in issue183 (
>> https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/183), it would
>> also be useful to be able to "filter" aggregate results based on metadata
>> associated with reports. For example, one might need to break down results
>> by user agent, geographical region, software version, etc. Supporting this
>> functionality requires a different "batch selector", one that also accounts
>> for additional dimensions along which batches can be sliced.
>>
>> Use case #3
>> In issue273 (https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/273),
>> Shan Wang points to an altogether different (and, arguably, much simpler)
>> batch selection criterion: Instead of sorting reports into batch intervals,
>> we may simply want to ensure that batches are pairwise disjoint. Moreover,
>> our application might require that each batch is exactly the same size (or
>> at least within some small threshold).
>>
>> Today, DAP only has first-class support for #1. Use cases #2 and #3 can
>> kind of be implemented, but it would be painful. For my part, I would be in
>> favor of adding protocol mechanisms in order to provide first-class support
>> for additional use cases that are likely to be common. As a straw man,
>> consider the following revised `CollectReq`:
>>
>> ```
>> + enum {
>> +   reserved(0),
>> +   interval(1), // For use case #1
>> +   interval-metadata(2), // Use case #2
>> +   fixed(3), // Use case #3
>> + } BatchSelector;
>>
>> struct {
>>   TaskID task_id;
>> - Interval batch_interval;
>> + BatchSelector batch_selector;
>> + select (batch_selector) {
>> +   case interval:
>> +      Interval batch_interval;
>> +   case interval-metadata:
>> +      Interval batch_interval;
>> +      opaque metadata<0..2^8-1>; // "User-Agent", etc.
>> +   case fixed:
>> +      uint64 batch_id;
>> + };
>>   opaque agg_param<0..2^16-1>; // VDAF aggregation parameter
>> } CollectReq;
>> ```
>>
>> What this expresses is that the "batch interval" has been replaced by one
>> of several "batch selectors", each designed to support a different (set of)
>> use case(s). Each has some associated parameters used by the Leader to
>> guide report selection. For example, the `fixed` selector encodes the
>> "batch ID", as defined in issue273. It seems to me that something like this
>> could work. Things to consider:
>> * Both Aggregators need to be able to enforce that the batch meets the
>> criteria specified by the Collector.
>> * There are implications for storage requirements for the Aggregators.
>> * There is also issue195 (
>> https://github.com/ietf-wg-ppm/draft-ietf-ppm-dap/issues/195) in which
>> Chris Wood points out some privacy implications regarding the flexibility
>> afforded to the Collector in choosing the batch selection criteria. (This
>> issue needs to be addressed in any case.)
>>
>> Anyway ... thoughts? Specifically:
>> (a) Is there a use case we're missing here?
>> (b) What do you think of making changes to the protocol in order to
>> support additional use cases?
>>
>> Cheers,
>> Chris P.
>>
>> --
>> Ppm mailing list
>> Ppm@ietf.org
>> https://www.ietf.org/mailman/listinfo/ppm
>>