Re: [Ppm] Batch selection and use cases for DAP

Christopher Patton <cpatton@cloudflare.com> Fri, 15 July 2022 00:49 UTC

From: Christopher Patton <cpatton@cloudflare.com>
Date: Thu, 14 Jul 2022 17:49:32 -0700
Message-ID: <CAG2Zi21TXZc+QwYtpm=5cqkW=5C3XPmWy+E3xXiBL2rAqgaDvQ@mail.gmail.com>
To: Tim Geoghegan <timg=40letsencrypt.org@dmarc.ietf.org>
Cc: Simon Friedberger <simon@mozilla.com>, Christopher Patton <cpatton=40cloudflare.com@dmarc.ietf.org>, ppm <ppm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ppm/FoqT2_iwKDL_G0_bImeg9kLWiqg>
Subject: Re: [Ppm] Batch selection and use cases for DAP

Tim,

> I have one hair to split with Chris' summary of the current state of play:
>
> > Upon receiving this request from the Collector, the Leader picks a batch
> > of reports with timestamps that fall in the batch interval.
>
> The important nuance is that the leader and helper independently figure
> out what set of reports fall within the collector's chosen batch interval.
> Obviously, the helper's view of what reports exist is determined by what
> reports the leader forwards to it (that is, until we move to a split upload
> model[1]), but the leader doesn't send the helper a list of report IDs (I
> believe this is an important property because that list would be quite
> large). I bring this up so that everyone remembers that the protocol must
> guarantee that leader and helper agree on the set of reports included in an
> aggregation.
>

Yeah, this is definitely worth calling out. To put it more precisely: only
reports that are deemed valid are included in the batch, and there are many
reasons why the Helper might choose to reject a report.
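
To make the agreement problem concrete, here is a minimal sketch of how each
aggregator could independently select the reports for a batch interval and
then compare a compact summary instead of exchanging a list of report IDs.
The `Report` fields, the function names, and the XOR-of-hashes checksum are
assumptions chosen for illustration, not text from the draft.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: bytes  # hypothetical unique identifier for the report
    timestamp: int    # seconds since the epoch


def select_batch(reports, start, duration, is_valid):
    """Return the reports in [start, start + duration) that this aggregator accepts."""
    return [
        r for r in reports
        if start <= r.timestamp < start + duration and is_valid(r)
    ]


def batch_summary(batch):
    """Return (count, checksum); the checksum XORs SHA-256(report_id) over the batch.

    XOR is order-independent, so the summary does not depend on the order in
    which an aggregator happened to process the reports.
    """
    checksum = bytes(32)
    for r in batch:
        digest = hashlib.sha256(r.report_id).digest()
        checksum = bytes(a ^ b for a, b in zip(checksum, digest))
    return len(batch), checksum


reports = [Report(bytes([i]) * 16, 1000 + 10 * i) for i in range(5)]

# The Leader accepts everything in the interval [1000, 1040).
leader_batch = select_batch(reports, 1000, 40, lambda r: True)

# The Helper rejects one report that the Leader accepted.
helper_batch = select_batch(
    reports, 1000, 40, lambda r: r.report_id != bytes([2]) * 16
)

# The summaries differ, so the mismatch is detected without sending ID lists.
summaries_match = batch_summary(leader_batch) == batch_summary(helper_batch)
```

A mismatch only tells the aggregators that their views diverge; it does not
say which report caused it.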


> With that in mind, I wonder if the existing
> `AggregateShareReq.batch_interval` sent from leader to helper suffices for
> #3 (this is an alternative to Shan Wang's idea of surfacing the aggregation
> job ID <=> batch ID mapping in the protocol messages). We can impose a
> total ordering on reports, so once the leader has identified a set of
> reports that satisfy the desired batch size, it should be able to devise an
> interval that captures at least (or I think exactly) those reports and send
> that to the helper. My thinking is that we can minimize the protocol text
> and code needed for the helper to support these different modes of
> aggregations.
>

We previously rejected the idea of imposing a total ordering on reports.
You might recall the reason: it amounts to a distributed computing problem
that is a pain in the neck for the Leader to solve. (Imagine a Leader
composed of a number of distributed worker nodes that have to coordinate to
order the reports.)
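
A related wrinkle, sketched below under my own assumptions: timestamps alone
only give a partial order, so an interval built to cover a batch of a given
size captures at least that many reports, not necessarily exactly that many.
The function name `covering_interval` is hypothetical.

```python
def covering_interval(timestamps, batch_size):
    """Return (start, end, captured) for the half-open interval [start, end)
    that covers the batch_size earliest timestamps.

    captured counts every report whose timestamp falls inside the interval,
    which can exceed batch_size when timestamps collide.
    """
    ordered = sorted(timestamps)
    if len(ordered) < batch_size:
        raise ValueError("not enough reports to fill the batch")
    start = ordered[0]
    end = ordered[batch_size - 1] + 1
    captured = sum(1 for t in ordered if start <= t < end)
    return start, end, captured


# Three reports share timestamp 200, so asking for a batch of 3 captures 5.
start, end, captured = covering_interval([100, 100, 200, 200, 200, 300], 3)
```

Getting "exactly" would need a tie-breaker that all parties apply the same
way, which is where the coordination cost shows up.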



> I have two thoughts on use case #2: First, if we want the leader to be
> able to select reports based on metadata provided by the collector, then I
> think the `Report`[2] message needs to include some metadata that can be
> matched against `CollectReq.interval-metadata.metadata`. Would we then also
> need DAP to define some kind of query language, so that a collector could
> express something like "aggregate over the reports where `report.foo = bar
> && 0 <= report.qux <= 100`"?
>

Yeah, something like that would be needed.
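
For a sense of scale, here is a minimal sketch of the smallest "query" that
would cover Tim's example predicate. The `RangeQuery` type and the idea that
report metadata is a flat dictionary are assumptions for illustration; nothing
like this exists in the draft today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """Hypothetical query: foo must equal `foo` and qux must lie in [qux_min, qux_max]."""
    foo: str
    qux_min: int
    qux_max: int

    def matches(self, metadata: dict) -> bool:
        qux = metadata.get("qux")
        return (
            metadata.get("foo") == self.foo
            and isinstance(qux, int)
            and self.qux_min <= qux <= self.qux_max
        )


# Encodes: report.foo = bar && 0 <= report.qux <= 100
query = RangeQuery(foo="bar", qux_min=0, qux_max=100)

reports_metadata = [
    {"foo": "bar", "qux": 42},
    {"foo": "bar", "qux": 101},
    {"foo": "baz", "qux": 7},
]
selected = [m for m in reports_metadata if query.matches(m)]
```

Both aggregators would have to evaluate the same predicate over the same
metadata to end up with the same set of reports.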


> Second, I worry that this means aggregators can't begin accumulating
> prepared output shares into aggregates until they get the
> `CollectReq.interval-metadata.metadata` value. If a deployment knows it
> wants to break out aggregations by something like the client's country,
> could it not define distinct tasks for each value? I'm perfectly willing to
> be convinced that this is not practical, but I feel it's important to
> consider doing nothing, and also I think it'd be nice to have potential
> deployers of DAP spell out more explicitly what kinds of aggregation use
> cases they have.
>

I agree. It would be useful if we didn't have to slice the data arbitrarily,
so that output shares can be "pre-merged" as soon as they are prepared, like
we can do for batch windows today.
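
Here is a rough sketch of what I mean by "pre-merging", combined with Tim's
suggestion of one task per country. Output shares are added into a running sum
keyed by (task, batch window) as soon as they are prepared, so nothing has to
wait for the collect request. The scalar shares, the `MODULUS` value, and the
class and method names are simplifying assumptions; real output shares are
vectors over the VDAF's field.

```python
from collections import defaultdict

MODULUS = 2**61 - 1  # placeholder field modulus, chosen only for the sketch


def window_start(timestamp, window):
    """Round a timestamp down to the start of its batch window."""
    return timestamp - (timestamp % window)


class Accumulator:
    def __init__(self, window):
        self.window = window
        self.sums = defaultdict(int)  # (task_id, window start) -> running sum

    def add(self, task_id, timestamp, output_share):
        """Merge one prepared output share into its window's running sum."""
        key = (task_id, window_start(timestamp, self.window))
        self.sums[key] = (self.sums[key] + output_share) % MODULUS

    def aggregate_share(self, task_id, start, duration):
        """Sum the pre-merged windows covering [start, start + duration).

        Assumes start and duration are multiples of the window size.
        """
        total = 0
        for w in range(start, start + duration, self.window):
            total = (total + self.sums.get((task_id, w), 0)) % MODULUS
        return total


acc = Accumulator(window=3600)
acc.add("task-us", 10, 5)
acc.add("task-us", 3700, 7)
acc.add("task-de", 20, 11)

us_share = acc.aggregate_share("task-us", 0, 7200)
de_share = acc.aggregate_share("task-de", 0, 3600)
```

With arbitrary metadata queries, the key for `sums` isn't known until the
collect request arrives, which is exactly the problem Tim describes.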

Chris P.