Re: [Ppm] Revisit the operational requirements for the Helper?

Martin Thomson <mt@lowentropy.net> Wed, 08 December 2021 04:30 UTC

Thanks Tim,

I had not internalized the naming scheme, so you are right.  It's clear(ish), but "task" is entirely the wrong concept to pair this with.  Of course, if it were structured in the way I describe, then it would be fine.  It's not, so let's talk about how it is structured.

The PR sets a narrow window around the current time over which reports can be accepted: a little into the past, a different amount into the future.  That isn't really helpful as far as anti-replay is concerned.  All you need here is a rolling window over which you are willing to track nonces for the purposes of anti-replay.  (Multiple discrete slices of time is probably best, as that allows entries to fall off the back cheaply.)  Under that model, it's OK if the report was generated days or even weeks ago, as long as you are still tracking nonces for that period.  And if you want your ingress system to be able to take in reports and just stash them for processing later, this design forces you to consider that ingress system part of what you trust: the check is made against the current time, so it has to happen in real time, at ingress.

There's no reason to do real-time checks.  You could build an ingress system that just sucks reports in and stores them without thinking too much.  Then, when you initiate processing, you pass them to trusted components that do the anti-replay checks and the other validation tasks.  None of that requires invoking now().
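
As a rough sketch of that split (the function names and the seen_nonces store here are hypothetical, just to show the shape):

# Untrusted ingress tier: persist raw reports, no validation, no now().
def ingest(report, storage):
    storage.append(report)

# Trusted processing stage: anti-replay and the other checks happen here,
# against the report's own nonce rather than the wall clock.
def process(reports, seen_nonces):
    accepted = []
    for report in reports:
        if report.nonce in seen_nonces:
            continue    # replayed report, drop it
        seen_nonces.add(report.nonce)
        accepted.append(report)
    return accepted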

You also don't really need to worry about time tolerance factors, at least not as described in the PR.  You need some allowance for reports whose timestamps are a little in the future, because clock skew is a real thing, but you can get that by including a little overflow into the future.  The only place you really need the current time is in deciding which nonce windows you are still tracking, and therefore which reports are too old to accept.

The design I implemented for TLS looks something like this:

At any given time you have T windows of time, each W seconds wide.  That means you accept reports over a span of about T*W seconds around the current time.  Each window has a collection of nonces.  As every report carries a timestamp, you use that to find which collection applies (w = floor(t / W)).  A new report is accepted and its nonce added to the corresponding collection if it is not already present; otherwise it is dropped as a duplicate.  On a fixed W-second interval (i.e., when (now() + skew_adjustment) % W == 0) you empty out the oldest window.  Anything older than now() + skew_adjustment - T*W or newer than now() + skew_adjustment (careful with fenceposts here) is automatically dropped.

Note that this means that while the window is T*W seconds, you are always accepting timestamps between skew_adjustment and W+skew_adjustment into the future.  The integrity of the system doesn't care about time travel, so that's fine; you just want to avoid rejecting reports merely because a client has a dodgy clock.
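
Here's a minimal sketch of that scheme in Python.  It is an illustration only, not the TLS code; the class and method names are mine, T, W and skew_adjustment are the parameters described above, and stale windows are evicted lazily on each accept rather than on a timer, which amounts to the same thing.

import time

class ReplayWindows:
    def __init__(self, T, W, skew_adjustment=0):
        self.T = T                  # number of windows tracked
        self.W = W                  # width of each window, in seconds
        self.skew = skew_adjustment
        self.windows = {}           # window index -> set of nonces seen

    def accept(self, t, nonce, now=None):
        now = time.time() if now is None else now
        newest = int((now + self.skew) // self.W)  # newest acceptable window
        oldest = newest - self.T + 1               # careful with fenceposts
        w = int(t // self.W)                       # window this report falls in
        if w < oldest or w > newest:
            return False    # too old, or too far into the future
        nonces = self.windows.setdefault(w, set())
        if nonce in nonces:
            return False    # duplicate nonce: a replay
        nonces.add(nonce)
        # Drop any window that has fallen off the back.
        for stale in [k for k in self.windows if k < oldest]:
            del self.windows[stale]
        return True

With T = 4 and W = 600, for example, this tracks nonces over roughly the last 40 minutes and accepts timestamps up to about W + skew_adjustment seconds into the future, as described above.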

Another aspect of this is that you don't really care whether the time is accurate or not.  Clients could give you a time that is rounded to W and things would work fine under this scheme: you just end up leaning on the randomness a little harder.  (I say this because the fine details of the time series of events could, in some circumstances, be information that a customer of the PPM system might not want to entrust to the system.)
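
(As a concrete example, with W = 600 a client that rounds its timestamp down to 1638928800, a multiple of 600, still maps to window floor(1638928800 / 600) = 2731548, the same window as any precise timestamp in those ten minutes; only the sub-window detail is lost.)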

I'm not getting any of that from the pull request.

BTW, regarding the pull vs. push model for submissions, I remind you that we're not documenting a system; we're standardizing something that is hopefully useful for other things.


On Wed, Dec 8, 2021, at 11:37, Tim Geoghegan wrote:
> Hi Martin,
>
> I think you may be misunderstanding the purpose or scope of the task ID 
> (or perhaps I should say: the draft is not sufficiently clear about how 
> task ID works). You write:
>
>>You publicize a task ID at some time.  Clients produce reports over some period and when you have enough you close the task and generate an aggregate. 
>
> ...suggesting a 1:1 relationship between a task ID and an aggregate 
> emitted by a PPM deployment. Rather, a task ID describes a long-running 
> report collection and aggregation task, for which aggregates are 
> emitted periodically (the period or _aggregation window_ is one of the 
> parameters associated with a task ID). Further, many clients may be 
> contributing reports under a single task ID.
>
> So for example, imagine a web browser measuring whether users press a 
> certain button. The browser vendor would configure a PPM task and 
> distribute its parameters to two aggregators (n.b. the mechanism for 
> task parameter distribution is explicitly out of scope for the PPM 
> protocol). Then, the browser vendor ships a version of the browser to 
> all its users that knows about the task ID (the task ID could be 
> compiled into the browser or maybe the browser vendor has some secure 
> configuration mechanism). Browser installations then start submitting 
> reports to aggregators using the task ID.
>
>>Reports are generated on demand.  Perhaps you have a polling process that contacts clients at certain times, gives them a task ID and asks for a report.  In this model, reports might be generated all at once, which makes times less useful for avoiding collision.
>
> In PPM, clients push reports into aggregators rather than having 
> reports pulled (or scraped, as Prometheus would put it) from clients by 
> aggregators. This is necessary because clients will frequently be 
> behind NAT or egress only firewalls or unable to keep a port open for 
> metrics pulls for some other reason.
>
> Thanks,
> Tim
>
> On Tue, Dec 7, 2021 at 4:04 PM Martin Thomson <mt@lowentropy.net> wrote:
>> Thanks for putting this together Chris,
>> 
>> There is an implicit assumption in this design that reports are bound to a task that is known when the report is generated.  Given that, this design is probably not the best one we might come up with.
>> 
>> If you assume this operating model, there are two sort of natural ways to approach the report collection process:
>> 
>> 1. You publicize a task ID at some time.  Clients produce reports over some period and when you have enough you close the task and generate an aggregate.  In this model, individual reports would likely be distributed fairly evenly over the period, so using a timestamp as part of your nonce is a good way to improve collision resistance.
>> 
>> 2. Reports are generated on demand.  Perhaps you have a polling process that contacts clients at certain times, gives them a task ID and asks for a report.  In this model, reports might be generated all at once, which makes times less useful for avoiding collision.
>> 
>> In both cases, the task ID is what prevents an individual report from being used across multiple tasks.  Replay within a task is prevented by the uniqueness of the nonce, not the value of the timestamp.
>> 
>> Of course, you could reuse a task ID, but we might reasonably assume that the aggregation service will not accept that.  Or, you could design a system in which the task ID is usable for multiple aggregation runs, but that seems like it defeats the purpose of having a high-entropy task ID in the first place.  My thought is that it might be even better to have a leader assign task IDs so that uniqueness can be assured more easily.  If I'm right that the task ID is central to the anti-replay design, then that is a more robust design.
>> 
>> With that in mind, I'm not sure that you even need a timestamp at all.  At least for this design.  I think that if we want to consider designs where the use of a single report is not so strictly Boolean, then we could need timestamps.
>> 
>> On Wed, Dec 8, 2021, at 10:18, Christopher Patton wrote:
>> > Thanks Erik, yeah, you brought up this consideration early on in the 
>> > process, and it is indeed the reason we have this mechanism.
>> >
>> > Here's a PR for implementing Martin's suggestion: 
>> > https://github.com/abetterinternet/ppm-specification/pull/169
>> >
>> > Reviews appreciated!
>> > Chris P.
>> >
>> > On Mon, Dec 6, 2021 at 10:45 AM Erik Taubeneck <erik.taubeneck@gmail.com> wrote:
>> >> Martin said:
>> >> 
>> >>> Having independent enforcement of anti-replay seems like it is important enough for this state to persist beyond the scope of a task.
>> >> I strongly agree here, and it seems worthwhile to include in the spec. This will be important for other privacy mechanisms, beyond anti-replay. For example, if I understand correctly, the proposed WICG spec Attribution Reporting API with Aggregate Reports <https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md> aims to follow the PPM spec, and would require persistent state to manage a privacy budget. In addition, that API may even allow replay in a controlled manner within the privacy budget. From their proposal draft <https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md#privacy-budgeting>:
>> >> 
>> >> 
>> >>> Note: The design for the aggregation service also requires some amount of budgeting to ensure records aren’t processed too many times.
>> >> 
>> >> Charlie Harrison can likely provide more context on that specific proposal, but I agree overall that we likely want to support more than just anti-replay to enforce the privacy of the PPM results. 
>> >> 
>> >> Cheers,
>> >> 
>> >> Erik
>> >> 
>> >> 
>> >> 
>> >> On Sat, Dec 4, 2021 at 3:18 PM Christopher Patton <cpatton=40cloudflare.com@dmarc.ietf.org> wrote:
>> >>> Hi Martin, thanks for your thoughtful reply. I really like your suggestion. I'd like to hear what others think, and in absence of a reason not to take it I think we should make this a PR :)
>> >>> 
>> >>> Best,
>> >>> Chris P.
>> >>> 
>> >>> On Thu, Dec 2, 2021 at 9:28 PM Martin Thomson <mt@lowentropy.net> wrote:
>> >>>> Hi Chris,
>> >>>> 
>> >>>> When you say
>> >>>> 
>> >>>> > This removes the Leader's coordination problem, at the cost 
>> >>>> > of additional storage requirements. In fact, both Aggregators need to 
>> >>>> > store the set of processed nonces for the duration of a task.
>> >>>> 
>> >>>> Having independent enforcement of anti-replay seems like it is important enough for this state to persist beyond the scope of a task.
>> >>>> 
>> >>>> Maybe stepping back and looking at how anti-replay is managed would help.  You don't want the same thing to be counted more than once.  The original design requires a total ordering of inputs, with strict enforcement of that ordering.  That uses what is in effect a high-resolution timestamp to ensure that ordering is sensible.  It also means that tasks cannot be initiated out of order (or with overlapping inputs).  That seems unworkable to me.  Perfect global ordering, particularly when you consider helpers that serve multiple customers, is probably infeasible.
>> >>>> 
>> >>>> I think that #2 is the only reasonable outcome here.  However, you also need to use the timestamp (or part of it) to bound the time period during which you need to remember nonces.
>> >>>> 
>> >>>> That is, you use the time of an input to select from three different discard treatments:
>> >>>> 1. discard because it is too old
>> >>>> 2. discard because it is too new (which might be impossible, but you don't want something as simple as clock skew to be why your anti-replay was defeated)
>> >>>> 3. discard if it has the same nonce as something you have previously accepted
>> >>>> 
>> >>>> In code:
>> >>>> 
>> >>>> if t < now() - window:
>> >>>>     discard("too old")
>> >>>> elif t > now():
>> >>>>     discard("from the future")
>> >>>> elif nonce in nonces_in_window:
>> >>>>     discard("duplicate")
>> >>>> else:
>> >>>>     nonces_in_window.add(nonce)
>> >>>>     accept(measurement)
>> >>>> 
>> >>>> (There's an interesting potential attack here in that a client provides different nonces or timestamps to each helper with a goal of spoiling results. Accepting half a share could be disastrous.  The validation process catches those pretty naturally, I think, probably.  This is something we should be careful to check any time that independent processing occurs like this.)
>> >>>> 
>> >>>> Separately, I've been looking at less simplistic applications of something like anti-replay.  I hope to be able to talk about that in more detail soon, but it might be that the anti-replay design you have is too restrictive for what we have in mind, no matter the outcome of this specific decision.
>> >>>> 