[Wpack] On double-hashing (was: Re: About content-based origins)

Devin Mullins <twifkak@google.com> Wed, 25 March 2020 23:06 UTC

MIME-Version: 1.0
References: <260dfc2f-8399-483e-859d-08f92821c823@www.fastmail.com> <CANjwSimZAkAC0JJBjUjZr4k0514QRqDxBReOkq_AGTeGJ2OTzQ@mail.gmail.com>
In-Reply-To: <CANjwSimZAkAC0JJBjUjZr4k0514QRqDxBReOkq_AGTeGJ2OTzQ@mail.gmail.com>
From: Devin Mullins <twifkak@google.com>
Date: Wed, 25 Mar 2020 16:03:42 -0700
Message-ID: <CANjwSiniWmO+pTfFOdxW9tasy_eQiUiGwWvTsWF2KGR8yGtXqA@mail.gmail.com>
To: Martin Thomson <mt@lowentropy.net>
Cc: wpack@ietf.org
Content-Type: multipart/alternative; boundary="000000000000cdaa3905a1b5e221"
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/iujGXmhXaBrJN81zjTuZqondzww>
Subject: [Wpack] On double-hashing (was: Re: About content-based origins)

In light of my comment about racing the Sec-CO request, a colleague pointed
out a potential hole in the double-hash scheme. There may be flaws in what
follows; I haven't thought about it too much. <trite tone=apologetic>It may
be worth more clearly defining the threat model.</trite>

The publisher could delay responding to the Sec-CO request. In the interim,
the page could send its own hash to //publisher.example/this-is-my-hash.
Then the publisher would answer the delayed Sec-CO request with the
preimage that the bundle sent. This requires two things:

A way for the bundle to get its own hash. Ideas:

   1. document.location -- we could fix the ni scheme to be based on the
   double-hash
   2. a fetch of distributor.example/what-is-the-hash-of-referer
   3. compute client-side based on document.body, in which the publisher
   could also include an encoding of the headers
   4. somebody computes the fixmanifold of a hash function -- is that a
   thing? i can't math [1]

A way for the publisher to associate the this-is-my-hash request with the
Sec-CO request. Ideas:

   1. IP+timestamp
   2. a token as a query param to both requests
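
A rough sketch of the race described above, using option 2 (a shared
token) for the association. Everything here is hypothetical: the class
and method names are invented for illustration, SHA-256 stands in for
whatever hash the draft uses, and "the UA sends the double hash and
expects the single hash back" is my reading of the scheme rather than a
quote from the draft.

```python
import hashlib
import threading


class ColludingPublisher:
    """Toy stand-in for a purpose-built publisher server (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._preimages = {}  # token -> preimage reported by the page
        self._arrived = {}    # token -> Event set once the preimage shows up

    def _event_for(self, token):
        with self._lock:
            return self._arrived.setdefault(token, threading.Event())

    def this_is_my_hash(self, token, preimage):
        # The page (running inside the bundle) reports its own hash.
        with self._lock:
            self._preimages[token] = preimage
        self._event_for(token).set()

    def sec_co(self, token, double_hash, timeout=5.0):
        # Stall the Sec-CO response until the matching report arrives.
        if not self._event_for(token).wait(timeout):
            return None
        with self._lock:
            preimage = self._preimages[token]
        # Only answer if the reported value really is the preimage.
        if hashlib.sha256(preimage).digest() != double_hash:
            return None
        return preimage


def demo():
    content = b"<html>bundle body</html>"
    single = hashlib.sha256(content).digest()  # what the page learns
    double = hashlib.sha256(single).digest()   # assumed to be what the UA sends
    publisher = ColludingPublisher()
    result = {}
    ua = threading.Thread(
        target=lambda: result.update(answer=publisher.sec_co("tok-1", double))
    )
    ua.start()                                  # UA's Sec-CO request is held open
    publisher.this_is_my_hash("tok-1", single)  # page races ahead with its hash
    ua.join()
    return result["answer"] == single


if __name__ == "__main__":
    print(demo())
```

The publisher never needed to know the content ahead of time; it only
needed to hold one request open until the other arrived.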

This depends on the publisher's server having an ability that most don't
have, but it wouldn't be hard to purpose-build a server that only handled
this-is-my-hash and Sec-CO requests. By segregating its workload from the
other content requests, it would be easier to scale -- its need for
session-stickiness or regional storage wouldn't affect latency of the
edge-serving of normal content.

Overall, this still requires more effort than responding `Sec-CO: yes`, but
possibly less than a distributor/publisher backchannel. It may also keep
working when the UA hides the client IP, depending on which of the
association methods above is used.

I hope first that my above napkin sketch is wrong, and secondly that the
fix for this isn't deferral of all requests until post-transfer. My guess
is that would harm performance of the instant navigation use case.

One possible mitigation is deferral of all on-origin requests. Then there
needs to be a backchannel between the this-is-my-hash origin and the state
transfer origin. This is maybe vaguely isomorphic to the
distributor/publisher backchannel? Not really, because of subdomains: the
publisher could easily CNAME sec-co.publisher.example to
sec-co.distributor.example.

Actually, I wonder if CNAME is an issue regardless of everything above.
https://tools.ietf.org/html/draft-thomson-wpack-content-origin-00#section-5.1
states the UA must follow redirects. The publisher could easily encode a
redirect from publisher.example/foo to sec-co.publisher.example/foo, and
then CNAME to the distributor, and let the distributor respond with its
known preimage.
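
A toy model of the redirect-plus-CNAME chain, to make explicit who ends
up answering. The tables are hypothetical stand-ins for HTTP redirects
and DNS; nothing is fetched or resolved for real, and TLS (the
distributor would need a certificate for the publisher's subdomain) is
ignored.

```python
from urllib.parse import urlsplit

# Hypothetical redirect and DNS tables for the scenario above.
REDIRECTS = {
    "https://publisher.example/foo": "https://sec-co.publisher.example/foo",
}
CNAMES = {
    "sec-co.publisher.example": "sec-co.distributor.example",
}
OPERATORS = {
    "publisher.example": "publisher",
    "sec-co.distributor.example": "distributor",
}


def follow_redirects(url, limit=5):
    """Return the list of URLs visited, following at most `limit` redirects."""
    hops = [url]
    while url in REDIRECTS and len(hops) <= limit:
        url = REDIRECTS[url]
        hops.append(url)
    return hops


def resolve_cname(host):
    """Follow CNAME records until a host with no CNAME is reached."""
    seen = set()
    while host in CNAMES and host not in seen:
        seen.add(host)
        host = CNAMES[host]
    return host


def who_answers(url):
    """Return (host the UA sees in the URL, operator actually serving it)."""
    final_url = follow_redirects(url)[-1]
    visible_host = urlsplit(final_url).hostname
    serving_host = resolve_cname(visible_host)
    return visible_host, OPERATORS.get(serving_host, "unknown")


if __name__ == "__main__":
    print(who_answers("https://publisher.example/foo"))
```

The UA only ever sees publisher.example names in the URLs, while the
response comes from infrastructure the distributor runs.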

[1] "fixmanifold" being vaguely an extension of a fixpoint: a function f(x)
such that, for all x, hash(concat(x, f(x))) == f(x). Then the bundle could
encode its own hash.
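
To get a feel for footnote [1], here is a brute-force search for such a
self-referential suffix against a hash truncated to one byte. The
truncation, the function names, and the "bundle-N:" prefixes are all
assumptions made for illustration. For a given x a suitable suffix may
not exist at all, and the search cost grows roughly as 2^n for an n-bit
hash, so this says nothing encouraging about doing it with a full
256-bit digest.

```python
import hashlib


def truncated_hash(data: bytes, nbytes: int) -> bytes:
    """SHA-256 cut down to its first `nbytes` bytes (toy hash)."""
    return hashlib.sha256(data).digest()[:nbytes]


def find_self_hash(x: bytes, nbytes: int = 1):
    """Search every y of length nbytes for truncated_hash(x + y) == y.

    Returns the first such y, or None if this x has no solution.
    """
    for candidate in range(256 ** nbytes):
        y = candidate.to_bytes(nbytes, "big")
        if truncated_hash(x + y, nbytes) == y:
            return y
    return None


def first_self_hashing_bundle(count: int = 64, nbytes: int = 1):
    """Try several prefixes, since any single one may have no solution."""
    for i in range(count):
        x = f"bundle-{i}:".encode()
        y = find_self_hash(x, nbytes)
        if y is not None:
            return x, y
    return None


if __name__ == "__main__":
    print(first_self_hashing_bundle())
```

Note that this finds a suffix for some particular x, which is weaker
than the "for all x" function in the footnote.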

On Tue, Mar 24, 2020 at 11:35 PM Devin Mullins <twifkak@google.com> wrote:

> Hi Martin,
>
> Thanks for the engineering work on this. For others, I've already reviewed
> this, had some chats with others on the AMP team, and had a few
> back-and-forths with Martin (he's been very gracious in engaging his time,
> thank you). I'm trying to estimate the feasibility of this, both for the
> AMP use-case, and for the non-AMP use-case [1], which we hope to see
> flourish to the extent that publishers and users desire it, and only
> impeded by implementation costs where necessary for the common good.
>
> With the caveat that this is very early feedback, the high-level bit is
> that, with some modifications, it seems mostly feasible for AMP. However,
> there are two main downsides:
>   1. worse UX on sites that run A/B experiments
>   2. technical constraints make this harder for publishers to adopt on
> non-AMP, as compared to SXG
>
> A lot of details follow. I suggest renaming the subject if you reply to a
> detail below, in order to reduce cross-talk and make browsing the archive
> easier.
>
> Details:
>
> We would need the display URL to be the attested URL even before transfer.
> I suspect a flash in the URL bar would be frustrating for both users and
> publishers. Jeffrey had proposed using signatures for this. I'd suppose UAs
> could choose to render this with a "may be stale" indicator.
>
> In terms of publisher implementation feasibility, I think the most
> likely implementation would be for a publisher to produce a
> Sec-Content-Origin (Sec-CO) response for the current version of the
> resource iff it matches the Sec-CO request. I suspect that keeping state of
> past versions of the resource would be infeasible, especially as it
> couldn't be done purely at the edge; it would require something like a
> region-wide cache. The value of the above stateless implementation would be
> inversely proportional to how often the resource changes (e.g. "transient
> content" [2] such as Related Articles). I think this would be a workable
> constraint for publishers wishing to publish such bundles, but felt it's
> worth noting, nonetheless.
>
> Re: fallback behavior on failed state transfer, distributors would need a
> way to monitor failure rates. This serves two purposes:
>   1. Automatically detecting errors in the distribution pipeline. This is
> the same purpose served by
> https://wicg.github.io/webpackage/loading.html#signed-exchange-report.
>   2. Verifying that, for some fraction of navigations, bundles are meeting
> AMP's stated UX goals (by using the verified content in the bundle and not
> potentially arbitrary content after the redirect). This is a dynamic
> equivalent of what can be done mostly statically with SXG, because the
> distributor can run the same algorithm as the UA, modulo client/server skew
> in clocks and root stores. I think this is a trade-off AMP could make. The
> alternative is that the fallback behavior keeps the content-based origin
> (CBO).
>
> There is a possible vector for user ID transfer from distributor to
> publisher; filed as
> https://github.com/martinthomson/wpack-content/issues/1.
>
> This limits the ability to run session-sticky experiments (in
> https://amp.dev/documentation/components/amp-experiment/ and many non-AMP
> frameworks). The publisher has the following options:
>   1. Delay rendering content under experiment until after state transfer.
> This hurts UX for the sake of experimentation, creating a trade-off that
> otherwise doesn't exist on origin (modulo ease of implementation).
>   2. Generate a client-side UUID before state transfer, and join it with
> the pre-existing user ID after transfer. This means the session could not
> include CBO pageviews anywhere except in the first pageview.
>   3. Generate variants of the bundle for each experimental state. The
> distributor chooses which bundle to send to the user (based on what? not
> sure). For a page with M independent experiments each with N arms, that's
> N^M variants. For large enough M and N, this is likely both infeasible for
> publishers and martinthomson/wpack-content#1. (But I need some guidance
> from experts on typical ranges for M and N.)
>
> Perhaps this is a need the UA could address. A straw proposal -- a
> selectExperimentArm() API that:
>   1. only exposes as many bits as the UA deems OK, and
>   2. exposes different bits to different attested origins, so they can't
> be joined
> On the one hand, perhaps this is too much scope for wpack. On the other,
> it addresses a problem that is somewhat particular to CBOs.
>
> This impedes content management (e.g. dialogs for GDPR and CCPA), compared
> to SXG. The publisher has a few choices:
>   1. Delay rendering the dialog until after transfer. I think this would
> cause layout shift (https://web.dev/cls/).
>   2. Render the dialog, and risk discovering that it's already been
> acknowledged. At that point:
>     a. Hide it, causing layout shift.
>     b. Leave it there. This is way outside my domain expertise, but ISTR
> there are some rules or at least best practices wrt when it's okay to
> re-show an opt-out dialog.
> One of these options may be an okay trade-off; I'm not sure.
>
> Analytics providers (via
> https://amp.dev/documentation/components/amp-analytics/ and many non-AMP
> libraries) would have to do one of two things:
>   1. Delay pingback until after transfer. This risks a lower fraction of
> successful pingbacks, as e.g. the user might close the tab before transfer.
>   2. Generate a client-side UUID, send pingback before transfer, and send
> a follow-up after transfer to join with pre-existing user ID. This captures
> those otherwise lost pingbacks pseudonymously (anonymously?), though it
> requires server-side work on the pingback endpoint.
>
> For the above client-side mitigations, two feature requests seem necessary:
>   1. Store the transfer metadata (e.g. indexedDB renames by time)
> somewhere more permanent. For various reasons the event handler may not
> fire or finish, and it would be good to later scan for stragglers to merge
> or expunge.
>   2. DOM API to wait for transfer to complete.
>
> Some subresources are access-controlled by Origin header (e.g. paid fonts). Likely
> mitigation is to defer loading until transfer, with a timeout whose
> duration depends on the publisher's font-display preference. (Martin
> suggests an optimization where the encrypted subresource is loaded early,
> and only the decryption key after transfer.)
>
> Last but not least, I am concerned that this is a bar that is
> disproportionately difficult for non-AMP publishers to meet. The above
> client-side mitigations are possible, but quite a bit of work.
>   1. For frameworks like AMP that strongly encourage publishers to use a
> rolling release, it is possible to upgrade many use-cases in the wild with
> minimal publisher effort.
>   2. For other frameworks, it may be possible to provide this support, but
> the requirement to upgrade versions may limit applicability. Because
> upgrades occur less often, they often require more work.
>   3. For custom JS that manages state, authors will need to modify it to
> move or copy state into indexedDB, and handle transfer events and merge
> conflicts.
>
> From a "minimal publisher effort" perspective, I am especially interested
> in making it possible to have as close to a turnkey solution as is
> reasonable, for instance at the level of the CDN or CMS. If such a solution
> exists, sites would likely contain a hybrid of all three cases above, and
> thus want to opt into such support incrementally. How should a CDN
> determine which pages to provide Sec-CO responses for?
>   1. URL patterns are the simplest answer, but I suspect hard for
> publishers to create and maintain with minimal false negatives and
> positives.
>   2. Would it be helpful to thread an "I support state transfer"
> annotation for some portion of the build journey, from individual JS
> function all the way to bundle? Would folks use this?
>
> I don't have a good solution to the issue of non-AMP developer cost, and
> thus fear this is a solution that would feature a disproportionate amount
> of AMP pages compared to the background distribution of the web. But I'm
> hopeful that other JS framework developers can chime in with respect to
> feasibility of support for CBO and state transfer.
>
> I'm leaving out some of Martin's replies to the above; no offense
> intended. It took me long enough just to summarize the above, and I'd
> better send it early enough so at least a few people have time to read it
> before the meeting.
>
> Thanks,
> Devin
>
> [1]
> https://blog.amp.dev/2019/05/22/privacy-preserving-instant-loading-for-all-web-content/
> [2]
> http://www.seobythesea.com/2011/12/how-google-might-identify-transient-content-on-webpages/
>
> On Mon, Mar 23, 2020 at 5:35 PM Martin Thomson <mt@lowentropy.net> wrote:
>
>> Ted's note prompted me to send a much-belated announcement (sorry folks,
>> I forgot).
>>
>> The draft is here:
>> https://tools.ietf.org/html/draft-thomson-wpack-content-origin-00
>>
>> A nicer version here:
>>
>> https://martinthomson.github.io/wpack-content/draft-thomson-wpack-content-origin.html
>>
>> This could be a dramatically different approach to addressing the
>> use cases set out in our charter.
>>
>> In short, this aims to address the core question of how offline content
>> might *ultimately* be attributed to a web origin in a fundamentally
>> different way.  There are two key concepts:
>>
>> 1. Content is given its own origin, using a new system for identification.
>>
>> 2. A target origin can "accept" content and state from one of these new
>> origins.
>>
>> There are a lot of details here (read the draft), but the major advantage
>> I see is that you don't have to make an offline decision about authority,
>> and that means you can be offline for much longer (lifting the 7 day limit).
>>
>> What it does have in common with the signed exchanges approach is the need
>> for a bundling format, but in its current form it is less dependent on the
>> details of the format.  That might allow that to be simpler, but I'm sure
>> that the need to mint new identifier types will more than make up for any
>> slack there.
>>
>> The draft is quite rough.  I'm sure that it has the remnants of a few bad
>> ideas still hanging around.  Ask questions if you think something is
>> unclear.
>>