Re: [Wpack] On double-hashing (was: Re: About content-based origins)

Augh, I forgot that 30x responses change the attested URL. Ignore my
comments about CNAME, I suppose.

On Wed, Mar 25, 2020 at 4:03 PM Devin Mullins <twifkak@google.com> wrote:

> In light of my comment about racing the Sec-CO request, a colleague
> pointed out a potential hole in the double-hash scheme. There may be flaws
> in what follows; I haven't thought about it too much. <trite tone=apologetic>It
> may be worth more clearly defining the threat model.</trite>
>
> The publisher could delay responding to the Sec-CO request. In the
> interim, the page could send its own hash to
> //publisher.example/this-is-my-hash. Then the publisher would respond with
> the preimage sent by the bundle. This requires:
>
> A way for the bundle to get its own hash. Ideas:
>
>    1. document.location -- we could fix the ni scheme to be based on the
>    double-hash
>    2. a fetch of distributor.example/what-is-the-hash-of-referer
>    3. compute client-side based on document.body, in which the publisher
>    could also include an encoding of the headers
>    4. somebody computes the fixmanifold of a hash function -- is that a
>    thing? i can't math [1]
>
> A way for the publisher to associate the this-is-my-hash request with the
> Sec-CO request. Ideas:
>
>    1. IP+timestamp
>    2. a token as a query param to both requests
>
> This depends on the publisher server having an ability that most don't
> have, but it wouldn't be hard to purpose-build a server that only handled
> this-is-my-hash and Sec-CO requests. By segregating its workload from the
> other content requests, it would be easier to scale -- its need for
> session-stickiness or regional storage wouldn't affect latency of the
> edge-serving of normal content.
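
A rough sketch of the race described above, as a toy in-memory model. It
assumes the content origin is identified by SHA-256(SHA-256(bundle)) and that
a Sec-CO response must carry the single-hash preimage; the draft's actual
construction may differ. The class name, method names, and the token "t1" are
hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Single SHA-256 hash (assumed stand-in for the draft's hash)."""
    return hashlib.sha256(data).digest()


class PurposeBuiltServer:
    """Toy publisher endpoint that only handles this-is-my-hash and Sec-CO.

    It never sees the bundle. It stalls Sec-CO until the page, running
    inside the bundle, reports the preimage under the same token.
    """

    def __init__(self) -> None:
        self.preimages = {}  # token -> preimage reported by the page

    def this_is_my_hash(self, token: str, preimage: bytes) -> None:
        self.preimages[token] = preimage

    def sec_co(self, token: str, double_hash: bytes):
        preimage = self.preimages.get(token)
        if preimage is None:
            return None  # nothing reported yet: keep delaying the response
        # Answer only if the reported value really is the preimage.
        return preimage if h(preimage) == double_hash else None


bundle = b"bundle bytes the publisher has never seen"
single = h(bundle)   # what the page learns about itself
double = h(single)   # what the Sec-CO request carries (assumption)

server = PurposeBuiltServer()
first_try = server.sec_co("t1", double)    # arrives first, gets stalled
server.this_is_my_hash("t1", single)       # page reports its own hash
second_try = server.sec_co("t1", double)   # now answerable
```

Here the token plays the role of option 2 in the second list; IP+timestamp
would replace the dictionary key.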
>
> Overall, this still requires more effort than responding `Sec-CO: yes`,
> but possibly less than a distributor/publisher backchannel. Also, possibly
> immune to IP privacy, depending on the association method chosen above.
>
> I hope first that my above napkin sketch is wrong, and secondly that the
> fix for this isn't deferral of all requests until post-transfer. My guess
> is that would harm performance of the instant navigation use case.
>
> One possible mitigation is deferral of all on-origin requests. Then there
> needs to be a backchannel between the this-is-my-hash origin and the state
> transfer origin. This is maybe vaguely isomorphic to the
> distributor/publisher backchannel? Not really, since subdomains. The
> publisher could easily CNAME sec-co.publisher.example to
> sec-co.distributor.example.
>
> Actually, I wonder if CNAME is an issue regardless of everything above.
> https://tools.ietf.org/html/draft-thomson-wpack-content-origin-00#section-5.1
> states the UA must follow redirects. The publisher could easily encode a
> redirect from publisher.example/foo to sec-co.publisher.example/foo, and
> then CNAME to the distributor, and let the distributor respond with its
> known preimage.
>
> [1] "fixmanifold" being vaguely an extension of a fixpoint. A function
> f(x) s.t. for all x, hash(concat(x, f(x))) == f(x). Then the bundle could
> encode its own hash.
>
> On Tue, Mar 24, 2020 at 11:35 PM Devin Mullins <twifkak@google.com> wrote:
>
>> Hi Martin,
>>
>> Thanks for the engineering work on this. For others, I've already
>> reviewed this, had some chats with others on the AMP team, and had a few
>> back-and-forths with Martin (he's been very gracious in engaging his time,
>> thank you). I'm trying to estimate the feasibility of this, both for the
>> AMP use-case, and for the non-AMP use-case [1], which we hope to see
>> flourish to the extent that publishers and users desire it, and only
>> impeded by implementation costs where necessary for the common good.
>>
>> With the caveat that this is very early feedback, the high-level bit is
>> that, with some modifications, it seems mostly feasible for AMP. However,
>> there are two main downsides:
>>   1. worse UX on sites that run A/B experiments
>>   2. technical constraints make this harder for publishers to adopt on
>> non-AMP, as compared to SXG
>>
>> A lot of details follow. I suggest renaming the subject if you reply to
>> a detail below, in order to reduce cross-talk and make browsing the archive
>> easier.
>>
>> Details:
>>
>> We would need the display URL to be the attested URL even before
>> transfer. I suspect a flash in the URL bar would be frustrating for both
>> users and publishers. Jeffrey had proposed using signatures for this. I'd
>> suppose UAs could choose to render this with a "may be stale" indicator.
>>
>> In terms of publisher implementation feasibility, I think the most
>> likely implementation would be for a publisher to produce a
>> Sec-Content-Origin (Sec-CO) response for the current version of the
>> resource iff it matches the Sec-CO request. I suspect that keeping state of
>> past versions of the resource would be infeasible, especially as it
>> couldn't be done purely at the edge; it would require something like a
>> region-wide cache. The value of the above stateless implementation would be
>> inversely proportional to how often the resource changes (e.g. "transient
>> content" [2] such as Related Articles). I think this would be a workable
>> constraint for publishers wishing to publish such bundles, but I felt it was
>> worth noting, nonetheless.
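
A minimal sketch of the stateless check described above, assuming a plain
SHA-256 over the response body stands in for whatever digest the Sec-CO
request actually carries. The function names and sample bodies are
hypothetical.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Assumed stand-in for the digest carried in a Sec-CO request."""
    return hashlib.sha256(body).hexdigest()


def stateless_sec_co_matches(current_body: bytes, requested_digest: str) -> bool:
    """Accept iff the *current* version of the resource matches the request.

    No history of past versions is kept, so this can run at the edge.
    """
    return content_digest(current_body) == requested_digest


bundled = b"<article>story</article><aside>Related: A, B</aside>"
live_unchanged = bundled
# Only the transient "Related Articles" block differs.
live_changed = b"<article>story</article><aside>Related: C, D</aside>"

requested = content_digest(bundled)
accepted_when_unchanged = stateless_sec_co_matches(live_unchanged, requested)
accepted_after_change = stateless_sec_co_matches(live_changed, requested)
```

Any change to the live resource, however small, makes the second check fail,
which is why the value of this approach drops as the resource changes more
often.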
>>
>> Re: fallback behavior on failed state transfer, distributors would need a
>> way to monitor failure rates. This serves two purposes:
>>   1. Automatically detecting errors in the distribution pipeline. This is
>> the same purpose served by
>> https://wicg.github.io/webpackage/loading.html#signed-exchange-report.
>>   2. Verifying that, for some fraction of navigations, bundles are
>> meeting AMP's stated UX goals (by using the verified content in the bundle
>> and not potentially arbitrary content after the redirect). This is a
>> dynamic equivalent of what can be done mostly statically with SXG, because
>> the distributor can run the same algorithm as the UA, modulo client/server
>> skew in clocks and root stores. I think this is a trade-off AMP could make.
>> The alternative is that the fallback behavior keeps the content-based
>> origin (CBO).
>>
>> There is a possible vector for user ID transfer from distributor to
>> publisher; filed as
>> https://github.com/martinthomson/wpack-content/issues/1.
>>
>> This limits the ability to run session-sticky experiments (in
>> https://amp.dev/documentation/components/amp-experiment/ and many
>> non-AMP frameworks). The publisher has the following options:
>>   1. Delay rendering content under experiment until after state transfer.
>> This hurts UX for the sake of experimentation, creating a trade-off that
>> otherwise doesn't exist on origin (modulo ease of implementation).
>>   2. Generate a client-side UUID before state transfer, and join it with
>> the pre-existing user ID after transfer. This means the session could not
>> include CBO pageviews anywhere except in the first pageview.
>>   3. Generate variants of the bundle for each experimental state. The
>> distributor chooses which bundle to send to the user (based on what? not
>> sure). For a page with M independent experiments each with N arms, that's
>> N^M variants. For large enough M and N, this is likely both infeasible for
>> publishers and a vector for martinthomson/wpack-content#1. (But I need some guidance
>> from experts on typical ranges for M and N.)
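
To put numbers on option 3, here is a small worked calculation. The (N, M)
pairs are illustrative only, not claimed to be typical. The bit count is the
upper bound on what the distributor's choice of variant could encode about a
user, which is the concern in wpack-content#1.

```python
import math


def variant_count(arms: int, experiments: int) -> int:
    """Bundle variants for M independent experiments with N arms each: N^M."""
    return arms ** experiments


def identifying_bits(variants: int) -> float:
    """Upper bound on bits conveyed by choosing one of `variants` bundles."""
    return math.log2(variants)


# Illustrative (N, M) pairs; real-world ranges are an open question.
for n, m in [(2, 3), (2, 10), (3, 5), (4, 10)]:
    v = variant_count(n, m)
    print(f"N={n} M={m}: {v} variants, up to {identifying_bits(v):.1f} bits")
```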
>>
>> Perhaps this is a need the UA could address. A straw proposal -- a
>> selectExperimentArm() API that:
>>   1. only exposes as many bits as the UA deems OK, and
>>   2. exposes different bits to different attested origins, so they can't
>> be joined
>> On the one hand, perhaps this is too much scope for wpack. On the other,
>> it addresses a problem that is somewhat particular to CBOs.
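
One way the straw proposal could behave, sketched in Python rather than as a
real DOM API. Everything here is an assumption for illustration: the UA holds
a per-profile secret, derives bits with HMAC-SHA-256 keyed by that secret over
the attested origin, and caps the output at max_bits. No UA implements this.

```python
import hashlib
import hmac


def experiment_bits(ua_secret: bytes, attested_origin: str, max_bits: int) -> int:
    """Derive at most `max_bits` stable bits for one attested origin.

    Different origins get independently derived values, so two origins
    cannot join users by comparing their bits.
    """
    if not 1 <= max_bits <= 32:
        raise ValueError("max_bits must be between 1 and 32")
    mac = hmac.new(ua_secret, attested_origin.encode("utf-8"), hashlib.sha256).digest()
    # Keep only the top `max_bits` bits of the first 32 bits of the MAC.
    return int.from_bytes(mac[:4], "big") >> (32 - max_bits)


def select_experiment_arm(ua_secret: bytes, attested_origin: str,
                          arms: int, max_bits: int = 3) -> int:
    """Hypothetical selectExperimentArm(): pick an arm in [0, arms)."""
    if arms < 1 or arms > 2 ** max_bits:
        raise ValueError("more arms requested than the UA's bit budget allows")
    # Note: modulo is slightly biased when `arms` is not a power of two.
    return experiment_bits(ua_secret, attested_origin, max_bits) % arms


secret = b"per-profile secret held by the UA (hypothetical)"
arm_a = select_experiment_arm(secret, "https://publisher.example", arms=4)
arm_b = select_experiment_arm(secret, "https://other.example", arms=4)
```

The arm is session-sticky because it is a pure function of the secret and the
origin, so no state needs to survive the transfer.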
>>
>> This impedes consent management (e.g. dialogs for GDPR and CCPA),
>> compared to SXG. The publisher has a few choices:
>>   1. Delay rendering the dialog until after transfer. I think this would
>> cause layout shift (https://web.dev/cls/).
>>   2. Render the dialog, and risk discovering that it's already been
>> acknowledged. At that point:
>>     a. Hide it, causing layout shift.
>>     b. Leave it there. This is way outside my domain expertise, but ISTR
>> there are some rules or at least best practices wrt when it's okay to
>> re-show an opt-out dialog.
>> One of these options may be an okay trade-off; I'm not sure.
>>
>> Analytics providers (via
>> https://amp.dev/documentation/components/amp-analytics/ and many non-AMP
>> libraries) would have to do one of two things:
>>   1. Delay pingback until after transfer. This risks a lower fraction of
>> successful pingbacks, as e.g. the user might close the tab before transfer.
>>   2. Generate a client-side UUID, send pingback before transfer, and send
>> a follow-up after transfer to join with pre-existing user ID. This captures
>> those otherwise lost pingbacks pseudonymously (anonymously?), though it
>> requires server-side work on the pingback endpoint.
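
A sketch of the server-side work option 2 implies, as a toy in-memory
pingback store. The class, method names, and event strings are hypothetical;
a real endpoint would need persistent storage and expiry for temporary IDs
that never get joined.

```python
import uuid


class PingbackStore:
    """Toy pingback endpoint that can join a temporary ID to a user ID later."""

    def __init__(self) -> None:
        self.events = []    # list of (client_id, event)
        self.aliases = {}   # temporary client-side UUID -> pre-existing user ID

    def record(self, client_id: str, event: str) -> None:
        self.events.append((client_id, event))

    def join(self, temp_id: str, user_id: str) -> None:
        """Follow-up sent after state transfer completes."""
        self.aliases[temp_id] = user_id

    def events_for(self, user_id: str) -> list:
        return [event for client_id, event in self.events
                if self.aliases.get(client_id, client_id) == user_id]


store = PingbackStore()
temp_id = str(uuid.uuid4())             # generated client-side before transfer
store.record(temp_id, "pageview /foo")  # pingback sent before transfer
before_join = store.events_for("user-123")
store.join(temp_id, "user-123")         # follow-up after transfer
store.record("user-123", "click #cta")
after_join = store.events_for("user-123")
```

If the tab closes before the follow-up, the pre-transfer pingback stays
recorded under the temporary ID instead of being lost.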
>>
>> For the above client-side mitigations, two feature requests seem
>> necessary:
>>   1. Store the transfer metadata (e.g. indexedDB renames by time)
>> somewhere more permanent. For various reasons the event handler may not
>> fire or finish, and it would be good to later scan for stragglers to merge
>> or expunge.
>>   2. DOM API to wait for transfer to complete.
>>
>> Some subresources are ACL'd by Origin header (e.g. paid fonts). Likely
>> mitigation is to defer loading until transfer, with a timeout whose
>> duration depends on the publisher's font-display preference. (Martin
>> suggests an optimization where the encrypted subresource is loaded early,
>> and only the decryption key after transfer.)
>>
>> Last but not least, I am concerned that this is a bar that is
>> disproportionately difficult for non-AMP publishers to meet. The above
>> client-side mitigations are possible, but quite a bit of work.
>>   1. For frameworks like AMP that strongly encourage publishers to use a
>> rolling release, it is possible to upgrade many use-cases in the wild with
>> minimal publisher effort.
>>   2. For other frameworks, it may be possible to provide this support,
>> but the requirement to upgrade versions may limit applicability. Because
>> upgrades occur less often, they often require more work.
>>   3. For custom JS that manages state, authors will need to modify it to
>> move or copy state into indexedDB, and handle transfer events and merge
>> conflicts.
>>
>> From a "minimal publisher effort" perspective, I am especially interested
>> in making it possible to have as close to a turnkey solution as is
>> reasonable, for instance at the level of the CDN or CMS. If such a solution
>> exists, sites would likely contain a hybrid of all three cases above, and
>> thus want to opt into such support incrementally. How should a CDN
>> determine which pages to provide Sec-CO responses for?
>>   1. URL patterns are the simplest answer, but I suspect hard for
>> publishers to create and maintain with minimal false negatives and
>> positives.
>>   2. Would it be helpful to thread an "I support state transfer"
>> annotation for some portion of the build journey, from individual JS
>> function all the way to bundle? Would folks use this?
>>
>> I don't have a good solution to the issue of non-AMP developer cost, and
>> thus fear this is a solution that would feature a disproportionate amount
>> of AMP pages compared to the background distribution of the web. But I'm
>> hopeful that other JS framework developers can chime in with respect to
>> feasibility of support for CBO and state transfer.
>>
>> I'm leaving out some of Martin's replies to the above; no offense
>> intended. It took me long enough just to summarize the above, and I'd
>> better send it early enough so at least a few people have time to read it
>> before the meeting.
>>
>> Thanks,
>> Devin
>>
>> [1]
>> https://blog.amp.dev/2019/05/22/privacy-preserving-instant-loading-for-all-web-content/
>> [2]
>> http://www.seobythesea.com/2011/12/how-google-might-identify-transient-content-on-webpages/
>>
>> On Mon, Mar 23, 2020 at 5:35 PM Martin Thomson <mt@lowentropy.net> wrote:
>>
>>> Ted's note prompted me to send a much-belated announcement (sorry folks,
>>> I forgot).
>>>
>>> The draft is here:
>>> https://tools.ietf.org/html/draft-thomson-wpack-content-origin-00
>>>
>>> A nicer version here:
>>>
>>> https://martinthomson.github.io/wpack-content/draft-thomson-wpack-content-origin.html
>>>
>>> This could be a dramatically different approach to addressing the
>>> use cases set out in our charter.
>>>
>>> In short, this aims to address the core question of how offline content
>>> might *ultimately* be attributed to a web origin in a fundamentally
>>> different way.  There are two key concepts:
>>>
>>> 1. Content is given its own origin, using a new system for
>>> identification.
>>>
>>> 2. A target origin can "accept" content and state from one of these new
>>> origins.
>>>
>>> There are a lot of details here (read the draft), but the major
>>> advantage I see is that you don't have to make an offline decision about
>>> authority, and that means you can be offline for much longer (lifting the 7
>>> day limit).
>>>
>>> What it does have in common with the signed exchanges approach is the need
>>> for a bundling format, but in its current form it is less dependent on the
>>> details of the format.  That might allow that to be simpler, but I'm sure
>>> that the need to mint new identifier types will more than make up for any
>>> slack there.
>>>
>>> The draft is quite rough.  I'm sure that it has the remnants of a few
>>> bad ideas still hanging around.  Ask questions if you think something is
>>> unclear.