Forward proxies and CDN/mirrors

Jack Bates <jzej8k@nottheoilrig.com> Sat, 19 May 2012 07:54 UTC

Message-ID: <4FB75146.1060609@nottheoilrig.com>
Date: Sat, 19 May 2012 00:52:38 -0700
From: Jack Bates <jzej8k@nottheoilrig.com>
To: ietf-http-wg@w3.org
CC: Anthony Bryan <anthonybryan@gmail.com>, Leif Hedstrom <zwoop@apache.org>
References: <CANqTPeivxKNJD0pzyGWWeer-4fxKpKU_zAp+7WrheizukaEEGg@mail.gmail.com>
Subject: Forward proxies and CDN/mirrors
Archived-At: <http://www.w3.org/mid/4FB75146.1060609@nottheoilrig.com>

Hello, I am curious to know the current thinking on HTTP forward proxies 
and content distribution networks, or download mirrors. What techniques 
are used to help forward proxies and content distribution networks play 
well together? What facilities are available in the HTTP protocol for 
this? What resources are available from the broader community of 
standards and best practices?

The approach that I am currently pursuing is to use RFC 6249,
Metalink/HTTP: Mirrors and Hashes. For content distribution networks
that support it, our forward proxy watches for HTTP redirect responses
that carry "Link: <...>; rel=duplicate" headers. If the URL in the
"Location: ..." header is not already cached, we scan the
"Link: <...>; rel=duplicate" headers for a URL that is already cached
and, if we find one, rewrite the "Location: ..." header with that URL.
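
A minimal sketch of that rewrite rule, in Python for brevity rather than
as Traffic Server plugin code. The names rewrite_location,
duplicate_urls and is_cached are hypothetical, the set of redirect
status codes is my assumption, and the Link parsing is deliberately
simplified (it assumes parameter values contain no commas):

```python
import re

# One link-value: <URL> followed by its ";param" list.
# Simplification: parameter values are assumed to contain no commas.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')
REL_PARAM = re.compile(r';\s*rel\s*=\s*"?([^";,]+)"?', re.IGNORECASE)

# Assumption: which status codes count as "an HTTP redirect" here.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def duplicate_urls(link_headers):
    """Yield URLs from Link header values whose rel includes 'duplicate'."""
    for header in link_headers:
        for url, params in LINK_VALUE.findall(header):
            match = REL_PARAM.search(params)
            if match and "duplicate" in match.group(1).lower().split():
                yield url


def rewrite_location(status, location, link_headers, is_cached):
    """Return the URL the proxy should send in the Location header.

    is_cached is a hypothetical callable standing in for a cache lookup.
    """
    if status not in REDIRECT_STATUSES or location is None:
        return location
    if is_cached(location):
        return location  # the original target is already cached
    for url in duplicate_urls(link_headers):
        if is_cached(url):
            return url  # first cached duplicate wins
    return location  # nothing cached, leave the redirect alone
```

For example, with a cache modelled as a plain set of URLs, calling
rewrite_location(302, location, link_headers, cached.__contains__)
returns the first cached duplicate, or the original Location when none
of the listed mirrors is cached.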

I would be very grateful for any feedback on this approach. What are the 
problems with this strategy? What are the alternatives? How does it 
relate to the letter or spirit of web architecture?

We are also thinking of using RFC 3230, Instance Digests in HTTP. Our
proxy would watch for HTTP redirect responses that carry "Digest: ..."
headers. If the URL in the "Location: ..." header were not already
cached, we would check whether other content with the same digest is
already cached. If so, we would rewrite the "Location: ..." header
with the corresponding URL.
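
The digest variant could be sketched along the same lines. This is only
an illustration under assumptions: parse_digest,
rewrite_location_by_digest, is_cached and digest_index are hypothetical
names, and digest_index stands for a mapping the proxy would have to
maintain itself, from (algorithm, encoded digest) to the URL of a
cached copy:

```python
def parse_digest(header):
    """Split a Digest header value into (algorithm, encoded_digest) pairs.

    Algorithm names are lowercased because RFC 3230 treats them as
    case-insensitive. Only the first "=" separates the two parts, so
    base64 padding in the digest value is preserved.
    """
    digests = []
    for item in header.split(","):
        algorithm, sep, value = item.strip().partition("=")
        if sep and value:
            digests.append((algorithm.strip().lower(), value.strip()))
    return digests


def rewrite_location_by_digest(location, digest_header, is_cached, digest_index):
    """Return the URL the proxy should send in the Location header.

    digest_index is a hypothetical dict mapping (algorithm, digest)
    to the URL of content already in the cache.
    """
    if is_cached(location):
        return location  # the original target is already cached
    for key in parse_digest(digest_header):
        cached_url = digest_index.get(key)
        if cached_url is not None:
            return cached_url  # same content is cached under another URL
    return location  # no cached content matches, leave the redirect alone
```

Keying the index on the algorithm as well as the digest value avoids
treating, say, an MD5 and a SHA-256 value as interchangeable.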

The issue of forward proxies and content distribution networks is
important to us because we run a caching proxy in a rural village in
Rwanda. Many web sites that distribute files present users with a
simple download button that redirects to a download mirror, but they do
not predictably redirect to the same mirror, or to a mirror that we
have already cached. As a result, users can't predict whether a
download will take seconds or hours, which is frustrating.

Here is a proof-of-concept plugin [1] for Apache Traffic Server, the
open source caching proxy. It works just well enough that, given a
response whose "Location: ..." URL is not already cached and whose
"Link: <...>; rel=duplicate" URL is, it replaces the URL in the
"Location: ..." header with the cached URL.

I am working on this as part of the Google Summer of Code.

[1] https://github.com/jablko/dedup