Re: FYI: Tools to evaluate header compression algorithms

Mark Nottingham <mnot@mnot.net> Mon, 07 January 2013 00:39 UTC

From: Mark Nottingham <mnot@mnot.net>
Date: Mon, 07 Jan 2013 11:37:20 +1100
Cc: Roberto Peon <grmocg@gmail.com>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>
To: Ilya Grigorik <ilya@igvita.com>
Subject: Re: FYI: Tools to evaluate header compression algorithms
Archived-At: <http://www.w3.org/mid/04188DEF-7671-4434-96BA-C64D9A1BEA1D@mnot.net>

On 07/01/2013, at 7:34 AM, Ilya Grigorik <ilya@igvita.com> wrote:

> On Sun, Jan 6, 2013 at 1:55 AM, Roberto Peon <grmocg@gmail.com> wrote:
>> Do you have some suggestions Martin?
>> The obvious thing in my mind is to get submissions from site owners, but that takes interest on their part first. :/
> 
> HTTP Archive is now scanning ~300K top domains (at least according to
> Alexa). While it's still "top site" biased, I think that's a pretty good
> sample to work with. I believe we should be able to get the HAR files from
> it.

That would be one good source, although it only covers the "top" page of each site. If someone wants to own talking to Steve and getting the HARs into suitable shape for a pull request, that'd be much appreciated.
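
For anyone picking that up, here is a rough sketch of pulling request headers out of a HAR file, which is the part a header compression test would care about. It relies only on the standard HAR layout (log.entries[].request.headers as a list of name/value objects); the function name and the sample data are made up for illustration and are not part of any existing tool.

```python
import json

# Hypothetical sample: a minimal HAR document with one request.
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {
                "request": {
                    "method": "GET",
                    "url": "http://example.com/",
                    "headers": [
                        {"name": "Host", "value": "example.com"},
                        {"name": "Accept", "value": "*/*"},
                    ],
                }
            }
        ],
    }
})


def request_headers(har_text):
    """Yield (url, [(name, value), ...]) for each entry in a HAR document."""
    har = json.loads(har_text)
    for entry in har["log"]["entries"]:
        request = entry["request"]
        # Tolerate entries that carry no "headers" list at all.
        headers = [(h["name"], h["value"]) for h in request.get("headers", [])]
        yield request["url"], headers


for url, headers in request_headers(SAMPLE_HAR):
    print(url, headers)
```

Response headers could be read the same way from entry["response"]["headers"].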

I have a set of about 17 million links (to about 2 million distinct sites) that's more representative; it's sourced from a Wikipedia dump. Generating single-pageview HARs from them should be pretty straightforward, and it looks like getting a HAR from multiple navigations using PhantomJS is very doable: <https://github.com/ariya/phantomjs/wiki/Network-Monitoring>.
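
As a sketch of a first step for single-pageview HARs, the link list could be reduced to one link per site before fetching. This assumes a "site" means a distinct hostname, which is a simplification on my part, and the function name and sample links are hypothetical.

```python
from urllib.parse import urlsplit


def one_link_per_site(links):
    """Return the first link seen for each distinct hostname, in input order."""
    first_by_host = {}
    for link in links:
        host = urlsplit(link).hostname  # lowercased by urlsplit; None if absent
        if host is None:
            continue  # skip links without a host, e.g. mailto: URIs
        first_by_host.setdefault(host, link)
    return list(first_by_host.values())


# Hypothetical sample input.
links = [
    "http://example.com/a",
    "http://example.com/b",
    "https://Example.org/x",
    "mailto:foo@example.com",
]
print(one_link_per_site(links))
```

Each surviving link would then be handed to whatever produces the HAR (PhantomJS or otherwise).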

Cheers,


--
Mark Nottingham   http://www.mnot.net/