[Asrg] The difference between spam corpora

Martijn Grooten <martijn.grooten@virusbtn.com> Mon, 07 September 2009 16:05 UTC

From: Martijn Grooten <martijn.grooten@virusbtn.com>
To: Anti-Spam Research Group - IRTF <asrg@irtf.org>
Date: Mon, 7 Sep 2009 17:05:33 +0100
Subject: [Asrg] The difference between spam corpora


I have been wondering whether any research has been done on the differences between (kinds of) spam corpora*; I believe this is the right place to ask. (Oh, and hello: I am fairly new here too, a lurker for quite some time, but not sure if I've posted before.)

* Throughout this email, by "corpus" I mean all the emails in a live mail stream, used in real time.

To test a spam filter or an anti-spam method, or to do research on spam, one inevitably needs a spam corpus. As the spam sent to one email address, or even one corporation, is unlikely to be representative of all the spam sent globally during that period, most people add the spam sent to one or more spam traps to their test. There is nothing wrong with this approach, but, at least in theory, a lot of spam will never end up in such traps: mailings sent by dodgy ESPs; spam sent to addresses harvested from Outlook address books; spam sent to addresses obtained by hacking a company's customer database (or, perhaps more likely here in the UK, spam sent to addresses from a CD-ROM found on a train).

I am not sure how large a proportion of spam is of this latter kind, but I think it would be interesting to find out. Over the past months I have sent both our corporate mail stream and the spam from a distributed spam trap through a number of spam filters, and the difference in performance was striking: many products let through ten or more times as much corporate spam as spam trap spam. Ease of filtering is just one way of quantifying a difference between spam corpora, but these results have led me to believe that spam traps, extremely useful as they are, don't show the full picture.
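
For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch of how the per-corpus figures might be computed. It assumes the comparison is made on miss rates (missed spam divided by total spam in each corpus) rather than on raw counts, so that corpora of very different sizes can be compared. The names `FilterRun` and `miss_rate_ratio` are hypothetical, and the counts are invented placeholders, not results from the tests described above.

```python
import math
from dataclasses import dataclass


@dataclass
class FilterRun:
    """Outcome of passing one spam corpus through one filter."""
    corpus: str
    spam_total: int   # number of spam messages in the corpus
    spam_missed: int  # number of those the filter let through

    @property
    def miss_rate(self) -> float:
        """Fraction of the corpus's spam that the filter missed."""
        if self.spam_total <= 0:
            raise ValueError("corpus contains no spam")
        return self.spam_missed / self.spam_total


def miss_rate_ratio(a: FilterRun, b: FilterRun) -> float:
    """How many times higher the miss rate is on corpus a than on corpus b."""
    if b.miss_rate == 0:
        # Nothing missed on b, so the ratio is unbounded.
        return math.inf
    return a.miss_rate / b.miss_rate


# Placeholder counts, invented purely for illustration.
corporate = FilterRun("corporate", spam_total=2_000, spam_missed=100)
trap = FilterRun("spam trap", spam_total=50_000, spam_missed=250)

for run in (corporate, trap):
    print(f"{run.corpus}: {run.miss_rate:.2%} of spam missed")
print(f"ratio: {miss_rate_ratio(corporate, trap):.1f}x")
```

Normalising by corpus size matters here because a spam trap will typically receive far more spam than a single corporate stream, so raw counts of missed messages would not be comparable.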


Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.