Re: [Asrg] "Uncaught spam" research project

Martijn Grooten <martijn.grooten@virusbtn.com> Tue, 04 May 2010 12:15 UTC

From: Martijn Grooten <martijn.grooten@virusbtn.com>
To: "asrg@irtf.org" <asrg@irtf.org>
Date: Tue, 04 May 2010 13:15:06 +0100
Thread-Topic: [Asrg] "Uncaught spam" research project
Thread-Index: AcrolsLiIFMyYmpJR6iIX1LBqFTycwC6ZLSg
Message-ID: <18B53BA2A483AD45962AAD1397BE132537A28C4EDF@UK-EXCHMBX1.green.sophos>
References: <18B53BA2A483AD45962AAD1397BE1325379ED80C30@UK-EXCHMBX1.green.sophos> <4BDB279A.1050606@billmail.scconsult.com>
In-Reply-To: <4BDB279A.1050606@billmail.scconsult.com>
Subject: Re: [Asrg] "Uncaught spam" research project

Bill Cole wrote:
> That skews your sample quite a bit. A significant fraction of the hard cases
> these days are CAN-SPAM compliant campaigns sent by well-meaning originators
> using mostly-whitehat ESP's. The vast majority of spam is nothing like that,
> but the vast majority is also fairly easy to shun for all but the largest
> and cheapest providers. The spam that ends up in a real user's normal Inbox
> (where it is most annoying to them) is unlikely to ever hit a trap address.
> On the other side, traps vary in provenance so there is a lot of variation
> in what they get but as a direct corollary of the fact that you can call an
> address a trap, the mail it is sent will be coming from less careful and
> clueful spammers. That tilt won't be the same for every sort of trap, but it
> will be there for all traps to some degree.
>
> That does not make your research pointless, but it is important to
> understand the skew and the subjective "cost" of different sorts of uncaught
> spam in applying whatever you find.

Oh, I absolutely agree. There are many reasons why I want to concentrate on spam-trap spam, perhaps the most important being that for these messages it is much easier to decide whether they are both spam and unwanted: they are both, more or less by definition. But I am well aware that this covers only part of the spam, and a part that is already easy to filter (though, arguably, also a part where failing to filter, especially in the case of phishing, can be more dangerous). I might run the same project on a different spam corpus at some point in the future.

> Again, that is an issue that speaks primarily to interpretation and
> application of your results, rather than to the whole plan. If you don't
> have a skilled SA wrangler handy you cannot test the results of a customized
> and tuned SA deployment, and that is one of the strong arguments against
> using straight SA in the real world.

Good point. For this very reason I'm not testing SA as part of the comparative test; that just wouldn't be fair on it.

> > [3] I would love to include DKIM, but I can only distinguish between does
> > have and does not have a DKIM-signature; the redacting of emails to hide
> > the original recipient makes me unable to decide whether a present
> > signature was actually valid.
>
> Probably not a big loss in itself, as DKIM correlation with spamminess is
> very sensitive to the sort of mailstream one has, and in complex ways.

Actually, (almost) no spam-trap spam I receive has a DKIM signature, valid or not.
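
To illustrate the "does have / does not have" distinction mentioned above, here is a minimal sketch of a presence check using Python's standard email module. It only looks for a DKIM-Signature header and makes no attempt to verify it. The function name and both sample messages are made up for illustration and are not part of the actual test setup.

```python
from email import message_from_string


def has_dkim_signature(raw_message: str) -> bool:
    """Return True if the message carries at least one DKIM-Signature header.

    Presence only: the signature itself is never verified here.
    """
    msg = message_from_string(raw_message)
    # get_all() returns None when the header is absent; lookup is case-insensitive.
    return msg.get_all("DKIM-Signature") is not None


# Hypothetical sample messages, invented for this sketch.
signed = (
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel; bh=abc; b=def\n"
    "From: sender@example.com\n"
    "To: trap@example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
unsigned = (
    "From: sender@example.com\n"
    "To: trap@example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

signed_has_header = has_dkim_signature(signed)
unsigned_has_header = has_dkim_signature(unsigned)
```

Counting how often this returns True over a corpus would give the kind of "hardly any signatures" figure described here, without saying anything about validity.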

> HOWEVER, this raises a very serious design pitfall. You need to make sure if
> you plan to feed redacted messages to filters that they can be made to
> ignore whatever redaction-spoor is in those messages. The simplest example
> is the one you give: broken signatures. A valid signature may correlate very
> poorly to validity of the mail while a bad signature may correlate quite
> well to invalidity. As a general rule, redaction for the purpose of hiding
> individual identity naturally tends to make all of the redacted messages a
> little bit more like each other in ways that may be obvious or may be
> subtle, and many filters are designed to look for patterns of similarity as
> evidence of spam.
>
> Ultimately I think it is so hard to be sure that you are avoiding
> significant effects from redaction that researching filters with redacted
> inputs is a total waste of time. I don't really understand the point of
> redaction for this anyway, since the addresses are traps. The case for
> recipient redaction may be plausible when spam is being reported in public
> or to untrusted parties, i.e. to protect individual privacy or the effects
> of disclosing a trap address. For filter testing with trap spam, the only
> risk of disclosure would be if a filter works in some collaborative fashion
> akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful
> disclosure of your trap addresses, then you have an intractable problem. The
> GIGO principle applies, and any modification of your inputs from what a
> filter would see in the real world makes your inputs garbage.

I agree with you in theory. And I know one should be wary of assuming things about spammers' behaviour.

However, all that happens when redacting the messages is something along the lines of

s/spamtrapdomain/mydomain/g

and

s/spamtrap-local-part/my-local-part/g

Apart from signatures -- which, in practice, turn out to be hardly there -- I don't see how this affects the setup.
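
As a rough sketch of what those two substitutions do to a message, the Python below applies them as literal string replacements (an assumption; the s/// expressions above would be regular expressions) and compares a plain SHA-256 digest of the body before and after. The digest is only a stand-in for "anything computed over the message content"; it is not real DKIM canonicalization or verification. All addresses, names and sample messages are hypothetical.

```python
import hashlib


def redact(raw, trap_local, trap_domain, my_local, my_domain):
    """Apply the two global substitutions: domain first, then local part."""
    return raw.replace(trap_domain, my_domain).replace(trap_local, my_local)


def body_digest(raw):
    """SHA-256 over everything after the first blank line (simplified body)."""
    _, _, body = raw.partition("\n\n")
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical message whose body repeats the recipient address.
personalised = (
    "To: trap-user@trap.example\n"
    "Subject: offer\n"
    "\n"
    "Dear trap-user@trap.example, click here.\n"
)
# Hypothetical message that mentions the recipient only in the headers.
generic = (
    "To: trap-user@trap.example\n"
    "Subject: offer\n"
    "\n"
    "Click here.\n"
)

args = ("trap-user", "trap.example", "tester", "lab.example")
personalised_redacted = redact(personalised, *args)
generic_redacted = redact(generic, *args)

# True only when redaction altered the body content.
personalised_body_changed = body_digest(personalised) != body_digest(personalised_redacted)
generic_body_changed = body_digest(generic) != body_digest(generic_redacted)
```

In this sketch the body changes only when the spam itself embeds the trap address, which is where a content-based signature or checksum would stop matching; messages that name the recipient only in the headers keep an identical body.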

Martijn.

Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.