Re: [Asrg] "Uncaught spam" research project

Martijn Grooten <martijn.grooten@virusbtn.com> Tue, 04 May 2010 12:15 UTC

From: Martijn Grooten <martijn.grooten@virusbtn.com>
To: "asrg@irtf.org" <asrg@irtf.org>
Date: Tue, 04 May 2010 13:15:06 +0100
Thread-Topic: [Asrg] "Uncaught spam" research project
Thread-Index: AcrolsLiIFMyYmpJR6iIX1LBqFTycwC6ZLSg
Message-ID: <18B53BA2A483AD45962AAD1397BE132537A28C4EDF@UK-EXCHMBX1.green.sophos>
References: <18B53BA2A483AD45962AAD1397BE1325379ED80C30@UK-EXCHMBX1.green.sophos> <4BDB279A.1050606@billmail.scconsult.com>
In-Reply-To: <4BDB279A.1050606@billmail.scconsult.com>
Accept-Language: en-US, en-GB
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [Asrg] "Uncaught spam" research project
Precedence: list
Reply-To: Anti-Spam Research Group - IRTF <asrg@irtf.org>

Bill Cole wrote:
> That skews your sample quite a bit. A significant fraction of the hard cases
> these days are CAN-SPAM compliant campaigns sent by well-meaning originators
> using mostly-whitehat ESP's. The vast majority of spam is nothing like that,
> but the vast majority is also fairly easy to shun for all but the largest
> and cheapest providers. The spam that ends up in a real user's normal Inbox
> (where it is most annoying to them) is unlikely to ever hit a trap address.
> On the other side, traps vary in provenance so there is a lot of variation
> in what they get but as a direct corollary of the fact that you can call an
> address a trap, the mail it is sent will be coming from less careful and
> clueful spammers. That tilt won't be the same for every sort of trap, but it
> will be there for all traps to some degree.
>
> That does not make your research pointless, but it is important to
> understand the skew and the subjective "cost" of different sorts of uncaught
> spam in applying whatever you find.

Oh, I absolutely agree. There are many reasons why I want to concentrate on spam-trap spam (perhaps the most important one being that for these messages it's so much easier to decide whether they are actually both spam and unwanted -- they are both, more or less by definition), but I am well aware that it only covers part of the spam, and a part that's already easy to filter (though, arguably, also a part where not filtering, especially in the case of phishing, can be more dangerous). I might run the same project on a different spam corpus at some point in the future.

> Again, that is an issue that speaks primarily to interpretation and
> application of your results, rather than to the whole plan. If you don't
> have a skilled SA wrangler handy you cannot test the results of a customized
> and tuned SA deployment, and that is one of the strong arguments against
> using straight SA in the real world.

Good point. For this very reason I'm not testing SA as part of the comparative test; that just wouldn't be fair on it.

> > [3] I would love to include DKIM, but I can only distinguish between does
> > have and does not have a DKIM-signature; the redacting of emails to hide
> > the original recipient makes me unable to decide whether a present
> > signature was actually valid.
>
> Probably not a big loss in itself, as DKIM correlation with spamminess is
> very sensitive to the sort of mailstream one has, and in complex ways.

Actually, (almost) no spam-trap spam I receive has a DKIM signature, valid or not.
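
To illustrate the "does have / does not have" distinction from [3]: it amounts to a header presence check, with no attempt at verification. A minimal Python sketch, assuming the messages are available as raw RFC 5322 text; the function name has_dkim_signature is hypothetical and not part of the actual test setup:

```python
from email import message_from_string


def has_dkim_signature(raw_message: str) -> bool:
    """Return True if the message carries a DKIM-Signature header.

    This only checks for presence. It says nothing about validity,
    which cannot be judged once the recipient has been redacted.
    """
    msg = message_from_string(raw_message)
    # Header lookup in the email package is case-insensitive.
    return msg.get("DKIM-Signature") is not None
```

Verifying the signature would need the unmodified signed headers and body, which is exactly what redaction takes away.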

> HOWEVER, this raises a very serious design pitfall. You need to make sure if
> you plan to feed redacted messages to filters that they can be made to
> ignore whatever redaction-spoor is in those messages. The simplest example
> is the one you give: broken signatures. A valid signature may correlate very
> poorly to validity of the mail while a bad signature may correlate quite
> well to invalidity. As a general rule, redaction for the purpose of hiding
> individual identity naturally tends to make all of the redacted messages a
> little bit more like each other in ways that may be obvious or may be
> subtle, and many filters are designed to look for patterns of similarity as
> evidence of spam.
>
> Ultimately I think it is so hard to be sure that you are avoiding
> significant effects from redaction that researching filters with redacted
> inputs is a total waste of time. I don't really understand the point of
> redaction for this anyway, since the addresses are traps. The case for
> recipient redaction may be plausible when spam is being reported in public
> or to untrusted parties, i.e. to protect individual privacy or the effects
> of disclosing a trap address. For filter testing with trap spam, the only
> risk of disclosure would be if a filter works in some collaborative fashion
> akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful
> disclosure of your trap addresses, then you have an intractable problem. The
> GIGO principle applies, and any modification of your inputs from what a
> filter would see in the real world makes your inputs garbage.

I agree with you in theory. And I know one should be wary of assuming things about spammers' behaviour.

However, all that happens when redacting the messages is something along the lines of

s/spamtrapdomain/mydomain/g

and

s/spamtrap-local-part/my-local-part/g

Apart from signatures -- which, in practice, turn out to be hardly there -- I don't see how this affects the setup.
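
For concreteness, the two substitutions above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the four constants are hypothetical placeholders (the real trap and replacement names are not given here), matching is case-sensitive as in the plain s///g form, and changed_lines is a made-up helper for seeing which lines of a message the redaction touches.

```python
import re

# Hypothetical placeholders; the real values are not given in this thread.
TRAP_DOMAIN = "spamtrapdomain.example"
MY_DOMAIN = "mydomain.example"
TRAP_LOCAL = "spamtrap-local-part"
MY_LOCAL = "my-local-part"


def redact(raw: str) -> str:
    """Apply both global substitutions to the whole raw message."""
    # re.escape keeps dots and hyphens literal rather than regex syntax.
    raw = re.sub(re.escape(TRAP_DOMAIN), MY_DOMAIN, raw)
    raw = re.sub(re.escape(TRAP_LOCAL), MY_LOCAL, raw)
    return raw


def changed_lines(original: str, redacted: str) -> list[int]:
    """Return 1-based numbers of the lines that the redaction modified."""
    pairs = zip(original.splitlines(), redacted.splitlines())
    return [n for n, (a, b) in enumerate(pairs, start=1) if a != b]
```

Running changed_lines over a sample of messages would show whether the substitutions touch anything beyond the recipient headers, for instance a local part repeated in the body or in a URL.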

Martijn.

Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.
