Re: [Asrg] "Uncaught spam" research project

Bill Cole <asrg3@billmail.scconsult.com> Fri, 30 April 2010 18:55 UTC


Martijn Grooten wrote, On 4/30/10 10:37 AM:
[...]
>
> [1] Spam in the context of this email is spam sent to spam traps. So the
> real, proper spam, not the perhaps-not-100%-CAN-SPAM-compliant spam.

That skews your sample quite a bit. A significant fraction of the hard cases 
these days are CAN-SPAM compliant campaigns sent by well-meaning originators 
using mostly-whitehat ESPs. The vast majority of spam is nothing like that, 
but the vast majority is also fairly easy to shun for all but the largest 
and cheapest providers. The spam that ends up in a real user's normal Inbox 
(where it is most annoying to them) is unlikely to ever hit a trap address. 
On the other side, traps vary in provenance, so there is a lot of variation 
in what they get. But as a direct corollary of the fact that you can call an 
address a trap, the mail sent to it will be coming from the less careful and 
less clueful spammers. That tilt won't be the same for every sort of trap, 
but it will be there for all traps to some degree.

That does not make your research pointless, but it is important to 
understand the skew and the subjective "cost" of different sorts of uncaught 
spam in applying whatever you find.

> [2] Several of these make use of open source filters (e.g. SpamAssassin),
> so it's fair to say that most filters are covered.

Not so much. One real-world distinction between vendorware wrapping SA and 
SA deployed consciously is that the former is often bought as an alternative 
to employing skilled staff, while the latter is likely to be a tool that is 
constantly being adjusted and enhanced by skilled staff. Sites vary in what 
spam they get, what non-spam they get, and FP tolerance. Spammers adapt 
somewhat over time to filtering tactics, especially to SA because it is the 
dominant open source filtering tool. A commercial filter built around SA is 
likely to use an older version, with its ornate configurability set so that 
the defaults are safe for any site and local adjustment is exposed in only 
the simplest ways, while a well-managed SA deployment is likely to be kept 
current and to depart from the distributed config defaults in ways that 
would be intolerable for other sites.

Again, that is an issue that speaks primarily to interpretation and 
application of your results, rather than to the whole plan. If you don't 
have a skilled SA wrangler handy you cannot test the results of a customized 
and tuned SA deployment, and that is one of the strong arguments against 
using straight SA in the real world.

> [3] I would love to include DKIM, but I can only distinguish between does
> have and does not have a DKIM-signature; the redacting of emails to hide
> the original recipient makes me unable to decide whether a present
> signature was actually valid.

Probably not a big loss in itself, as DKIM correlation with spamminess is 
very sensitive to the sort of mailstream one has, and in complex ways.

HOWEVER, this raises a very serious design pitfall. If you plan to feed 
redacted messages to filters, you need to make sure that the filters can be 
made to ignore whatever redaction-spoor is in those messages. The simplest 
example is the one you give: broken signatures. A valid signature may 
correlate very poorly to validity of the mail while a bad signature may 
correlate quite well to invalidity. As a general rule, redaction for the 
purpose of hiding individual identity naturally tends to make all of the 
redacted messages a little bit more like each other in ways that may be 
obvious or may be subtle, and many filters are designed to look for patterns 
of similarity as evidence of spam.
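
To make the two redaction effects above concrete, here is a rough Python 
sketch. Everything in it is an assumption for illustration: the placeholder 
address, the crude address regex, the two sample bodies, and the token-set 
Jaccard measure are all made up, and no real filter is claimed to work this 
way. It shows (a) that rewriting an address in the body changes the body hash 
a DKIM verifier would compute under "simple" body canonicalization, so a 
previously valid signature can no longer verify, and (b) that substituting 
one common placeholder makes two otherwise unrelated messages measurably more 
alike.

```python
import base64
import hashlib
import re

# Hypothetical placeholder that a redaction step might substitute.
PLACEHOLDER = "redacted@example.invalid"
# Crude address matcher, good enough for this illustration only.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace every address-like token with the same placeholder."""
    return ADDRESS.sub(PLACEHOLDER, text)


def simple_body_hash(body: str) -> str:
    """Base64 SHA-256 of the body under DKIM 'simple' body canonicalization:
    trailing empty lines are dropped and the body ends with one CRLF."""
    lines = body.replace("\r\n", "\n").split("\n")
    while lines and lines[-1] == "":
        lines.pop()
    canonical = "".join(line + "\r\n" for line in lines) or "\r\n"
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-delimited token sets."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


# Two made-up trap messages that share almost nothing before redaction.
body_a = "Hello alice@trap-one.example, your invoice is attached.\r\n"
body_b = "Greetings bob@trap-two.example, claim your prize today.\r\n"

hash_changed = simple_body_hash(body_a) != simple_body_hash(redact(body_a))
similarity_before = jaccard(body_a, body_b)
similarity_after = jaccard(redact(body_a), redact(body_b))

print("body hash changed by redaction:", hash_changed)
print("similarity before redaction:", round(similarity_before, 3))
print("similarity after redaction: ", round(similarity_after, 3))
```

The same reasoning applies to headers: if the signer covered the To header 
(which is common), redacting it breaks the signature even when the body is 
left untouched. The similarity number here is only a toy measure, but any 
filter that fingerprints or clusters messages is exposed to the same kind of 
shift.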

Ultimately I think it is so hard to be sure that you are avoiding 
significant effects from redaction that researching filters with redacted 
inputs is a total waste of time. I don't really understand the point of 
redaction for this anyway, since the addresses are traps. The case for 
recipient redaction may be plausible when spam is being reported in public 
or to untrusted parties, i.e. to protect individual privacy or to avoid the 
effects of disclosing a trap address. For filter testing with trap spam, the 
only risk of disclosure would be if a filter works in some collaborative 
fashion akin to DCC or Razor/Pyzor. If you are afraid of that resulting in 
harmful disclosure of your trap addresses, then you have an intractable 
problem. The GIGO principle applies, and any modification of your inputs 
from what a filter would see in the real world makes your inputs garbage.