Re: [Asrg] Two ways to look at spam

Andrew Akehurst <A.D.Akehurst-99@student.lboro.ac.uk> Wed, 02 July 2003 22:13 UTC

To: asrg@ietf.org
Subject: Re: [Asrg] Two ways to look at spam
Message-ID: <1057183953.3f0358d15c92d@student-webmail.lboro.ac.uk>
From: Andrew Akehurst <A.D.Akehurst-99@student.lboro.ac.uk>
References: <20030702160005.7369.17123.Mailman@www1.ietf.org>
In-Reply-To: <20030702160005.7369.17123.Mailman@www1.ietf.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
User-Agent: IMP/PHP IMAP webmail program 2.2.8
X-Originating-IP: 172.185.57.212
X-Spam-Score: -19.2 (-------------------)
X-Scanner: exiscan for exim4 (http://duncanthrax.net/exiscan/) *19Xpq5-0007AT-Ev*PHgncWXEsHk*
X-Lboro-Filtered: bill.lut.ac.uk, Wed, 02 Jul 2003 23:12:33 +0100
Sender: asrg-admin@ietf.org
Errors-To: asrg-admin@ietf.org
X-BeenThere: asrg@ietf.org
X-Mailman-Version: 2.0.12
Precedence: bulk
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=unsubscribe>
List-Id: Anti-Spam Research Group - IRTF <asrg.ietf.org>
List-Post: <mailto:asrg@ietf.org>
List-Help: <mailto:asrg-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=subscribe>
List-Archive: <https://www1.ietf.org/pipermail/asrg/>
Date: Wed, 02 Jul 2003 22:12:33 +0000

> "Jon Kyme" <jrk@merseymail.com> writes:

> >> The former would be useful, but I'm doubtful that it would have much
> >> of an impact on spam.  The latter seems to me to rely on the sender
> >> accurately tagging their messages according to content---possibly
> >> would happen often enough that it would be worthwhile, but I'm not
> >> sure that it would.

> I'm not sure about this, there seems to me (at the most general) to
> be only one class of things that need be asserted in a consent
> expression: How this message is classified by some engine. Your
> second class seems to me to be the sort of thing that's routinely
> handled by content-filters (imperfectly, I grant you).
>
> So rather than saying:
> 1. message has html => noconsent
> 2. message mentions 'septic tank enhancement' => consent
> 3. message is from grandma => consent
> 4. message has valid consent token => consent
> 5. message has blacklisted source IP => noconsent
> etc ...
>
> You might say something more like
> positive_test(name_of_engine_1, engineargs, message) => noconsent 
> positive_test(name_of_engine_2, engineargs, message) => consent 
> etc...
 
> I guess someone could standardise this (using whatever language they
> wanted), and there are some kinds of content filter (probably quite
> simple things---the sort of thing that SIEVE can do, say) that we
> could standardise on.  That might be useful.

Here's a little idea I had which was inspired by the above. 

I'd like to apologise for pre-empting the forthcoming consent framework 
document (was it Yakov who was working on that?) but I wanted to write all of 
this down before I forget. Feel free to tear it to shreds, although 
constructive suggestions for improvement would also be nice. :-)

One approach that seems to work well for packet filtering is the iptables 
format of rules used by the Linux Netfilter module (http://www.netfilter.org). 
Perhaps a similar structure could be applied here to each e-mail message?

You could have some kind of list of rules against which an e-mail message is 
compared in sequence until it matches a rule which specifies some policy 
decision. 

The Netfilter architecture allows each rule to have an associated external 
module to evaluate a match (e.g. the "mac" module, which matches packets on 
their source MAC address, is selected using "--match mac") so that it is fully 
extensible.

For each rule there is also either a destination decision which specifies the 
fate of any packet matching that rule or else the name of another table of 
rules to be applied in the same way. Netfilter's ability to combine tables of 
tables using jumps and RETURNs allows one to construct very powerful 
combinations of rules.
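
To make the traversal concrete, here is a rough Python sketch of first-match 
evaluation with jumps and RETURN as described above. It is not Netfilter's 
code; the Rule class, the dict-based message and the table names are all made 
up for illustration, and it does not guard against tables that jump to each 
other in a loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

TERMINAL = {"ACCEPT", "DROP", "REJECT"}


@dataclass
class Rule:
    matches: Callable[[dict], bool]  # predicate over a (hypothetical) message dict
    target: str                      # terminal verdict, "RETURN", or another table's name


def evaluate(tables: Dict[str, List[Rule]], name: str, message: dict) -> Optional[str]:
    """Walk one table in order; return a verdict, or None if nothing was decided."""
    for rule in tables[name]:
        if not rule.matches(message):
            continue
        if rule.target in TERMINAL:
            return rule.target
        if rule.target == "RETURN":
            return None  # hand control back to the calling table
        verdict = evaluate(tables, rule.target, message)  # jump into a sub-table
        if verdict is not None:
            return verdict
        # The sub-table fell through or RETURNed: carry on with the next rule.
    return None


def decide(tables: Dict[str, List[Rule]], message: dict,
           start: str = "INPUT", default: str = "DROP") -> str:
    """Apply the default policy when no rule reaches a terminal verdict."""
    verdict = evaluate(tables, start, message)
    return verdict if verdict is not None else default


tables = {
    "INPUT": [
        Rule(lambda m: m["from"] == "my_mum@aol.com", "ACCEPT"),
        Rule(lambda m: bool(m["attachments"]), "ATTACHMENTS"),
        Rule(lambda m: m["content_type"] == "text/plain", "ACCEPT"),
    ],
    "ATTACHMENTS": [
        Rule(lambda m: any(a.lower().endswith(".exe") for a in m["attachments"]), "DROP"),
        Rule(lambda m: True, "RETURN"),
    ],
}
```

Calling decide(tables, message) with a message dict holding "from", 
"attachments" and "content_type" keys gives the verdict for that message.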

Message matching modules could be supplied by a range of different 
companies/programmers and the local user (if this is done in their MUA) or else 
the site admin (for a MTA) could utilise whichever modules they prefer at their 
level. Thus there might be a module to implement DNS blacklisting, one for some 
kind of C/R, a module for digital signature checking, another for content-based 
filtering and so on.
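
One way such pluggable modules might look, sketched in Python. The registry, 
the module names ("dnsbl", "content") and the option keys are assumptions of 
mine, and the "dnsbl" module here only checks a set passed in as an option 
rather than doing a real DNS lookup.

```python
from typing import Callable, Dict

# A match module takes the message and the rule's options and answers yes/no.
MatchModule = Callable[[dict, dict], bool]
MODULES: Dict[str, MatchModule] = {}


def register(name: str):
    """Decorator that makes a match module available under a name."""
    def wrap(fn: MatchModule) -> MatchModule:
        MODULES[name] = fn
        return fn
    return wrap


@register("dnsbl")
def dnsbl_match(message: dict, options: dict) -> bool:
    # Stand-in for a blacklist query: the listed addresses come from the options.
    return message["source_ip"] in options["listed"]


@register("content")
def content_match(message: dict, options: dict) -> bool:
    # Case-insensitive substring test on the message body.
    return options["contains"].lower() in message["body"].lower()


def rule_matches(module_name: str, options: dict, message: dict) -> bool:
    """Look up the module named by the rule and ask it to evaluate the message."""
    return MODULES[module_name](message, options)
```

A site admin or user would then only need to name a module and its options in 
each rule; the filter itself never needs to know how a module reaches its 
answer.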

Typical destination outcomes for an e-mail message might be:
 - silently discard the message (analogous to Netfilter's DROP)
 - bounce the message back with an error (like Netfilter's REJECT)
 - accept the message for delivery (like Netfilter's ACCEPT)
 - log part or all of the message for use in spam statistics and abuser tracing 
(like Netfilter's LOG; processing need not terminate after doing this)

There might be other policy options too; this isn't intended to be an 
exhaustive list. Just as with Netfilter, the table will need some kind of 
default policy for messages that don't match any of the rules listed. Users 
could choose a fail-open (ACCEPT) or fail-closed (DROP) approach depending on 
their preferences.

So using a pseudo-Netfilter syntax, my spam filtering INPUT table might look 
something like this:

  --source my_mum@aol.com -j ACCEPT

  --match content --content-type text/html --contains JavaScript -j DROP

  --match content --content-type text/html --contains InvalidHTMLTags -j DROP

  --source friend@somewhere.net --match attachment
      --file-type EXE,SCR,PIF,BAT,VBS -j REJECT

  --match attachment --file-type EXE,SCR,PIF,BAT,COM -j DROP

  --source trusted_colleague@work.com --match attachment
      --file-type JPEG,GIF -j ACCEPT

  --match content --content-type text/plain -j ACCEPT

... with a default policy of DROP for anything else.
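
Since the syntax above is only pseudo-Netfilter, here is an equally rough 
Python sketch of how one such rule line might be read into a structure. It 
assumes every flag takes exactly one value, which holds for the rules above 
once a wrapped rule is joined onto one line; the parse_rule name and the dict 
layout are my own invention.

```python
import shlex


def parse_rule(line: str) -> dict:
    """Split a pseudo-Netfilter rule into source, match module, options and target."""
    tokens = shlex.split(line)
    if len(tokens) % 2:
        raise ValueError(f"every flag needs exactly one value: {line!r}")
    rule = {"source": None, "match": None, "options": {}, "target": None}
    # Walk the tokens as (flag, value) pairs.
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if flag == "-j":
            rule["target"] = value
        elif flag == "--source":
            rule["source"] = value
        elif flag == "--match":
            rule["match"] = value
        elif flag.startswith("--"):
            rule["options"][flag[2:]] = value  # module-specific option, kept as text
        else:
            raise ValueError(f"unrecognised flag {flag!r}")
    if rule["target"] is None:
        raise ValueError(f"rule has no -j target: {line!r}")
    return rule
```

For example, parse_rule("--source my_mum@aol.com -j ACCEPT") yields a rule 
whose source is the address and whose target is ACCEPT, with no match module 
and no options.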

Notice that I'm willing to send a rejection message to one of my friends to 
bounce a message back, as it seems only polite to warn them. But for most 
senders I would silently discard suspected spam in order to avoid giving away 
the fact that my address is valid and thus incurring more.

This is of course merely an example of how I might specify my personal 
preferences. I'm not suggesting that anyone else should set theirs this way, 
nor would I presume to tell other people what their default should be. I think 
it should be entirely at the recipient's discretion as to what they choose to 
receive. As somebody who knows more about e-mail than the average user, I'm 
prepared to accept the risk of some genuine messages being dropped provided I 
can trust the rules and filter modules I'm using. But this is just my own 
preference.

Incidentally I know the above syntax is ugly and unfriendly to end-users. It's 
just an example, of course. However it can be made much easier by providing 
simple forms or graphical tools for the user in order to generate the rules on 
their behalf. If this were integrated into the MUA and tied in with their 
address book it would be extremely simple to use.

To simplify usability further, each site (organisation or ISP) might provide a 
series of default policies, ranging from "high spam protection" to "no spam 
protection". Users could choose a level based on how strongly they feel about 
the issue.

One suggestion I've not seen so far is rather like the Internet Explorer 
classification of web sites into "zones" depending on their level of trust. 
Users might classify senders into zones in the MUA address book, or else define 
some rules which can map an individual message into a "zone". Then a default 
policy is provided for every zone (which advanced users are free to tweak and 
define their own "custom" settings). Of course it's hard to classify e-mail 
messages into such zones because it's difficult to determine their true origin, 
so perhaps that idea would be unworkable. Anyway this is just an aside, not 
essential to the idea I'm describing.

One other thing occurred to me based on the Netfilter comparison. Netfilter has 
several built-in sets of rules (strictly it calls them chains, though I'll keep 
saying tables here) based on what stage of routing the packet has arrived at. 
So there is a PREROUTING table for handling packets before they've undergone 
any NAT or mangling, then either the INPUT or FORWARD table (depending on the 
routing decision) gets a chance to process them again after translation.

Under an analogous e-mail filtering system, one could define tables of rules as 
follows:

 - ARRIVAL
 - INPUT
 - FORWARD

ARRIVAL - for rules to process a message at the time at which the SMTP client 
connection is open, before the message has been accepted and enqueued. Thus at 
some suitable point before issuing a "250 Mail queued for delivery" we check 
the message against the rules and decide what to do with it. The ARRIVAL rule 
table will probably only apply to MTAs.

A decision to ACCEPT would be like a "250 Mail queued" kind of message.
A decision to DROP might map onto a "250 Mail queued" reply code but where the 
message is then dropped without placing it into the mail queue.
A decision to REJECT might map onto a "5xy" permanent refusal reply code.

... and so on.
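
The mapping above could be written down as a small lookup, sketched here in 
Python. The choice of 550 as the particular "5xy" code and the reply texts are 
assumptions for the example, not a recommendation.

```python
# verdict -> (SMTP reply code, reply text, whether the message really gets queued)
SMTP_REPLIES = {
    "ACCEPT": (250, "Mail queued for delivery", True),
    "DROP": (250, "Mail queued for delivery", False),   # looks accepted, silently discarded
    "REJECT": (550, "Message refused by recipient policy", False),
}


def smtp_response(verdict: str):
    """Return the reply line to send to the client and whether to enqueue the message."""
    code, text, enqueue = SMTP_REPLIES[verdict]
    return f"{code} {text}", enqueue
```

Note that ACCEPT and DROP give the client an identical reply line; only the 
enqueue flag differs, which is what keeps a DROP invisible to the sender.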

Another type of policy might be a DELAY, perhaps mapped onto a transient "4xy" 
refusal code. The delay policy might be used to limit the rate at which 
suspected spam spreads around the net, by forcing it to remain in the previous 
MTA's queue. Of course this would likely require some stateful tracking of 
messages so it might be difficult in practice. I'll leave that one for the 
experts to decide. 
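
For what it's worth, one simple form the stateful tracking might take is 
sketched below in Python: remember when a given delivery attempt was first 
seen and keep answering with a transient code until a minimum delay has 
passed. The key (whatever identifies "the same message", e.g. sender, 
recipient and client IP), the 451 code and the 300-second delay are all 
assumptions, and a real MTA would need to expire old entries.

```python
from typing import Dict, Hashable, Optional


class DelayTracker:
    """Defer a delivery attempt until min_delay seconds after it was first seen."""

    def __init__(self, min_delay: float = 300.0) -> None:
        self.min_delay = min_delay
        self.first_seen: Dict[Hashable, float] = {}

    def check(self, key: Hashable, now: float) -> Optional[str]:
        """Return a transient refusal line, or None once the delay has elapsed."""
        first = self.first_seen.setdefault(key, now)  # record the first attempt only
        if now - first < self.min_delay:
            return "451 Temporarily deferred by policy, please retry later"
        return None
```

The current time is passed in rather than read from the clock so that the 
behaviour is easy to check; in use one would pass time.time().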

There might be other kinds of policy too... I don't want to be too prescriptive 
at this stage.

Rules in the ARRIVAL chain might also add headers to the messages. For example, 
the addition of "X-SpamScore" headers by spam filter modules might be done 
here. 
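
As a small illustration of a header-adding rule, here is a Python sketch 
using the standard email package. The add_spam_score name and the decision to 
throw away any "X-SpamScore" already present (since the sender could have 
forged it) are my assumptions.

```python
from email import message_from_string
from email.message import Message


def add_spam_score(msg: Message, score: float) -> Message:
    """Stamp the message with a single X-SpamScore header."""
    del msg["X-SpamScore"]  # removes every existing copy; no error if absent
    msg["X-SpamScore"] = f"{score:.1f}"
    return msg


raw = "From: someone@example.com\nX-SpamScore: -99\nSubject: hello\n\nbody text\n"
stamped = add_spam_score(message_from_string(raw), 7.5)
```

Later tables (or the user's MUA) could then match on the header value instead 
of re-running the filter.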

INPUT - for a message which has been accepted by the ARRIVAL table and which is 
destined for local delivery. The ISP's MTA might have a special set of rules 
which it only applies to mailboxes on its own servers; these rules could be 
entered here.

Also, a user's MUA could have its own additional INPUT table for messages which 
the ISP has not blocked. A user might thus implement their own spam policy on 
their local machine in the event that their ISP's generic policy isn't good 
enough for them. My personal example above would fit in here, filtering 
messages as they are downloaded from the server (via POP3, IMAP or whatever 
protocol they like). I think IMAP has some interesting possibilities in its own 
right, but that's an aside.

FORWARD - for a message which has been accepted by the ARRIVAL table of an MTA 
but which is to be passed on elsewhere to some server which is outside the 
control of the organisation running this MTA.
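
The choice between the INPUT and FORWARD tables is the mail equivalent of 
Netfilter's routing decision. A minimal Python sketch, assuming the MTA simply 
knows which recipient domains it delivers locally (the function name and the 
example domains are hypothetical):

```python
from typing import Set


def table_for(recipient: str, local_domains: Set[str]) -> str:
    """Pick the table a message goes to after ARRIVAL has accepted it."""
    domain = recipient.rsplit("@", 1)[-1].lower()  # domain part, case-insensitive
    return "INPUT" if domain in local_domains else "FORWARD"
```

So with local_domains = {"example.org"}, mail for bob@example.org would be 
checked against INPUT and mail for anyone elsewhere against FORWARD.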

Rules included here might be based on reciprocal spam-blocking agreements 
between the organisations which operate MTAs (ISPs, companies, whoever). There 
might be scope for some kind of social co-operation here by favouring messages 
to/from well-behaved ISPs. Protocols for sharing/co-ordinating policy might be 
used to keep FORWARD rules up-to-date.

As a simple example of an extreme, a honeypot server which traps spam but never 
delivers might have a FORWARD table consisting of a single LOG rule to keep a 
copy of the message for later analysis and a default DROP policy so nothing 
gets externally delivered.
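
The honeypot case is small enough to sketch in full. In this Python sketch a 
table is just a list of (predicate, target) pairs, LOG is the one 
non-terminating target, and the log is an in-memory list; all of that is 
assumed for the sake of the example.

```python
from typing import Callable, List, Tuple

RuleList = List[Tuple[Callable[[dict], bool], str]]


def run_table(rules: RuleList, default: str, message: dict, log: list) -> str:
    """First terminal match wins; LOG records the message and keeps going."""
    for matches, target in rules:
        if not matches(message):
            continue
        if target == "LOG":
            log.append(message)  # keep a copy for later analysis
            continue             # LOG does not end processing
        return target
    return default


# Honeypot FORWARD table: log everything, deliver nothing.
honeypot_forward: RuleList = [(lambda m: True, "LOG")]

captured: list = []
verdict = run_table(honeypot_forward, "DROP", {"subject": "cheap pills"}, captured)
```

Every message falls through the single LOG rule to the default policy, so the 
verdict is always DROP while the log keeps growing.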

Of course I've not said anything about how the modules would communicate with 
such a system, nor about how (if at all) individual modules might communicate 
with each other. Suggestions or comments about that aspect of the system would 
be helpful to flesh out some detail.

> It's not a solution to spam, though, because some things really are
> things that can't be checked automatically, so the content filtering
> will be imperfect.  And (if it were to be standardised) we can expect
> it to become more and more imperfect.

Maybe filtering will be good enough for long enough to allow time to deploy a 
better long-term solution. At any rate it can give everyone some breathing 
space.

Sorry this turned into such a huge message; I probably got carried away. Things 
I like about this idea are:

  - it could express consent policy at different stages in the network, as 
discussed previously on the list. There is scope for some mechanism (protocol?) 
to synchronise or distribute policies across the internet. I will leave the 
possible design of this to other people.

  - it is independent of the SMTP transport protocol and therefore does not 
require changes to the SMTP standard (of course, policy decisions will need to 
be mapped onto SMTP response codes somehow).

  - it doesn't need to be deployed by the whole world at once in order to reap 
benefits, just as Gordon has pointed out about his personal consent system. 
I've tried to show how some of his suggested categories of things might be 
matched and blocked in my INPUT example above.

Things that worry me about it:
  - it assumes that there are suitable filtering techniques (modules) available 
to fit into this system.

  - it needs the writers of MUAs to get on-board and for users to upgrade to 
newer, more capable MUAs. I suspect that the users who hate spam enough would 
be quite willing to download better software if their ISP held them by the 
hand. Those who don't upgrade will still be able to communicate with the rest 
of the world, subject to the spam detection policies of their recipients.

  - it would work best if MTAs were also redesigned to employ the scheme. Of 
course the interoperability means it can be deployed on a small scale first and 
gradually increased.

Thanks for reading...

Andrew
