Re: [Asrg] Hotmail used for scripted spam (from SlashDot)

"Kurt Magnusson" <kmn_asgr@hotmail.com> Wed, 11 June 2003 19:37 UTC

Received: from www1.ietf.org (ietf.org [132.151.1.19] (may be forged)) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA26108 for <asrg-archive@odin.ietf.org>; Wed, 11 Jun 2003 15:37:04 -0400 (EDT)
Received: (from mailnull@localhost) by www1.ietf.org (8.11.6/8.11.6) id h5BJaaK28498 for asrg-archive@odin.ietf.org; Wed, 11 Jun 2003 15:36:36 -0400
Received: from ietf.org (odin.ietf.org [132.151.1.176]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id h5BJaam28495 for <asrg-web-archive@optimus.ietf.org>; Wed, 11 Jun 2003 15:36:36 -0400
Received: from ietf-mx (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA25828; Wed, 11 Jun 2003 15:36:34 -0400 (EDT)
Received: from ietf-mx ([132.151.6.1]) by ietf-mx with esmtp (Exim 4.12) id 19QBMb-000329-00; Wed, 11 Jun 2003 15:34:29 -0400
Received: from ietf.org ([132.151.1.19] helo=www1.ietf.org) by ietf-mx with esmtp (Exim 4.12) id 19QBMb-000325-00; Wed, 11 Jun 2003 15:34:29 -0400
Received: from www1.ietf.org (localhost.localdomain [127.0.0.1]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id h5BJVim28314; Wed, 11 Jun 2003 15:31:45 -0400
Received: from ietf.org (odin.ietf.org [132.151.1.176]) by www1.ietf.org (8.11.6/8.11.6) with ESMTP id h5BJTkm28235 for <asrg@optimus.ietf.org>; Wed, 11 Jun 2003 15:29:46 -0400
Received: from ietf-mx (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA23924 for <asrg@ietf.org>; Wed, 11 Jun 2003 15:29:43 -0400 (EDT)
Received: from ietf-mx ([132.151.6.1]) by ietf-mx with esmtp (Exim 4.12) id 19QBFy-0002zw-00 for asrg@ietf.org; Wed, 11 Jun 2003 15:27:38 -0400
Received: from bay2-f102.bay2.hotmail.com ([65.54.247.102] helo=hotmail.com) by ietf-mx with esmtp (Exim 4.12) id 19QBFy-0002zg-00 for asrg@ietf.org; Wed, 11 Jun 2003 15:27:38 -0400
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Wed, 11 Jun 2003 12:27:54 -0700
Received: from 192.85.16.3 by by2fd.bay2.hotmail.msn.com with HTTP; Wed, 11 Jun 2003 19:27:54 GMT
X-Originating-IP: [192.85.16.3]
X-Originating-Email: [kmn_asgr@hotmail.com]
From: Kurt Magnusson <kmn_asgr@hotmail.com>
To: asrg@ietf.org
Subject: Re: [Asrg] Hotmail used for scripted spam (from SlashDot)
Mime-Version: 1.0
Content-Type: text/plain; format="flowed"
Message-ID: <BAY2-F102jrGIWJiE9Y0000a464@hotmail.com>
X-OriginalArrivalTime: 11 Jun 2003 19:27:54.0593 (UTC) FILETIME=[8E635510:01C3304F]
Sender: asrg-admin@ietf.org
Errors-To: asrg-admin@ietf.org
X-BeenThere: asrg@ietf.org
X-Mailman-Version: 2.0.12
Precedence: bulk
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=unsubscribe>
List-Id: Anti-Spam Research Group - IRTF <asrg.ietf.org>
List-Post: <mailto:asrg@ietf.org>
List-Help: <mailto:asrg-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=subscribe>
List-Archive: <https://www1.ietf.org/pipermail/asrg/>
Date: Wed, 11 Jun 2003 17:27:54 -0200
X-MIME-Autoconverted: from 8bit to quoted-printable by www1.ietf.org id h5BJVim28314
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by www1.ietf.org id h5BJaam28495
Content-Transfer-Encoding: 8bit

Yakov Shafranovich wrote:
......
>Note, that a CRI protocol will not help here since the HotMail C/R system 
>will authenticate the sender. Only active involvement by the ISP and 
>limiting amount of outgoing email will help. RMX will not resolve this 
>issue either.
.....

Hmm, very quiet on this one. Are we then back to square one? What will this 
mean for other web mail vendors? There are a number of Perl/PHP scripts in 
circulation that batch-connect to retrieve mail from Hotmail, Lycos and 
others, and they could be modified to send instead. I already thought that 
C/R systems were a bit of a colossus on clay feet, but most discussion 
during May/June seems to have been about which C/R method is best, not so 
much about whether C/R is practical at all, or about methods simpler than 
C/R and RMX, which force us to make more drastic changes to our 
infrastructure.

About two months ago, I presented a complementary methodology for spam 
detection that I call the "Earnest" method, because it is based on the only 
earnest data that exists in spam: the URLs and phone numbers.

At that time Kee Hinckley had some serious objections to my ideas, so I 
spent some time on a more thorough survey, increasing the data from my 
original 6,000 spams of my own with some 40,000 from the Spamhaus archives 
and another 15,000 from someone active on the European RIPE anti-spam 
discussion list, i.e. a total of roughly 60,000 spams. Hopefully that gives 
me a statistically more significant amount of spam with which to revisit my 
idea.

As background: from my own spams I extracted 1,500 unique URLs and phone 
numbers, from Spamhaus another 5,000, and from RIPE the last 2,500, reduced 
to the base patterns something.com, something.biz, something.co.tw and so 
on. After merging my own data with Spamhaus, 1,200 entries were common to 
both, giving a total of 5,300 unique entries. After adding the RIPE data, 
another 800 were common, giving a total of about 7,000 unique URLs and 
numbers. When reviewing them visually, I also found that many were so alike 
in naming structure that they are obviously related, and that the data 
probably represents no more than 3,000-4,000 real web sites.

My conclusion is that even if we are dealing with hundreds of millions of 
spams, we are, as Spamhaus estimates with its count of really active 
spammers (200+), only dealing with a limited number of domains and phone 
numbers, maybe fewer than 100,000. So if we concentrate the effort, as I 
laid out in the Earnest method presentation, on extracting that _real_ 
data, the filtering task becomes easier and faster and requires fewer 
resources.

As I said last time, the key point is that if you send spam, you, as a 
spammer, need to make it easy for the user to get to the site, so that you 
get your dues; i.e. you must supply a working URL or a readable phone 
number. If you encode that URL in any way other than encoding the whole 
message in base64 (a quite possible spammer counter-move), e.g. as "%xx" or 
"&#xxx;" web character codes, you force your target to type in the whole 
URL by hand, which leads to a drastically reduced hit rate for the 
spammer's customer. Simplicity is the main driver behind spam: if the 
process doesn't produce, the customers will not pay for spam, a simple 
economic maxim IT people often tend to forget.

And as far as I have found in my tests, you cannot character-encode either 
the "a href" or the "http://" part of a URL, because the link will then not 
work. Therefore these parts will always be in clear text and greppable. If 
those lines are decoded for any character encoding of the addresses and 
compared in a case-insensitive free-text search, the addresses cannot hide: 
if their URL is there, you get a hit. And if they try to put in more URLs, 
they still don't get away. They could try to disguise the address with a 
line break inside it, but if the filter checks for breaks before the 
closing ">", that can be handled too. All this is simpler than analyzing 
the whole letter.

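To make this concrete, the decode-and-match step can be sketched roughly 
like this (a test rig only, not a finished filter; "spamlist.txt" is a 
made-up name for the list of known spam URLs and phone numbers, one entry 
per line):

  #!/bin/sh
  # decode-and-match.sh MESSAGE
  # Replace every %xx and &#nnn; token with the character it encodes,
  # then match the decoded text case-insensitively against the list.
  awk '
  function hex2dec(h,  i, n) {
      n = 0; h = tolower(h)
      for (i = 1; i <= length(h); i++)
          n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
      return n
  }
  {
      s = $0; out = ""
      while (match(s, /%[0-9a-fA-F][0-9a-fA-F]|&#[0-9][0-9]*;/)) {
          out = out substr(s, 1, RSTART - 1)
          tok = substr(s, RSTART, RLENGTH)
          if (tok ~ /^%/)
              out = out sprintf("%c", hex2dec(substr(tok, 2, 2)))
          else
              out = out sprintf("%c", substr(tok, 3, RLENGTH - 3) + 0)
          s = substr(s, RSTART + RLENGTH)
      }
      print out s
  }' "$1" | grep -i -q -f spamlist.txt && echo SPAM || echo CLEAN

Once the addresses are decoded, a plain case-insensitive grep is all the 
matching that is needed.
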
One of Kee's objections was that this would be a resource-demanding filter. 
After studying the issue further, my only concern is increased use of MIME 
encoding. The tests I have done with sed-based character decoding and 
nawk-based "<!--" comment-stripping filters have shown minor resource 
demands that most sites, except the large ISPs, will have no problem with. 
Comment encoding is mainly a problem for phone number matching, not for 
URLs, where it breaks the link. Neither sed nor nawk is the best tool for 
this, so the results should improve dramatically with a purpose-designed 
filter compared to my present solution, which is just a shell-script test 
filter.
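
For the curious, the nawk comment stripper is not much more than this (a 
sketch only; it joins lines until an open "<!--" is closed, then deletes 
the comment):

  # strip-comments.awk: remove HTML comments, including ones spanning
  # lines, so that 555<!--x-->1234 becomes 5551234 before number matching.
  # Run as: nawk -f strip-comments.awk mailfile
  {
      line = line $0
      while ((i = index(line, "<!--")) > 0) {
          j = index(substr(line, i), "-->")
          if (j == 0) { line = line " "; next }  # comment still open
          line = substr(line, 1, i - 1) substr(line, i + j + 2)
      }
      print line; line = ""
  }
  END { if (line != "") print line }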

But how does my list of 7,000 URLs and numbers work in practice? Well, I 
have relegated my old mail account to a test account, protected by a basic 
shell script: incoming mail is filtered and matched against the list by a 
simple grep command. The list also includes the names/IP numbers of some 
2,000 open mail relays. Yesterday I received the first two spams into that 
box in two months (it gets about 10 spams a day). I also sent my present 
mailbox over to it to check for false positives, and after a clean-out of 
some 1,000 stale mail relay entries in mid-May, I had no false positives at 
all. Given that the test script does not yet have any more advanced 
features such as MIME decoding, and that it greps over the whole text 
without looking for broken lines in the URLs and so on, I regard this as a 
fairly good result.
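
The protecting shell script itself is trivial; the whole test filter boils 
down to something like this (the list file name is made up):

  #!/bin/sh
  # spamcheck.sh: read one mail on stdin, exit 0 if any line hits an
  # entry in the combined URL/phone/relay list, 1 if the mail is clean.
  # Hook it in from e.g. procmail and divert hits to a spam folder.
  LIST=${LIST:-$HOME/spamlist.txt}
  grep -i -q -f "$LIST"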

As an example, when I went from my 1,500 URLs and 3,000 relays to the 
5,300 + 2,000 including Spamhaus, I went from 1 spam in 10 getting through 
to 1 in 300.

Another issue Kee brought up, to which I had no good solution at the time, 
was how to keep the target list up to date. A number of actions are needed 
to handle this:

1. There will always be some spam that goes directly to a targeted domain. 
Set up an internal mailbox for forwarding spam (larger corporations/sites 
might want a number of addresses, so that no single address becomes too 
widely known), with a spam processor behind it that extracts the URLs and 
phone numbers (see the sketch after this list). The local users can 
forward their spam there, creating locally adapted blocking. As my own 
results show, this method catches most of the addresses I encounter over a 
longer period, 9 out of 10. (OK, this part might be a problem, due to 
Jacob Nielsen's patent.)

2. Given the size of the problem, it would be prudent for the US FTC and 
its EU counterpart to run honeypots, where spam is collected in the same 
way as for the local spambox, to produce official lists that can be merged 
with the local ones. With a large number of addresses spread over a number 
of domains (all feeding the same end mailbox), distributed on "fake" web 
pages and in Usenet, these honeypots should gain momentum quickly, 
producing lists that from the beginning catch about 80% of the spam. 
Together with the local lists, I estimate that these two would account for 
blocking over 95% of the possible spam to an individual site within a 
couple of weeks.

3. Anyone interested could set up their own honeypots and publish their 
own lists, increasing the hit rate further. Yes, these could be tainted if 
discovered; that was another of Kee's original objections. But my answer 
is: what is the chance that a spammer would include a URL that my Aunt 
Agatha would ever think of sending me? Probably nil. They cannot win this 
one, since it only affects mail containing exactly those URLs/numbers, 
never the web itself. If we know that msn.com always gets filtered out 
because of the spammers, well, why send that address as a link at all? A 
nuisance, yes, but we have adopted shorthand for other situations. The 
only thing the spammer achieves is that some lists grow much bigger than 
the rest and thereby become suspect.

4. With a Bayesian filter on the URLs, the filter could learn to catch 
related new URLs without the spam list being updated. Because there is far 
less data to analyze than in the whole letter, it is also faster. With so 
many related sites, this would help a lot.

5. Joining the existing local list with an external master list is simply 
a "sort | uniq" command: a cron job fetches the "master" file with one of 
the batch web "browsers" and does the joining (see the sketch after this 
list). Updating the data is therefore simple.

6. If you know that you have absolutely no Asian users, make the spam 
filter react to the 8-bit character sequences typical of the Asian 
encodings (look at combinations typical of your Asian spammers), and you 
have a 100% success rate on those. I have not seen an Asian mail in my 
test box since February, when I remade my test filter to dump such 
letters, even though they make up 40% of the incoming spam on average.

7. If you do not accept that a certain spammer is blocked, you simply 
remove that entry from the list (OK, only on a per-site basis, for 
simplicity).

8. It is up to the taste of the MTA/UA designers whether they just throw 
the spam away or put it in a separate mailbox (IMAP/Maildir?) for visual 
inspection of both false positives and interesting spams, allow spam 
flagged with "ADV:" in the subject, or whatever. And it is up to us to 
choose the MTA/UA. Keep a time limit on stored spam, maybe nothing older 
than, say, 2 weeks; I run my test this way today. And there is no one for 
the spammers to sue (a tell-tale sign that they are being pushed against 
the rail), because my company/organization has the exclusive right to 
decide what our resources are used for, not the spammers. I decide how we 
use a retrieved spam list, not those who collected it.
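
To illustrate points 1 and 5, the extraction and the list joining together 
are only a handful of shell lines (a rough sketch: the patterns are naive, 
the file names and the master list URL are made up, and the "-o" flag 
assumes GNU grep):

  #!/bin/sh
  # update-list.sh: pull URLs and phone numbers out of the forwarded-spam
  # mailbox, fetch an external master list, and merge everything.
  SPAMBOX=/var/mail/spamtrap
  LOCAL=$HOME/spamlist.txt
  MASTER=http://example.org/master-spamlist.txt

  # point 1: extract host names and phone-number-looking strings
  grep -i -o 'http://[^/"> ]*' "$SPAMBOX" | tr 'A-Z' 'a-z' |
      sed 's,^http://,,' > /tmp/new.$$
  grep -o '[0-9][0-9 -]\{6,\}[0-9]' "$SPAMBOX" >> /tmp/new.$$

  # point 5: fetch the master and join it all with sort | uniq
  wget -q -O /tmp/master.$$ "$MASTER"
  sort "$LOCAL" /tmp/new.$$ /tmp/master.$$ | uniq > "$LOCAL.new" &&
      mv "$LOCAL.new" "$LOCAL"
  rm -f /tmp/new.$$ /tmp/master.$$

Run from cron once a day, and the list stays current with no manual work.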

The result should also be that as soon as a spammer's customer moves to a 
new phone number or URL to get away from a block, the new one is 
immediately tainted by the spammer's own actions. In the end this makes 
the process economically unviable: changing URLs or numbers (i.e. call 
center operators) for every spam run, particularly since the process does 
not care about sub-domains, which is the most common way of changing URLs 
today; some have 10-15 sub-domains to one main domain. Before I filtered 
those out, I had some 15,000 unique URLs.
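
Reducing a host name to its base pattern, by the way, is a one-step awk 
job (a sketch only; the list of second-level labels like "co" would need 
extending for real use):

  # www.shop.example.com -> example.com, x.something.co.tw -> something.co.tw
  awk -F. '{
      n = NF
      if (n >= 3 && length($n) == 2 && ($(n-1) == "co" || $(n-1) == "com"))
          print $(n-2) "." $(n-1) "." $n
      else if (n >= 2)
          print $(n-1) "." $n
      else
          print
  }'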

As I said before, the most visible threat I see is that the spammers all 
switch to base64-encoded mail, but then I tell my counterparts: "do not 
send me base64 mail, because I will dump it; use ftp or a web address for 
me to retrieve any attachment". So that is also a possible filter tactic. 
On the other hand, someone on this list with the knowledge and time might 
be able to make a simple and fast streaming filter that works far better 
than deview or metamail in a spam filter environment. Another threat is 
pure hooliganism, and there Bayesian filters are the only option.

Kurt Magnusson

_________________________________________________________________
Help STOP SPAM with the new MSN 8 and get 2 months FREE*  
http://join.msn.com/?page=features/junkmail

_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg