Re: [Asrg] Summary of junk button discussion

Alessandro Vesely <vesely@tana.it> Sat, 27 February 2010 11:47 UTC

Message-ID: <4B8906C7.9010709@tana.it>
Date: Sat, 27 Feb 2010 12:49:27 +0100
From: Alessandro Vesely <vesely@tana.it>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1
MIME-Version: 1.0
To: asrg@irtf.org
References: <20100225054546.16850.qmail@simone.iecc.com> <4B86172D.2080702@nortel.com> <4B86AD93.1050800@tana.it> <4B86DD80.8060508@nortel.com> <4B87BD07.9000502@tana.it> <4B88BA09.7050700@nortel.com>
In-Reply-To: <4B88BA09.7050700@nortel.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Asrg] Summary of junk button discussion

On 27/Feb/10 07:22, Chris Lewis wrote:
> On 2/26/2010 7:22 AM, Alessandro Vesely wrote:
>> [...]
>>> Heck, SpamAssassin even manages to tune Bayesian without having any end-user feedback at all.
>>
>> I never adventured into such esoteric settings. Are there howtos or any docs about it?
>
> I think it's called "Autolearn". I think it works by treating SA scores > <threshold> as "spam", and scores < <possibly a different threshold> as "ham", and tunes Bayesian from that. IOW: the existing SA rules refine Bayesian, and in the long term this allows Bayesian to cross-correlate across individual emails, and Bayesian score stuff that the SA rules don't necessarily even see.

I've found some explanation in 
http://spamassassinbook.packtpub.com/chapter9_preview.htm . My 
understanding is that auto-learn works against that gray area of 
uncertain cases. The book's author recommends against relying on that 
feature alone. He notes that "Once a false positive occurs the 
Bayesian database will begin to lose effectiveness, and future 
Bayesian results will be compromised."
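
To make the mechanism concrete, here is a minimal Python sketch of
threshold-based auto-learning as described above. It is not
SpamAssassin's actual implementation: the names TokenStore and
autolearn, and both threshold values, are assumptions chosen for
illustration. It shows that only confidently scored messages train
the token database, while the gray area in between is left alone.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed thresholds, for illustration only; a real site would tune them.
SPAM_THRESHOLD = 12.0   # rule score at or above this trains as spam
HAM_THRESHOLD = 0.1     # rule score at or below this trains as ham


@dataclass
class TokenStore:
    """Hypothetical per-token counters standing in for a Bayes database."""
    spam: Counter = field(default_factory=Counter)
    ham: Counter = field(default_factory=Counter)


def autolearn(store, rule_score, tokens):
    """Train only on confidently scored messages; return what was learned."""
    if rule_score >= SPAM_THRESHOLD:
        store.spam.update(set(tokens))  # count each token once per message
        return "spam"
    if rule_score <= HAM_THRESHOLD:
        store.ham.update(set(tokens))
        return "ham"
    return None  # gray area: nothing is learned without a human decision
```

Note that a message wrongly scored above the spam threshold would be
learned as spam with no one noticing, which is the failure mode the
book's author warns about.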

If we consider machine-learning for what it is, we must agree that 
current technology does not allow 'puters to understand human speech 
better than we do. Even though SA is able to classify a bunch of 
messages much better than an unmotivated human, it does not really 
/understand/ their contents. Therefore, it has to be trained by a 
(motivated) human, which implies interaction with users.

>> [...] in order to attach to junk buttons a meaning of "filter messages /like/ this" we would need to define what that means in rather unambiguous terms.
>
> No, you don't. That's up to the implementer of the report handler what it does.

I agree that it's up to abuse report consumers to state what they do.

A discussion about "generalized FBLs" will probably involve concerns 
about who is entitled to consume ARs, and may also consider whether 
consuming ARs implies any duty. Letting users know about any outcome, 
or letting them further modify any such outcome, are examples of 
possible duties. The fewer of them, the better.

For example, consider the last MX sending an AR to the 1st MX, as in 
the picture in http://wiki.asrg.sp.am/wiki/Abuse_Reporting . If the 
1st MX feeds the reported message to its Bayesian engine, then it 
should also allow forwarded users to check uncertain messages and 
correct any false positives. If it does not grant such interactivity, 
users may want to omit forwarding to it, in order to avoid losing mail.

Would it make sense to require that interactive filtering activity be 
limited to the last MX? Bandwidth-wise, it is counter-intuitive. It 
may be better to avoid this argument entirely.

> Just as it is with Bayes.
>
> Why are you treating this any different than spam/ham training in Bayes? It's no different.

Potentially, it is orders of magnitude better than Bayes.

>> I'd lean toward specifying just how to deliver abuse reports. Neither junk buttons nor their color should be mandated.
>
> Who is trying to specify buttons or their color?

I hope we won't. But it seems difficult to specify MUA's AR addresses 
without creating false expectations.

> I'm aiming for a specification that permits a single <user action> to communicate upstream for _both_ filtering and reporting purposes, where whether it's used for filtering or reporting or both in any given instance is up to the site admin and/or end-user.

+1, and I would welcome an efficient IMAP implementation in that 
sense. However, the spec should also allow just sending complaints.
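
As a rough illustration of that idea, here is a minimal Python sketch
in which one user action is routed to filtering, reporting, or both,
according to a policy set by the site admin or end-user. All names
(Use, handle_junk_action, the train and report callbacks) are
hypothetical and not taken from any existing MUA, IMAP extension, or
spec.

```python
from enum import Flag, auto


class Use(Flag):
    """Hypothetical policy flags saying what a junk action is used for."""
    FILTER = auto()
    REPORT = auto()


def handle_junk_action(message_id, policy, train, report):
    """Route one user action to whichever consumers the policy enables."""
    done = []
    if Use.FILTER in policy:
        train(message_id)    # e.g. feed the message to a local filter
        done.append("filter")
    if Use.REPORT in policy:
        report(message_id)   # e.g. send an abuse report upstream
        done.append("report")
    return done
```

With policy set to Use.REPORT alone, the same action just sends a
complaint and trains nothing.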