[Asrg] 2.a.1 Analysis of Actual Spam Data - next steps
"Peter Kay" <peter@titankey.com> Wed, 20 August 2003 17:31 UTC
Received: from optimus.ietf.org (ietf.org [132.151.1.19] (may be forged)) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA21304 for <asrg-archive@odin.ietf.org>; Wed, 20 Aug 2003 13:31:32 -0400 (EDT)
Received: from localhost.localdomain ([127.0.0.1] helo=www1.ietf.org) by optimus.ietf.org with esmtp (Exim 4.20) id 19pWnb-0008F8-1C for asrg-archive@odin.ietf.org; Wed, 20 Aug 2003 13:31:08 -0400
Received: (from exim@localhost) by www1.ietf.org (8.12.8/8.12.8/Submit) id h7KHV7v7031680 for asrg-archive@odin.ietf.org; Wed, 20 Aug 2003 13:31:07 -0400
Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by optimus.ietf.org with esmtp (Exim 4.20) id 19pWna-0008Et-UK for asrg-web-archive@optimus.ietf.org; Wed, 20 Aug 2003 13:31:06 -0400
Received: from ietf-mx (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA21282; Wed, 20 Aug 2003 13:31:00 -0400 (EDT)
Received: from ietf-mx ([132.151.6.1]) by ietf-mx with esmtp (Exim 4.12) id 19pWnY-00044T-00; Wed, 20 Aug 2003 13:31:04 -0400
Received: from ietf.org ([132.151.1.19] helo=optimus.ietf.org) by ietf-mx with esmtp (Exim 4.12) id 19pWnY-00044Q-00; Wed, 20 Aug 2003 13:31:04 -0400
Received: from localhost.localdomain ([127.0.0.1] helo=www1.ietf.org) by optimus.ietf.org with esmtp (Exim 4.20) id 19pWmX-0008AD-Bq; Wed, 20 Aug 2003 13:30:01 -0400
Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by optimus.ietf.org with esmtp (Exim 4.20) id 19pWlu-00089O-T6 for asrg@optimus.ietf.org; Wed, 20 Aug 2003 13:29:22 -0400
Received: from ietf-mx (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA21192 for <asrg@ietf.org>; Wed, 20 Aug 2003 13:29:16 -0400 (EDT)
Received: from ietf-mx ([132.151.6.1]) by ietf-mx with esmtp (Exim 4.12) id 19pWls-000438-00 for asrg@ietf.org; Wed, 20 Aug 2003 13:29:20 -0400
Received: from imail.centuryc.net ([216.30.168.20]) by ietf-mx with esmtp (Exim 4.12) id 19pWlr-00042y-00 for asrg@ietf.org; Wed, 20 Aug 2003 13:29:19 -0400
Received: from cybercominc.com [66.91.134.126] by imail.centuryc.net (SMTPD32-8.00) id A02EA100DA; Wed, 20 Aug 2003 07:30:22 -1000
Received: from a66b91n134client123.hawaii.rr.com (66.91.134.123) by cybercominc-zzt with SMTP; Wed, 20 Aug 2003 17:36:23 GMT
X-Titankey-e_id: <650110b2-42b1-42d9-a308-27c825c225f8>
content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
X-MimeOLE: Produced By Microsoft Exchange V6.0.6249.0
Message-ID: <DD198B5D07F04347B7266A3F35C42B0B0D94FD@io.cybercom.local>
Thread-Topic: 2.a.1 Analysis of Actual Spam Data - next steps
Thread-Index: AcNnLdZ2gtjlrIv2SwuHO/2+bQxGGQADDTkg
From: Peter Kay <peter@titankey.com>
To: asrg@ietf.org
Subject: [Asrg] 2.a.1 Analysis of Actual Spam Data - next steps
Sender: asrg-admin@ietf.org
Errors-To: asrg-admin@ietf.org
X-BeenThere: asrg@ietf.org
X-Mailman-Version: 2.0.12
Precedence: bulk
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=unsubscribe>
List-Id: Anti-Spam Research Group - IRTF <asrg.ietf.org>
List-Post: <mailto:asrg@ietf.org>
List-Help: <mailto:asrg-request@ietf.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/asrg>, <mailto:asrg-request@ietf.org?subject=subscribe>
List-Archive: <https://www1.ietf.org/mail-archive/working-groups/asrg/>
Date: Wed, 20 Aug 2003 07:30:54 -1000
OK, gang, another summary here. So far, one person has volunteered to assist w/ the project in terms of "owning" an email address. Mahalo Nui to Selby Hatch for standing up. Terry and gang have done an admirable job of setting up the outline of a plan, which I've edited/summarized below. Very much appreciated.

So what we have now is 1 email owner volunteer and a plan that calls for 60 email addresses, possibly spread out across N domains. One call to use 60 addresses per domain potentially gives us hundreds of addresses to deal with. And if we're going to do a Terry-resilient test, we will need to treat these email addresses as normally as possible, "normal" meaning that we open them, click on them, etc.

Unless we get more people to agree to assist, we can't do this experiment. Last chance to stand up and help. If you're going to do this, do it now. The 8/21 deadline (for this project to gather up a sufficiently-sized group) is fast approaching, and 1 volunteer ain't gonna cut it.

Peter

============== Experimental Design ================

550 Response Experiment

What we're trying to determine is: will "hard bounce" handling of spam (such as 550 no such user) reduce the amount of "spam attacks" to a given email address over time, versus an email address that does not employ such tactics?

If you're confident of your stat background, skip this first paragraph and go straight to the bullet points. If not, the background material in this paragraph provides context for them.

Viewed through the lens of inferential statistics, there are only two kinds of variance in the world: explained/unexplained, aka between-group/within-group, or systematic/random. Robust experimental design exerts as much control as possible over everything BUT the independent variable, which is the only thing that's allowed to vary systematically between the groups. Everything else that varies, however slightly, between the two conditions hurts the chances of meaningful results.
(If the extraneous variance is systematic, then the design is confounded, and the results--whatever they are--are meaningless; if the extraneous variance is random, then statistical power is compromised.)

- Ensure *crisp* separation of the independent variables. If the analytical goal is to study the effects of 550s, then have that be the *only* source of systematic variance. DO NOT "dilute" your systematic variance by confounding it with other variables (visibility, phase of the moon, eye color, etc.).

- (Under the heading of "Hey doc, it hurts when I do this...") If daily spam volume is too noisy (and the DATA, not the statistician, "say" that it is), then pick a dependent measure that's more naturally noise-resistant (say, monthly spam volume, or even quarterly, if need be). Reliability of initial measurement is always preferable to _post hoc_ "noise reduction."

- Studiously ensure and maintain homogeneity of the experimental conditions throughout the course of the experiment.

Mechanics:

- Create an absolute minimum of 60 *pairs* of email addresses. (The "magic number 30" assumes data to be noise-free. Statistical power is a function of the number of "subjects," not the number of measurements.) The use of "otherwise identical" *pairs* of addresses allows a little more statistical power to be squeezed from the data at analysis time. If the one-TLD experiment uses 60 pairs of addresses, then a multi-TLD experiment must use 60 pairs for each TLD. Simple as that. Going to even a small number of TLDs (e.g. 3 TLDs) while keeping the original number of addresses, as you suggest, is going to be a disaster if the TLD does have some effect, as it reduces by a factor of three the amount of data that can tell you about the effects of the 550 responses where they are the only independent variable. It would be helpful not to restrict the TLDs to those where English is the prime language, as in the three you list. Maybe use .com, .uk, .fr, .de (plus .org and .net maybe).
There are four potential gains to using several TLDs, provided that enough data is collected to make a valid experiment within each individual TLD. First, we can see whether the 550 method has different effects in different domains; second, we can get some idea of the effect of TLD on spam volume (anecdotal evidence conflicts here, and I've seen no solid numbers); third, if the TLD does in fact make no difference, we have several times as much data to work with; fourth, if the 550 response does indeed have an effect, we will be able to see if part of that effect is a reduction or increase in the unexplained variance.

- Randomly assign each address in a pair to an experimental condition; the addresses in the experimental group never (repeat: never) do anything but throw 550s; addresses in the control group "take anything."

  * Cautionary aside: If it were me, I would zealously protect from the general public any knowledge of which addresses were in which experimental group. As proof against experimenter mortality, I'd ensure that 3 different people knew which-was-which, so that the study could continue if I got hit by a bus. But I'd also limit that knowledge to *just* those 3 folks. (In experimental-design jargon, the study is referred to as "blind.")

- In a perfect universe, all addresses in both groups are served from one and only one mail server. That way, "server status" affects all address pairs in both groups identically.

- Insofar as possible, ensure that each address within a pair achieves/receives "identical" visibility. Each control address within a pair should "shadow" its experimental counterpart as precisely as possible.

  * If one address signs up for a list, posts to a newsgroup, appears on a Web page, or whatever, the other one should do it too, on the very same day. Perhaps an acceptable automated approach would be to publish the addresses on the same newsgroups on the same day, and also publish the email addresses on the same Web site.
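The pair-wise random assignment and restricted "blind" key described above could be sketched as follows (a minimal Python sketch; the address names and seed are hypothetical placeholders, not part of the actual plan):

```python
import random

def assign_pairs(n_pairs, seed):
    """Randomly assign one address of each pair to the 550-throwing
    experimental group and the other to the take-anything control group.
    Address names here are hypothetical placeholders."""
    rng = random.Random(seed)
    key = {}
    for i in range(n_pairs):
        a = f"pair{i:02d}-a@example.com"
        b = f"pair{i:02d}-b@example.com"
        # Coin flip decides which member of the pair throws 550s
        if rng.random() < 0.5:
            key[i] = {"experimental": a, "control": b}
        else:
            key[i] = {"experimental": b, "control": a}
    return key

# 60 pairs, as the one-TLD design calls for; the seed (or the returned
# key itself) is the which-was-which secret shared by only the three
# designated people.
key = assign_pairs(60, seed=20030820)
```

Keeping the seed or key restricted to those three people is what makes the study "blind": everyone else handles the addresses without knowing their group membership.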
- While waiting for time to pass, order a copy of Kanji, G. (1999). One hundred statistical tests. ISBN: 0-7619-6151-8. (This is a very handy "cookbook" that contains raw-score formulae for just about every inferential statistical test there is.)

- At the end of the experiment, pull the pairs apart and compute a regression equation for each experimental group.

- Some folks may recall me saying that the slope of the regression line is not intrinsically informative. (And it isn't; dispersion, not slope, expresses the degree of relatedness between regression variables.) However, the *difference between two slopes* of otherwise "identical" conditions can be informative.

  * If the beta weight of the regression line for the control group is smaller (even slightly) than that of the experimental group, stop. Fail to reject the null hypothesis and move on to something else.

  * Differences between the slopes can be compared via t-test. If that difference-in-slopes doesn't make at least 0.01 (TWO-tailed), stop. Fail to reject the null hypothesis and move on to something else. (Remember, getting "doubles" when throwing a pair of dice is "statistically significant" at p=0.05.)

- Having determined the "direction" of the effect, the magnitude of the effect can be estimated via paired t-test. Again, the goal is 0.01 or bust (though 1-tailed 0.01 is now "within reach").

_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg
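The slope-comparison step described above can be sketched in Python. This is one common form of the two-slope t-test (difference of slopes over the root sum of squared standard errors, df = n1 + n2 - 4), not necessarily the exact formula the author had in mind, and the spam counts below are invented:

```python
import math

def slope_and_se(x, y):
    """OLS slope and its standard error for one group's
    time-vs-spam-volume regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                 # residual variance
    return b, math.sqrt(s2 / sxx)

def slope_difference_t(x1, y1, x2, y2):
    """t statistic for H0: the two regression slopes are equal,
    with df = n1 + n2 - 4."""
    b1, se1 = slope_and_se(x1, y1)
    b2, se2 = slope_and_se(x2, y2)
    t = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return t, len(x1) + len(x2) - 4

# Invented monthly spam counts for one control/experimental pairing
months = [1, 2, 3, 4, 5, 6]
control = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
experimental = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
t, df = slope_difference_t(months, control, months, experimental)
```

Per the rule above, one would stop immediately if t came out negative (control slope smaller than experimental), and otherwise look up t against the two-tailed 0.01 critical value for the computed df.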