archie measurements

Mike Schwartz <schwartz@latour.cs.colorado.edu> Fri, 12 February 1993 14:20 UTC

Date: Thu, 11 Feb 1993 22:19:37 -0700
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Mike Schwartz <schwartz@latour.cs.colorado.edu>
Message-Id: <199302120519.AA01622@latour.cs.colorado.edu>
To: iafa@bunyip.com
Subject: archie measurements

As part of some research I am doing on wide area data access and
resource discovery, I and one of my students (Panos Tsirigotis) did some
analysis of about 9 months worth of archie access logs.  Alan Emtage
asked about our results today, and I sent him the following brief
summary.  I thought some others on this list might also find it
interesting, so I'm including it below.
 - Mike Schwartz
   Dept. of Computer Science
   Univ. of Colorado - Boulder
   schwartz@cs.colorado.edu
-------------------------------------------------------------------------------
Alan,

Here's a quick summary:

Our initial idea was to see if there were any records in the archie
database that are particularly heavily hit (and hence good candidates
for wider replication).  We found that every record was hit by some
search in the logs, with a factor of 10:1 between the frequencies of the
most/least common hits.  That's enough to make you think about
differential replication, but not really "exciting".  We then looked
more closely, and found that many of the matches to records were caused
by bogus searches - e.g., regular expressions that hit way too many
things.  I say they are bogus because if someone specifies "*.c" (for
example), they clearly aren't using archie as it is intended (as a
white pages directory).  Archie isn't really set up for yellow pages
searches, and specifying something like this is likely to yield tons of
poorly related files.  And, since most of the archie searches leave
the default result count limit in place, they'll only get the 85 or so
first matches - almost guaranteed to be useless.

I don't recall the exact numbers, but most REs seemed pretty useless.  I
think either people don't know how to write good REs, or that REs are
inherently fairly useless for this application.  (I personally always do
case insensitive substring searches.)

To distill the results some more, we set a "useless" threshold -
searches that match too many records are considered bad searches.  We
factored out replicas of files across sites (e.g., so that
"traceroute.c" occurring at 80 sites only counts as one match).  Setting
this threshold at 50 string-matches/search, we found that out of 1134673
strings, 947487 were not hit by any search under the threshold.  This
says that 84% of the
database is essentially useless.  The other 16% could quite fruitfully
be heavily replicated.
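
A minimal Python sketch of the thresholding step described above, for
illustration only.  It is not the code used in the study.  The function
name, the use of case-insensitive regular expressions for every search,
and the choice to treat a search as "useless" when it matches strictly
more than the threshold are all assumptions.

```python
import re
from collections import Counter

def unhit_fraction(strings, searches, threshold=50):
    """Return (hit_counts, fraction of distinct strings never hit).

    strings:   file names from the database, possibly repeated across sites
    searches:  search patterns taken from the logs (assumed to be regexes)
    threshold: searches matching more distinct strings than this are skipped
    """
    # Collapse replicas so a name held at many sites counts once.
    unique = sorted(set(strings))
    hits = Counter()
    for pattern in searches:
        rx = re.compile(pattern, re.IGNORECASE)
        matched = [s for s in unique if rx.search(s)]
        if len(matched) > threshold:
            continue  # assumed rule: too many matches means a "useless" search
        hits.update(matched)
    unhit = sum(1 for s in unique if hits[s] == 0)
    return hits, unhit / len(unique)
```

With the figures quoted above, 947487 / 1134673 is roughly 0.835, which
is the share reported as 84%.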

By the way, in doing this analysis we found that we had to tune our code
quite a bit.  Our initial attempt showed that processing the data using
algorithms like those found in the archie 2.X server code would have
taken about 9 months of elapsed time to process the entire set of logs
we had (I think it was about 9 months worth of logs - which says you
probably keep your servers running full out processing searches).  After
tuning our code and switching to a faster machine (old machine = Sun
4/280 = 8 MIPS; new machine = HP snake = 80 MIPS), we got it down to a
few days.  Archie avoids most of this work with its max hit count; once
you hit so many matches, you stop looking through the database.
However, if someone specifies a string that doesn't match anything, it
goes through the whole database, at great expense (e.g., many minutes of
CPU time on an HP snake).  Perhaps you can find a way to detect that
this is happening, and save your servers some cycles.
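
The cost asymmetry described above can be illustrated with a small Python
sketch.  It assumes a plain linear, case-insensitive substring scan over
an in-memory list, which is a simplification and not the archie server's
actual data structure; the function name and the default limit of 85
(taken from the "85 or so" figure earlier in this message) are likewise
assumptions.

```python
def scan(records, needle, max_hits=85):
    """Linear substring scan that stops once max_hits matches are found.

    Returns (matches, records_examined).
    """
    matches = []
    examined = 0
    needle = needle.lower()
    for rec in records:
        examined += 1
        if needle in rec.lower():
            matches.append(rec)
            if len(matches) >= max_hits:
                break  # hit limit reached: the rest of the data is skipped
    return matches, examined

if __name__ == "__main__":
    records = [f"file{i}.c" for i in range(100000)]
    print(scan(records, ".c")[1])          # stops early at the hit limit
    print(scan(records, "nosuchname")[1])  # examines every record
```

A broad search ends almost immediately because of the hit limit, while a
search that matches nothing has to examine every record.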

Here are a few other interesting statistics:
        - gifs are only 1.2% of all searches (although I did only a
          simplistic check; didn't look for sneaky REs).  I guess people
          know they can't find them in archie.  In contrast, my FTP
          measurement study (a trace-driven distributed caching simulation;
          a paper about it will be available soon) showed that 22% of all
          transfers (29% of duplicate transmissions) were gifs.  So, people
          can't find them with archie, but when they do find them they
          consume lots of Internet bandwidth with them.  Intuitively we
          knew that; this just quantifies it.
        - we didn't do careful measurements here, but it's pretty clear that
          FTP space is over-replicated.  For example, I just searched
          archie for traceroute, and got this:
                - found on 88 hosts in 82 domains in 17 countries
                - what was found was: executable (5 versions), manpage
                  (7 versions), standalone source file (6 versions),
                  compressed tar of source (20 versions).

          I find it interesting that the data space (FTP) is over-
          replicated, while the directory (archie) is under-replicated
          (as evidenced by the high variance on search response times).
          I think the future will depend increasingly on directory and
          decreasingly on end data.  So, we as a community need to get
          our wide area information replication act together.
 - Mike
