archie measurements

Mike Schwartz <> Fri, 12 February 1993 14:20 UTC

Date: Thu, 11 Feb 1993 22:19:37 -0700
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Mike Schwartz <>
Message-Id: <>
Subject: archie measurements

As part of some research I am doing on wide area data access and
resource discovery, one of my students (Panos Tsirigotis) and I did
some analysis of about 9 months worth of archie access logs.  Alan
Emtage asked about our results today, and I sent him the following
brief summary.  I thought some others on this list might also find it
interesting, so I'm including it below.
 - Mike Schwartz
   Dept. of Computer Science
   Univ. of Colorado - Boulder

Here's a quick summary:

Our initial idea was to see if there were any records in the archie
database that are particularly heavily hit (and hence good candidates
for wider replication).  We found that every record was hit by some
search in the logs, with a factor of 10:1 between the frequencies of the
most/least common hits.  That's enough to make you think about
differential replication, but not really "exciting".  We then looked
more closely, and found that many of the matches to records were caused
by bogus searches - e.g., regular expressions that hit way too many
things.  I say they are bogus because if someone specifies "*.c" (for
example), they clearly aren't using archie as it is intended (as a
white pages directory).  Archie isn't really set up for yellow pages
searches, and specifying something like this is likely to yield tons of
poorly related files.  And, since most archie searches leave the
default result count limit in place, they'll only get the first 85 or
so matches - almost guaranteed to be useless.

I don't recall the exact numbers, but most REs seemed pretty useless.  I
think either people don't know how to write good REs, or that REs are
inherently fairly useless for this application.  (I personally always do
case insensitive substring searches.)
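To make the contrast concrete, here is a minimal Python sketch of the two
search styles against a toy index (the filenames and the helper names are
my own hypothetical examples, not archie's actual code).  The regex
search is capped at 85 results the way archie's default limit is:

```python
import re

# Toy archie-style index of filenames (hypothetical examples).
index = ["traceroute.c", "Traceroute.tar.Z", "xv.c", "README",
         "ping.c", "gif2ppm.c", "archie-client.c"]

def substring_search(term, names):
    """Case-insensitive substring match - the kind of search I use."""
    term = term.lower()
    return [n for n in names if term in n.lower()]

def regex_search(pattern, names, limit=85):
    """Regex search, capped like archie's default result count limit."""
    rx = re.compile(pattern)
    hits = []
    for n in names:
        if rx.search(n):
            hits.append(n)
            if len(hits) >= limit:
                break  # caller only ever sees the first `limit` matches
    return hits

print(substring_search("traceroute", index))  # two closely related hits
print(regex_search(r"\.c$", index))           # ".c" hits most of the index
```

A specific term like "traceroute" pulls in only closely related files,
while a pattern like ".c" matches most of the index, so whatever survives
the result cap is nearly arbitrary.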

To distill the results some more, we set a "useless" threshold -
searches that match too many records are considered bad searches.  We
factored out replicas of files across sites (e.g., so that
"traceroute.c" occurring in 80 sites only counts as one match).  Setting
this threshold at 50 string-matches/search, we found that out of 1134673
strings, 947487 were not hit by any search below the threshold.  This
says that 84% of the
database is essentially useless.  The other 16% could quite fruitfully
be heavily replicated.
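The pipeline we used amounts to the following Python sketch (the log
entries, site names, and the threshold of 1 are made up for the toy data;
the real analysis used a threshold of 50 string-matches/search):

```python
from collections import Counter

# Hypothetical (search string, matched filename, site) hit log.
log = [("traceroute", "traceroute.c", "site%02d" % i) for i in range(80)]
log += [("xv", "xv.c", "siteA"),
        (".c", "traceroute.c", "siteB"),   # an over-broad "bogus" search
        (".c", "xv.c", "siteB"),
        (".c", "lonely.c", "siteB")]

THRESHOLD = 1  # real analysis: 50; 1 keeps the toy data interesting

# Factor out replicas: a filename matched at many sites counts once
# per search (the 80 copies of traceroute.c collapse to one match).
per_search = {}
for search, fname, _site in log:
    per_search.setdefault(search, set()).add(fname)

# Drop searches over the threshold, then count hits on each string.
hits = Counter()
for search, fnames in per_search.items():
    if len(fnames) <= THRESHOLD:
        hits.update(fnames)

never_hit = {f for _, f, _ in log} - set(hits)
print(hits, never_hit)  # lonely.c is only hit by the bogus search
```

Strings that end up in `never_hit` are the part of the database no
reasonable search touches - 84% of it, by our numbers.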

By the way, in doing this analysis we found that we had to tune our code
quite a bit.  Our initial attempt showed that processing the data using
algorithms like those found in the archie 2.X server code would have
taken about 9 months of elapsed time to process the entire set of logs
we had (I think it was about 9 months worth of logs - which says you
probably keep your servers running full out processing searches).  After
tuning our code and switching to a faster machine (old machine = Sun
4/280 = 8 MIPS; new machine = HP snake = 80 MIPS), we got it down to a
few days.  Archie avoids most of this work with its max hit count; once
you hit so many matches, you stop looking through the database.
However, if someone specifies a string that doesn't match anything, it
goes through the whole database, at great expense (e.g., many minutes of
CPU time on an HP snake).  Perhaps you can find a way to detect that
this is happening, and save your servers some cycles.
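The max-hit-count behavior, and the no-match worst case, can be sketched
in a few lines of Python (a toy model under my own assumptions, not the
actual server code):

```python
def capped_search(match, records, max_hits=85):
    """Scan records, but stop once max_hits matches are found - the way
    archie's default result limit bounds the work for common strings."""
    hits, scanned = [], 0
    for rec in records:
        scanned += 1
        if match(rec):
            hits.append(rec)
            if len(hits) >= max_hits:
                break  # common searches never touch most of the database
    return hits, scanned

records = ["file%06d.c" % i for i in range(100000)]

# A string that matches everything stops after 85 records...
_, n_common = capped_search(lambda r: r.endswith(".c"), records)

# ...but a string that matches nothing scans all 100000.
_, n_miss = capped_search(lambda r: "no-such-file" in r, records)
print(n_common, n_miss)
```

The cap helps exactly when matches are plentiful; a search with zero
matches still pays for a full scan, which is the case worth detecting.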

Here are a few other interesting statistics:
        - gifs are only 1.2% of all searches (although I only did a
          simplistic check; I didn't look for sneaky REs).  I guess
          people know they can't find them in archie.  In contrast, my
          FTP measurement study (a paper about a trace-driven
          distributed caching simulation, which will be available soon)
          showed that 22% of all transfers (29% of duplicate
          transmissions) were gifs.  So, people can't find them with
          archie, but when they do find them they consume lots of
          Internet bandwidth with them.  Intuitively we knew that; this
          just quantifies it.
        - we didn't do careful measurements here, but it's pretty clear that
          FTP space is over replicated.  For example, I just searched
          archie for traceroute, and got this:
                - found on 88 hosts in 82 domains in 17 countries
                - what was found was: executable (5 versions), manpage
                  (7 versions), standalone source file (6 versions),
                  compressed tar of source (20 versions).

          I find it interesting that the data space (FTP) is over
          replicated, while the directory (archie) is under replicated
          (as evidenced by the high variance on search response times).
          I think the future will depend increasingly on directory and
          decreasingly on end data.  So, we as a community need to get
          our wide area information replication act together.
 - Mike