Research into distribution of names

Paul Barker <P.Barker@cs.ucl.ac.uk> Tue, 29 November 1994 20:22 UTC

To: quipu@cs.ucl.ac.uk, osi-ds@cs.ucl.ac.uk
Subject: Research into distribution of names
Date: Tue, 29 Nov 1994 17:48:44 +0000
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Paul Barker <P.Barker@cs.ucl.ac.uk>
Message-ID: <9411291522.aa14201@CNRI.Reston.VA.US>

Greetings,

  If you are actively using a Quipu X.500 system, I wonder whether you can 
do me a favour.  I am doing some research into querying algorithms for 
white pages directory services.  I want to learn something about 
the distribution of names in a white pages directory.  I have written 
a simple script which produces output something like that appended.  
It tells me how many surnames occur once, twice, etc. in the 
database.  It also tells me the proportion of entries whose surname occurs 
no more than 23 times - i.e. all the entries for that surname would fit on 
a typical screen.
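
As a small illustration of the counting described above, here is a toy sketch that takes surnames one per line on standard input rather than extracting them from EDB files.  The function name surname_histogram is hypothetical, and this is a compact restatement of the idea, not the script included in this message.  Each output row gives the number of occurrences, the number of distinct surnames with that many occurrences, and the percentage of all entries those surnames account for.

```shell
# Toy sketch (assumed helper, not part of the original script):
# surnames arrive one per line on stdin.
surname_histogram() {
    sort | uniq -c | awk '
    {
        names[$1]++            # names[n]: distinct surnames occurring n times
        people += $1           # running total of entries
        if ($1 > max) max = $1
    }
    END {
        for (n = 1; n <= max; n++)
            if (n in names)
                printf("%4d %5d %6.2f\n", n, names[n], names[n] * n * 100 / people)
    }'
}

# Seven entries: smith x3, jones x2, brown x1, lee x1
printf '%s\n' smith smith smith jones jones brown lee | surname_histogram
```

With these seven entries, two surnames occur once (2 of 7 entries), one occurs twice (2 of 7) and one occurs three times (3 of 7), so the rows should show roughly 28.57, 28.57 and 42.86 percent.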

To be able to help, you need to be using Quipu without the TURBO database -
i.e. you must be using EDBs.  Of course it is possible to convert between
formats but I don't want to ask people to go to too much trouble ...

To use the script you must cd to the root of your organisation database.
Then run the script and send me the output.

Script and sample output follow.

Thanks for any help you can offer.  I'll summarise findings to the lists as
soon as possible if I get enough data.

Paul
===============script=======================
#!/bin/sh

find . -name EDB -exec egrep -i "^surname=|^sn=" {} \; |
sed ' s/^sn=//
      s/^surname=//
      s/\&\\//
      s/^[ ]*//
      s/[ ]*$//' |
tr 'A-Z' 'a-z' |
sort | uniq -c |sort -n > /tmp/surnames
awk ' BEGIN {
	cnt = 0
	lastno = 0
	totpeople = 0
}

{
	if ($1 == lastno)
		cnt++
	else
	{
		if (lastno != 0)
			tots[lastno] = cnt
		cnt = 1
	}
	totpeople += $1
	lastno = $1
}
END {
	tots[lastno] = cnt
	for (i = 1; i <= lastno; i++)
		if (tots[i] != 0)
			printf("%4d %5d %6.2f\n", i, tots[i], tots[i] * i * 100 / totpeople)
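	# Entries whose surname occurs at most 23 times (23 itself is included,
	# although the label printed below says "< 23").
	tot23 = 0
	for (i = 1; i <= 23; i++)
		tot23 += tots[i] * i
	printf("< 23 names   %6.2f\n", tot23 * 100 / totpeople)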
	printf("org size     %6d\n", totpeople)
} ' /tmp/surnames
rm /tmp/surnames

-----------------sample output----------------------
   1  6093  45.12
   2  1001  14.83
   3   313   6.95
   4   152   4.50
   5   123   4.55
   6    52   2.31
   7    45   2.33
   8    23   1.36
   9    17   1.13
  10    28   2.07
  11    15   1.22
  12    12   1.07
  13    10   0.96
  14     8   0.83
  15     5   0.56
  16     2   0.24
  17     5   0.63
  18     4   0.53
  19     3   0.42
  20     2   0.30
  21     4   0.62
  23     5   0.85
  24     2   0.36
  26     4   0.77
  28     1   0.21
  29     2   0.43
  30     3   0.67
  31     2   0.46
  32     2   0.47
  36     1   0.27
  38     1   0.28
  41     1   0.30
  47     1   0.35
  51     1   0.38
  53     1   0.39
  82     1   0.61
  90     1   0.67
< 23 names    93.39
org size      13504
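
The two summary lines can be cross-checked from the rows alone, since each row contributes (occurrences x surnames) entries.  The following is a minimal sketch under that assumption; the function name check_summary and its optional limit argument (the number of entries assumed to fit on one screen, 23 by default) are hypothetical and do not appear in the script.

```shell
# Hypothetical helper: recompute the summary from histogram rows
# ("occurrences  surnames  percent") read on stdin.
# Prints: <total entries> <percent of entries with surname count <= limit>
check_summary() {
    awk -v limit="${1:-23}" '
    # Only use the three-column numeric rows; the two summary lines are skipped.
    NF == 3 && $1 ~ /^[0-9]+$/ {
        people += $1 * $2
        if ($1 <= limit) fits += $1 * $2
    }
    END {
        if (people > 0)
            printf("%d %.2f\n", people, fits * 100 / people)
    }'
}
```

For the sample output above, the rows sum to 13504 entries, of which 12612 belong to surnames occurring 23 times or fewer, which agrees with the 93.39 and 13504 reported.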

------- End of Forwarded Message
