LDAP shall speak unto LDAP + first cut at best-first searching

Simon E Spero <ses@tipper.oit.unc.edu> Sat, 27 November 1993 23:24 UTC

Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa03270; 27 Nov 93 18:24 EST
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa03266; 27 Nov 93 18:24 EST
Received: from haig.cs.ucl.ac.uk by CNRI.Reston.VA.US id aa11799; 27 Nov 93 18:24 EST
Received: from bells.cs.ucl.ac.uk by haig.cs.ucl.ac.uk with local SMTP id <g.04372-0@haig.cs.ucl.ac.uk>; Sat, 27 Nov 1993 22:59:10 +0000
Received: from tipper.oit.unc.edu by bells.cs.ucl.ac.uk with Internet SMTP id <g.14327-0@bells.cs.ucl.ac.uk>; Sat, 27 Nov 1993 22:59:03 +0000
Received: from localhost.oit.unc.edu by tipper.oit.unc.edu (SMI4.1/FvK 1.02) id AA01189; Sat, 27 Nov 93 17:58:59 EST
Message-Id: <9311272258.AA01189@tipper.oit.unc.edu>
To: ldap@umich.edu, osi-ds@cs.ucl.ac.uk
Subject: LDAP shall speak unto LDAP + first cut at best-first searching
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 27 Nov 93 17:58:59 -0500
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Simon E Spero <ses@tipper.oit.unc.edu>

After some rather severe patching and tweaking on both sides, the umich 
server and my client are now speak yer actual ldap to each other. Thanks to
Tim for fixing things so quickly. 

Now that I have the protocol layer I've been working on a general
search harness I can plug various strategies into. As I've been
filling in the blanks, I've come upon what seems to be a nice search
strategy based on best first search for solving the soft pages problem
(name->url mapping). Given a few assumptions it should return
near-optimal results. I'm still a bit fuzzy as to how long the search
will take, especially when the object being searched for does not exist.
There may a slight problem in the handling of servers which poll themselves
which I can't quite spot at the moment. I need more sleep.

Simon

First cut at best-first searching.


Definitions:

Softpages problem 	The task of locating the most appropriate copy
			of an object (images, software, etc).

Centroid		A centroid is an object that represents a collection of
			other objects. In this case we shall use the term to
			describe a record containing a list of the 
			attribute-value pairs present on a directory server.


Topological hierachy	In the modified [C]LDAP a server can be a member of 
			serveral different hiearchies. The topological 
			hierachy is contructed by placing links between 
			servers which are at opposite ends of networked
			connections. This may or may not be precisely the
			same as the underlying physical network, but is 
			intenged to be a fairly close approximation. 

  			 Sites with multiple connections may have multiple 
			parents.

			 In the topological hiearchy, parent-child relation-
			ships are determined by which server polls the other.
			Since connectivity is symetric, two servers may poll
			each other without Oedipal complexities.

Cost(h1,h2,k)		The time taken to initiate a transfer and recieve 'k'
			bytes from 'h1' to 'h2'

---------------

Alogrithm:

part 1 - centroid weightings.

Each term in a centroid is given a weight. For the topological hiearchy, we
shall define this weight to be a measure of the Cost of accessing items 
referenced by that term from this server. If there are several different
costs associated with a term, the weighting used should be the lowest
value. The pollee should also identify the topological parents which it
does not poll [This isn't quite right - ses]

When a poller retrieves a topological centroid and adds it to its database
of forwarding information it should increment the weights by the Cost of
accessing the polled server. Intuitively this value is valid because we are
measuring the cost of accessing the document via the network link 
corresponding to this topological link.

The polled server should maintain an upwards link weighted by the cost of 
accessing the poller.

part 2 - search strategy

We shall assume that the server returns a list of either records or 
referrals, each labelled with its cost. The client should increment the
costs returned by the cost of accessing the server.

Client-Side
-----------
let Q be a queue of records and servers, ordered in ascending order of cost.

Initialise R, the set of already queried servers, to NIL
Initialise Q to the default server

While Q is not empty
	let item = head(Q)
	
	if item is a record, 
		return the record
	else
		remove item from Q 
		add item to R
		add the results of querying 'item' to Q, removing
			all items already in R.
	fi
endwhile
Return not-found
--------
server side.
Return all matches to the query, together with the server's topological 
parents.

-----------

Although the algorithm  is fairly simple to understand when a match exists,
the case where the resource does not exist is slightly more complicated.
In this case, the queue will only contain servers topological parents. 
It might seem that this would lead to an explosive fan-out; however,
centroids come to the rescue here, restricting the search to only 
upwards links. These will eventually converge upon a backbone, and die off
pretty rapidly (we can consider a backbone to  poll all its links - much
of the internal topology could be collapsed into a large cloud).

At least, that's what I think will happen - I'll try and borrow a graph
theorist on Monday and use a real brain. If my analysis is wrong (and 
I think I might be handwaving over something important), the algorithm can
be saved by adding a time/number of queries limit, and then failing over
to a simple direct lookup, which may not yield the best result, but 
will at least give a result.