Re: ... Quipu logfiles, Garbage Collection, and Scaling Change Propagation
Ed Reed <Ed.Reed@cinops.xerox.com> Sun, 14 November 1993 20:31 UTC
Received: from ietf.nri.reston.va.us by IETF.CNRI.Reston.VA.US id aa18266;
14 Nov 93 15:31 EST
Received: from CNRI.RESTON.VA.US by IETF.CNRI.Reston.VA.US id aa18262;
14 Nov 93 15:31 EST
Received: from haig.cs.ucl.ac.uk by CNRI.Reston.VA.US id aa15444;
14 Nov 93 15:31 EST
Received: from bells.cs.ucl.ac.uk by haig.cs.ucl.ac.uk with local SMTP
id <g.03608-0@haig.cs.ucl.ac.uk>; Sun, 14 Nov 1993 19:54:52 +0000
Received: from alpha.Xerox.COM by bells.cs.ucl.ac.uk with Internet SMTP
id <g.09439-0@bells.cs.ucl.ac.uk>; Sun, 14 Nov 1993 19:54:33 +0000
Received: from slap.cinops.xerox.com ([13.180.0.107]) by alpha.xerox.com
with SMTP id <12582(1)>; Sun, 14 Nov 1993 11:54:07 PST
Received: from cinops.xerox.com by slap.cinops.xerox.com
id <29214-0@slap.cinops.xerox.com>; Sun, 14 Nov 1993 14:54:01 -0500
To: wright@lbl.gov, Roland.Hedberg@rc.tudelft.nl
Subject: Re: ... Quipu logfiles, Garbage Collection, and Scaling Change
Propagation
Cc: osi-ds@cs.ucl.ac.uk
Date: Sun, 14 Nov 1993 11:54:01 PST
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Ed Reed <Ed.Reed@cinops.xerox.com>
X-Orig-Sender: Ed.Reed@cinops.xerox.com
Message-Id: <93Nov14.115407pst.12582(1)@alpha.xerox.com>
<statistics for DSAs that are NEVER up deleted>

It's gratifying to see reaffirmation that synchronizing DSA databases is only PART of the data integrity problem - synchronizing to REALITY is also important. Over a period of time, registered services, hosts, and other network objects (people included) fall into disuse, because names change or because they're simply decommissioned (or made redundant). How many of the 'directory synchronization' schemes fail to deal with deletions, extended periods of inactivity, etc.? Do they include a flag to mark objects 'suspect' or 'known to be obsolete'? Are they ever garbage collected?

Xerox's Clearinghouse (internal network) regularly has more DSAs registered than will actually respond - that's not a problem. What's important from a statistical perspective is how many countries, organizations, or org units that SHOULD be reachable ARE reachable - whether via the primary or one of however many secondaries. As long as information in the DIT is available, I really don't care what the response time of a particular DSA is - shouldn't client software (including chaining DSA and LDAP services) FIND a server supporting the portion of the DIT I'm asking about? If so, then my concern about unreachable DSAs turns into a response-time issue, not an availability issue.

It seems to me that a well-behaved directory client needs to keep track of the directory information it discovers, caching information about servers which are reported to support the domains of interest, and remembering which of those are 'alive' and responding to queries. When a server which had been responding stops responding, or becomes much slower (say, 3 sigma from the previous mean response time, or even 2 sigma - a tunable parameter, of course), then it should begin sending queries to other servers supporting that domain instead, based on what it knows about their availability.
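A minimal sketch of that client-side bookkeeping, in Python for illustration only - the class and method names are invented, and real clients would also persist the cache and age out stale entries:

```python
import statistics

class ServerTracker:
    """Hypothetical client-side cache of server liveness and speed.

    Records response times per server, flags a server as degraded when
    a new response is more than sigma_threshold standard deviations
    above its historical mean, and prefers live, fast servers.
    """

    def __init__(self, sigma_threshold=3.0):
        self.sigma_threshold = sigma_threshold  # tunable, as noted above
        self.samples = {}   # server -> list of response times (seconds)
        self.alive = {}     # server -> bool

    def record(self, server, response_time):
        self.samples.setdefault(server, []).append(response_time)
        self.alive[server] = True

    def record_failure(self, server):
        self.alive[server] = False

    def is_degraded(self, server, latest):
        """True if `latest` is far outside this server's history."""
        history = self.samples.get(server, [])
        if len(history) < 2:
            return False  # not enough data for a mean/sigma estimate
        mean = statistics.mean(history)
        sigma = statistics.stdev(history)
        return latest > mean + self.sigma_threshold * sigma

    def pick(self, servers):
        """Choose a live server with the lowest mean response time.

        Servers we have never timed get 0.0, so untried servers are
        probed first; returns None if nothing is known to be alive.
        """
        live = [s for s in servers if self.alive.get(s, True)]
        if not live:
            return None

        def mean_rt(s):
            hist = self.samples.get(s)
            return statistics.mean(hist) if hist else 0.0

        return min(live, key=mean_rt)
```

The key design point is that failover is driven by two signals, not one: outright failures flip the liveness flag, while the sigma test catches servers that are technically up but have become too slow to be useful.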
This is basic load sharing, and is key to providing clients with reasonable levels of service through redundant (and topologically diverse) servers.

Certainly, the DIT needs to be pruned of services which are no longer supported and can never be expected to reply. I'd suggest that a network management function should, as is being done now, check on the 'upness' of DSAs and what portions of the tree they support, and track over time the date they were last seen to reply; each tree-branch owner would then be expected to take action to keep their part of the tree clean of dead leaves and branches.

The details of how this is all done become much more important when bidirectional database replication/synchronization (what Xerox calls anti-entropy) is used instead of the Master/Slave scheme that is common. But there is another lesson we learned early on which may be appropriate as EDB sizes grow: a more network-efficient protocol is needed to propagate changes. Xerox uses two means of distributing changes - (1) store-and-forward messaging (i.e., mail), and (2) a connection-oriented RPC protocol in which a DSA contacts another DSA supporting the same EDB, and then the two DSAs list the distinguished names of the objects contained in their copies of the EDB in question, along with the timestamps and checksums for each object, allowing each DSA in turn to query the other for full object dumps of just the bits it's missing. Much more network-bandwidth friendly. Of course, in a peer-to-peer environment, the process is repeated in each direction, so both sides can pull the information they are missing from the other side. A final third listing of DNs, checksums, and timestamps then proves the two EDB copies are in sync. But this synchronization process is performed only nightly, so mail updates carrying the new bits of changed objects are mailed to each of the DSAs registered as holding a copy of the EDB in question.
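The reconciliation step can be sketched roughly as follows - Python used purely for illustration (the real Clearinghouse exchange was an RPC protocol, and the EDB layout, function names, and checksum choice here are all invented):

```python
import hashlib

# Hypothetical EDB copy: distinguished name -> (timestamp, attributes)

def summarize(edb):
    """List each object's DN with its timestamp and a checksum.

    This is the compact listing the two DSAs trade instead of
    shipping whole objects across the network.
    """
    return {dn: (ts, hashlib.md5(repr(sorted(attrs.items())).encode()).hexdigest())
            for dn, (ts, attrs) in edb.items()}

def entries_to_pull(local, remote):
    """DNs whose remote copy is missing locally, or differs and is newer."""
    local_sum, remote_sum = summarize(local), summarize(remote)
    wanted = []
    for dn, (ts, cksum) in remote_sum.items():
        mine = local_sum.get(dn)
        if mine is None or (mine[1] != cksum and mine[0] < ts):
            wanted.append(dn)
    return wanted

def anti_entropy(a, b):
    """Peer-to-peer reconciliation between two copies of one EDB.

    Each side pulls full dumps of just the objects it is missing or
    holds stale, then a final summary comparison proves the two
    copies are in sync.
    """
    for dn in entries_to_pull(a, b):
        a[dn] = b[dn]
    for dn in entries_to_pull(b, a):
        b[dn] = a[dn]
    return summarize(a) == summarize(b)  # the 'third listing' check
```

Note what makes this bandwidth-friendly: only DNs, timestamps, and checksums cross the wire in bulk, and full object dumps are requested solely for the entries that actually differ.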
The message mailed actually contains a bodypart which is the network representation of the object, containing just those attributes or group members which changed (no need to mail around the whole object - who wants to ship a G3 fax picture of someone just to inform the world that their favorite drink changed?). Scaling drives you to think about reducing traffic. It happened to us, and it will happen to any sufficiently large (and thus interesting) distributed database.

Ed Reed
Xerox Corporation
Corporate InterNet