Re: root knowledge

yeongw@spartacus.psi.com Tue, 12 May 1992 16:29 UTC

Received: from nri.nri.reston.va.us by ietf.NRI.Reston.VA.US id aa24206; 12 May 92 12:29 EDT
Received: from nri.reston.va.us by NRI.Reston.VA.US id aa17772; 12 May 92 12:35 EDT
Received: from bells.cs.ucl.ac.uk by NRI.Reston.VA.US id aa17750; 12 May 92 12:35 EDT
Received: from spartacus.psi.com by bells.cs.ucl.ac.uk with Internet SMTP id <g.23173-0@bells.cs.ucl.ac.uk>; Tue, 12 May 1992 15:59:07 +0100
Received: from localhost by spartacus.psi.com (5.61/1.3-PSI/PSINet) id AA00492; Tue, 12 May 92 10:58:52 -0400
Message-Id: <9205121458.AA00492@spartacus.psi.com>
To: osi-ds@cs.ucl.ac.uk
Subject: Re: root knowledge
Cc: yeongw@psi.com
Reply-To: osi-ds@cs.ucl.ac.uk
In-Reply-To: Your message of Tue, 12 May 92 12:40:29 -0000. <199205121240.AA16195@mitsou.inria.fr>
Date: Tue, 12 May 1992 10:58:50 -0400
From: yeongw@spartacus.psi.com

This message actually has nothing to do with the management and
distribution of root knowledge. But Christian made a statement that
I absolutely feel obliged to comment on.

> The fact is that we cannot perform distributed data base access
> intelligently and resort a mixture of crude hierarchies and brute
> force replications. Hierarchies and caching are OK for a "read only"
> service, e.g. DNS. They make a stinking white page service.

I couldn't agree more. In fact I would state it even more strongly by
saying that a hierarchy is not even "OK" for "read only" services.

I will now don a few flame retardant suits :-) and say

	The single biggest problem with X.500 is that
	it structures data as a tree

At the risk of splitting very fine hairs, I'll say that the
above refers not to the way information is modeled (as a tree),
but to the way information is distributed in the 'database'
(in a tree-like way).

The fact that the DIB is structured as a DIT for the
purposes of determining the boundaries between the information
held by different DSAs means that it is very difficult to
represent many-many relationships in a way that makes searching
the database based on different relationships easy (or even
doable in the case of a large data set). Basically, if you embed
one relationship into the hierarchy used for distributing
information across DSAs, searching based on that one relationship
is going to yield adequate/reasonable/good performance. But
once you want to search on any other relationship besides
the one that is embedded into the DIT structure, search performance
becomes 'interesting' :-).
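
As a rough illustration of the point above, here is a minimal Python
sketch. It is a toy model, not how any real DSA works: the DSA names,
DNs, and the "role" attribute are all invented for the example. The
data is split across DSAs along organizational subtrees, so a search
based on that relationship touches one DSA, while a search based on
any other attribute has to ask all of them.

```python
# Toy model (hypothetical names throughout): each DSA masters one
# organizational subtree, so DSA boundaries follow the DIT.
DSAS = {
    "dsa-a": {
        "c=US/o=OrgA/cn=P1": {"role": "postmaster"},
        "c=US/o=OrgA/cn=P2": {"role": "engineer"},
    },
    "dsa-b": {
        "c=FR/o=OrgB/cn=P3": {"role": "postmaster"},
    },
    "dsa-c": {
        "c=GB/o=OrgC/cn=P4": {"role": "engineer"},
    },
}

# Which DSA holds which subtree (assumed to be known to the client).
SUBTREES = {
    "c=US/o=OrgA": "dsa-a",
    "c=FR/o=OrgB": "dsa-b",
    "c=GB/o=OrgC": "dsa-c",
}


def search_subtree(base):
    """Search along the relationship embedded in the DIT.

    Returns (matching DNs, list of DSAs contacted)."""
    dsa = SUBTREES[base]
    hits = sorted(dn for dn in DSAS[dsa] if dn.startswith(base + "/"))
    return hits, [dsa]


def search_attribute(attr, value):
    """Search on a relationship the DIT does not encode.

    Nothing says where matches live, so every DSA is contacted."""
    hits, contacted = [], []
    for dsa, entries in DSAS.items():
        contacted.append(dsa)
        hits.extend(dn for dn, attrs in entries.items()
                    if attrs.get(attr) == value)
    return sorted(hits), contacted
```

With three DSAs the difference is trivial, but the number of DSAs
contacted by search_attribute() grows with the size of the whole
Directory, not with the number of matches.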

The workaround that we (the Internet X.500 community) have been
adopting is to store reasonably autonomous subsets of information
in single DSAs. For example, entire organizational subtrees get
stored in one DSA. Then we play indexing games to make the performance
of searches based on different criteria (the "search indexes" :-))
reasonable within that one DSA, and hope that most searches don't
extend beyond the boundaries of the information stored in that one
DSA. Carrying this to an extreme, you end up with Christian's scenario
of having to store the population of France in a single DSA and play
the aforementioned indexing games to make searching on anything but
the relationship embedded into the actual DIT 'reasonable'.

Don't get me wrong: playing indexing games is not only "all right",
but should be encouraged. It solves half the problem, the half
having to do with making sure that searches that don't cross
DSA boundaries have reasonable performance. However, we still
need to work around the other half of the problem of what to do
with searches (in general, Directory operations) that have to
span multiple DSAs.
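
The two halves of the problem can be sketched in Python as follows.
This is only an illustration under assumed names (ToyDSA, search_all,
and the attributes are hypothetical): an inverted index makes an
attribute search cheap inside one DSA, but a search that is not
confined to that DSA still has to be sent to every DSA in turn.

```python
from collections import defaultdict


class ToyDSA:
    """A hypothetical DSA holding entries plus an attribute index."""

    def __init__(self, name, entries):
        self.name = name
        self.entries = dict(entries)      # DN -> attribute dict
        self.index = defaultdict(set)     # (attr, value) -> set of DNs
        for dn, attrs in self.entries.items():
            for attr, value in attrs.items():
                self.index[(attr, value)].add(dn)

    def search(self, attr, value):
        # Index lookup: no scan over self.entries is needed.
        return sorted(self.index.get((attr, value), ()))


def search_all(dsas, attr, value):
    """Fan the same search out to every DSA.

    Returns (matching DNs, number of DSAs asked)."""
    hits = []
    for dsa in dsas:
        hits.extend(dsa.search(attr, value))
    return sorted(hits), len(dsas)
```

The index helps each individual search() call; it does nothing to
reduce the number of DSAs that search_all() has to ask.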

[Also, to be a little irreverent before the almighty X.500 altar :-),
 I should point out that any number of database system vendors
 would be most happy to sell us products that could run circles
 around our X.500 implementations if all we wanted was high performance
 searching on a centralized -- in a single "DSA" -- database. X.500's
 strength is in the infrastructure it provides for distributing
 information, not in its ability to provide for implementations
 that have blindingly fast operations.]

At this point, I'll get up on one of my favorite soapboxes :-) and
suggest that a real Directory needs more than just the geographical
hierarchy (the "White Pages" namespaces) in order to be useful.
In addition to the geographical hierarchy, there is a need
to represent the information from other, non-geographical
namespaces in alternate DIT hierarchies. And, of course, the
relationships between the information in the various hierarchies
should also be represented, by means of pointers.

For starters, in the Internet, I think we need to get the
domain namespace and the IP address space into the Directory.
(The address space is a "namespace" of sorts -- and yes, I do mean
network addresses in general, not just IP addresses, but the reality
right now is that the pressing need is for IP address
representation.) There are a number
of relationships, network <--> network contact, domain <--> domain
contact, domain(s) <--> network(s) to name just three pairs, which are
best (from both modeling and performance standpoints) represented
by pointers between hierarchies, and not as explicit entries shoehorned
into the existing geographical namespace.
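
To make the idea of pointers between hierarchies concrete, here is a
small Python sketch. Everything in it is an assumption made for the
example: the naming of the domain and network hierarchies, the
attribute names, and the follow() helper are invented and do not come
from any schema. Each hierarchy keeps its own entries, and a
relationship is an attribute whose values are the names of entries in
another hierarchy.

```python
# Hypothetical entries from three hierarchies held in one toy DIB.
DIB = {
    # Geographical ("White Pages") hierarchy.
    "c=US/o=OrgA/cn=Hostmaster": {"mail": "hostmaster@example.com"},
    # Domain hierarchy.
    "domains/com/example": {
        "contact": ["c=US/o=OrgA/cn=Hostmaster"],
        "networks": ["networks/192.0.2.0"],
    },
    # Network (IP address) hierarchy.
    "networks/192.0.2.0": {
        "contact": ["c=US/o=OrgA/cn=Hostmaster"],
        "domains": ["domains/com/example"],
    },
}


def follow(dn, relationship):
    """Return the entries that `dn` points to for one relationship.

    An entry without that relationship yields an empty dict."""
    targets = DIB[dn].get(relationship, [])
    return {target: DIB[target] for target in targets}
```

For example, follow("networks/192.0.2.0", "contact") goes from the
network hierarchy straight to the contact's entry in the geographical
hierarchy, without the network itself having to be stored as an entry
somewhere under c=US.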

Two things though (which I have to mention because I've been
misunderstood before :-(): 

(a) I am not advocating constructing a separate hierarchy of
    aliases/pointers/what-have-you for every possible criterion a
    Directory user could base an operation on. Doing so is even
    less practical than trying to index every possible attribute
    in an entry within a DSA's 'database system'. Notice in the
    above that I tied the creation of a hierarchy to the existence
    of an (autonomous) namespace. Although there are certainly
    good cases for exceptions (an organizational role 'index'
    hierarchy for example), and I'll admit that I haven't thought
    this through completely (can't think this through really: need
    to deploy and play), I'll state very strongly that I think
    that alternate hierarchies should only be created to
    represent information from a separate 'namespace', and should
    actually contain useful information (i.e., shouldn't just be
    a tree of pointers). Of course, I'll probably end up eating
    these words later :-) :-).

(b) specifically in reference to putting the DNS into the DIB, all
    the DNS proponents are invited to note that I did not mention
    "domain name <--> IP address" as a relationship that needs to
    be represented above. Not that it isn't a useful relationship, of
    course, just that the Internet already has a perfectly good
    way of representing this relationship in the DNS system itself.
    The point is this: I'm not interested in playing the "my
    protocol is better than your protocol" game (especially with
    you DNS people since you have a working system, and we Directory
    folks don't :-) :-)). DNS information needs to go into the DIB
    so that *other* relationships can be represented. The fact that
    the domain <--> IP address relationship happens to "fall out"
    is, from my point of view, a bonus, not the motivation for the
    effort [and I do have an answer to the "why not move the White
    Pages information to the DNS, instead of moving the DNS
    information into X.500" question too, but this message is too
    long already ...].


Wengyik