RE: Keywords, direct navigation, and search layer 2 (was: RE: "so -called" keyword and layer 3)

Yves Arrouye <> Fri, 07 December 2001 16:36 UTC

Date: Fri, 07 Dec 2001 08:31:46 -0800
From: Yves Arrouye <>
Subject: RE: Keywords, direct navigation, and search layer 2 (was: RE: "so -called" keyword and layer 3)

Sorry for the over-quoting in this email but I'd rather not under-quote and
make snippets ambiguous.

> [Explanation of categories list and their use in the traditional
> world deleted.]
> Instead, someone comes to a database provider and says "I
> want to be listed this way, with a name string, a country, an
> industry code, etc".  And they can say that several times if
> they want to, varying the value for any of those facets.  If
> someone doesn't like it, they mount a challenge using
> _conventional_ mechanisms: the database providers are no more
> part of the problem than a newspaper would be if it ran an
> advertisement for a company under a name that was later
> challenged.

Well, in some ways I believe that should also be the case for today's
database providers. UDRP does not make them responsible, does it? But I can
understand that adding a category makes them look much more transparent to
trademark holders, since they provide a slot that makes the category obvious,
which certainly facilitates the job of trademark protection services. I just
don't believe that the category can be incorporated into a direct navigation
model for a long time, but that's okay.

> [...] To see if we finally understand each
> other, let me try to restate your comment above into the
> language of "dns search" (and the relevant information
> retrieval and classificatory systems literature as I
> understand it):
>   The database(s) for search layer 2 are going to contain a
>   full set of facets.  One can "leave one out" by asserting
>   that any searches that involve that facet should always
>   match, i.e., by giving it a "matches everything" value.  I
>   wouldn't recommend it, but that is a business decision
>   which we don't need to resolve and the marketplace will
>   figure out who is right.   And, since uniqueness is not
>   required in the database itself, I'm not going to use the
>   word "key": the database is not intrinsically relational in
>   normal form (although one might implement it that way);
>   keys are a function of search and retrieval strategies.

Yes. I have nothing against letting people give everybody category
information, as long as they understand that it is not used in, say, our
service. It is good information to have when returning result sets.

>   A search in that search layer can specify values for any
>   combination of facets that the searcher, or search-vendor,
>   finds appropriate.  Leaving one out is equivalent to "match
>   anything that happens to be there".  And the question of
>   how much fuzziness to permit is also a function of the
>   search mechanism.  (I would love to see a search product as
>   general as what I'm saying here implies, but I don't expect
>   to see one out of the laboratory, nor would I expect it to
>   work at scale or to be economically viable.  But I could be
>   wrong.)
>   To invent a plausible notation for talking about this (but
>   just a notation, not a norm), we might talk about a search
>   as specifying (since we have readers of this list who may
>   not be familiar with ABNF, I'm going to use "pure" BNF for
>   the notational syntax):
>      <search> :== "{" <facet-tuple-list> <referral-range> "}"
>      <facet-tuple-list> :== "{" <facet-tuple>
>      	 [<facet-tuple-list>] "}"
>      <facet-tuple> :== <facet-name> <facet-value>
>     	 <distance-indicator>
>   The (vague) semantics for the things that may not be
>   obvious, are:
>    "distance-indicator" specifies the degree of fuzziness
>    permitted.

The notion of fuzziness is likely to generate some good debate and, more
importantly, will differ between service providers. For example,
today we match without accents by default, yet have an option to obey
the accents. I am not sure how to express that using fuzziness, since "no
fuzziness" today means that "ladede" and "Ladédé" match (yes, case is
irrelevant too), yet we have an option of being even stricter than that.
Fuzziness -1? Similarly, we match the German "über" and "ueber" (but have no
option to undo that).

If there is a notion of fuzziness, but service providers implement widely
different rules as to what "fuzziness 0", and generally "fuzziness n + 1"
for any n, mean (to refer to one of your favorite topics of the day, I am
sure: we do not do TC/SC mapping, yet some people may want to), then how are
you going to present the options to users?
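
To make the accent and case example concrete, here is a minimal Python
sketch of the kind of matching rules described above. It is only an
illustration under my own assumptions, not a description of any provider's
actual implementation: the function names, the obey_accents flag and the
small German folding table are all hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (e.g. é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(text: str) -> str:
    """Hypothetical provider rule: treat ä/ö/ü as ae/oe/ue."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue"}
    return "".join(table.get(ch, ch) for ch in text)


def normalize(text: str, obey_accents: bool = False) -> str:
    """Reduce a name to the form that is compared for an 'exact' match."""
    # Case is always irrelevant in this sketch.
    text = unicodedata.normalize("NFC", text).casefold()
    # The German folding cannot be switched off, mirroring the text above.
    text = fold_german(text)
    if not obey_accents:
        text = strip_accents(text)
    return text


def matches(query: str, stored: str, obey_accents: bool = False) -> bool:
    """True if both strings normalize to the same comparison form."""
    return normalize(query, obey_accents) == normalize(stored, obey_accents)
```

With these assumed rules, matches("ladede", "Ladédé") holds by default but
fails with obey_accents=True, while "über" and "ueber" match in both modes.
Note that the stricter mode has no natural place on a "fuzziness 0, 1, 2..."
scale, which is exactly the difficulty raised above.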

>    "referral-range" specifies how far to go down a referral
>    (or "search the next database") chain if a match is not
>    found in the initial database search.   I'd guess it would
>    be expressed in a hop-count TTL, but haven't worked all of
>    the cases through.
>    In both cases, I'm not sure that the value can really be
>    expressed as an integer or real scalar, but let's try to
>    keep at least the example simple.
> Now, in that language, your "direct navigation" keyword
> lookup process (and key), as I now understand it, might be
> expressed as:
>  {{ name-string "common name" 0 }
>   { geographic-location "country" 0 }
>   { language "language-id" 0 }
>   { industry-code "service type" 0 }
>   0 }

Well, if you want to talk about what we do and believe we'll still do, I
think it's more like (taking a specific example):

    { { name-string "ietf" 0 }
      { geographic-location "US" 0 }
      { language "en-US" 0 }
      0 }

where the querier would leave out the industry-code facet. Now, one
interesting question is what happens if you actually send:

    { { name-string "ietf" 0 }
      { geographic-location "US" 0 }
      { language "en-US" 0 }
      { industry-code "NNN" 0 }
      0 }

(assuming NNN is an industry-code relevant for the IETF). Depending on the
service you are sending things to, you may get back:

1. A record that really has "NNN" as the industry-code. This is the case
when the service uses industry-code as part of its match facets.

2. A record that has "MMM" as the industry-code. That may be the case if the
service has an industry-code facet, but does not care about industry codes
at all and wants to return a match on the other facets.

3. A record that has a null industry-code. Even after we agree that all
databases want to have all facets, it is not reasonable to assume that
existing systems with millions of names will be able to go back to every
single one to fill in the industry-code facet in a reasonable amount of time.
So we need to assume that these names will all have a null industry-code, and
I think it is reasonable to also say that, just as sending null in the
query means "match everything", a specific value is always matched by a null
value in the database.
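
The three outcomes can be sketched in a few lines of Python. This is only
an illustration under assumptions of mine: records and queries are plain
dictionaries, None stands for a null or omitted facet, every distance is 0,
and the names facet_matches, search and match_facets are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def facet_matches(query_value: Optional[str], stored_value: Optional[str]) -> bool:
    """Distance-0 comparison where null on either side matches anything."""
    if query_value is None:    # facet left out of the query
        return True
    if stored_value is None:   # legacy record that never filled the facet
        return True
    return query_value == stored_value


def search(query: Record, records: List[Record],
           match_facets: Sequence[str]) -> List[Record]:
    """Return the records that agree with the query on every matching facet."""
    return [r for r in records
            if all(facet_matches(query.get(f), r.get(f)) for f in match_facets)]


ALL_FACETS = ("name-string", "geographic-location", "language", "industry-code")
BASE = {"name-string": "ietf", "geographic-location": "US", "language": "en-US"}

R_NNN: Record = {**BASE, "industry-code": "NNN"}
R_MMM: Record = {**BASE, "industry-code": "MMM"}
R_NULL: Record = {**BASE, "industry-code": None}
RECORDS = [R_NNN, R_MMM, R_NULL]

QUERY: Record = {**BASE, "industry-code": "NNN"}

# Cases 1 and 3: the service matches on industry-code, nulls still match.
strict_result = search(QUERY, RECORDS, ALL_FACETS)
# Case 2: the service stores industry-code but ignores it when matching.
lax_result = search(QUERY, RECORDS, ALL_FACETS[:3])
```

Here strict_result holds the "NNN" and null records, while lax_result also
holds the "MMM" record, which is the behavior the application cannot
anticipate from the query alone.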

The case I have philosophical problems with is case number 2, because I
consider it a violation of implicit assumptions made by the application (like:
a distance of 0 guarantees a match against either the same value or null),
unless there is a way for the application to know that the
service may behave this way. That is why I once proposed that we should have
a way to discover the facets of a given service, so that some could be
marked as purely descriptive rather than used for matching. I got back some
comments about "discovery of properties never work", but I think that a mix
of well-known properties plus some discovery is both doable and will work
well. If we do not have such an option of discovery, then the service does
not have any option but to "cheat", and I believe that for something as
complex for users to grasp as industry categories, the user will be the victim.
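
As a rough illustration of what such discovery could give an application,
here is a small Python sketch. Nothing in it is an agreed format: the
ServiceDescription structure, the FacetRole values, the service name and
the ignored_facets helper are all hypothetical, and how a description would
actually be fetched from a service is left open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class FacetRole(Enum):
    MATCHING = "matching"        # used when comparing query and record
    DESCRIPTIVE = "descriptive"  # stored and returned, never matched on


@dataclass(frozen=True)
class ServiceDescription:
    """What a service would publish about its facets (assumed format)."""
    name: str
    facets: Dict[str, FacetRole]


def ignored_facets(service: ServiceDescription, query: Dict[str, str]) -> List[str]:
    """Facets the query specifies that the service will not match on.

    Facets unknown to the service are reported as ignored too.
    """
    return sorted(f for f in query
                  if service.facets.get(f) is not FacetRole.MATCHING)


# Hypothetical direct-navigation service: industry-code is descriptive only.
EXAMPLE_SERVICE = ServiceDescription(
    name="example-keywords",
    facets={
        "name-string": FacetRole.MATCHING,
        "geographic-location": FacetRole.MATCHING,
        "language": FacetRole.MATCHING,
        "industry-code": FacetRole.DESCRIPTIVE,
    },
)

EXAMPLE_QUERY = {
    "name-string": "ietf",
    "geographic-location": "US",
    "language": "en-US",
    "industry-code": "NNN",
}
```

Calling ignored_facets(EXAMPLE_SERVICE, EXAMPLE_QUERY) lists
"industry-code", so the application knows before sending the query that a
returned record may carry a different industry code, and can tell the user
so instead of leaving the service to "cheat" silently.
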

> Where the first four null values indicate "exact match"
> (i.e., no distance permitted between the search value and the
> database value) and the last one would indicate "no
> referrals" (i.e., if it isn't found in your database, quit
> and return "not found").  The latter is, I believe, necessary
> to enforce/preserve your particular brand of keyword-based
> uniqueness.

I have a couple of questions about referrals. It looks like you are assuming
that the server will follow the referrals, not the application. Is that
true? Also, do you think that the referral information will be local to the

> Aside: Where your system and this may get into trouble is
>   this model assumes that the geographic-location, language,
>   and industry-code facets will have values based on
>   established, consensus-standardized, lists from which the
>   database entries are merely choices.

We do, yes, and I think that's okay.

>                                         Search vendors don't
>   get to make up either the facet names or the names of the
>   category values.  Without that constraint, we have a bigger
>   mess than the LDAP one, with everyone essentially selecting
>   their own schema and values.

I believe each facet can be standardized, as well as its semantics. I also
believe in freedom of adding new facets, and either discovering them
(letting different providers add more meta-content to their objects) or
publishing them. I also have the feeling that it is unrealistic to think
that each service using this new layer will have the exact same set of
facets. I imagine it could be reasonable, for example, for a "mobile web"
service to match on a "position" facet that would be filled by the mobile
provider. That could give new meaning to the Joe's Pizza problem. But I

>                                        In particular, if your
>   "service type" isn't isomorphic with the WIPO/Nice list (or
>   whatever else is chosen), you will need a mapping function,
>   s.t. that <facet-tuple> element becomes
> 		{ industry-code MapToNice ("service type") 0 }
>   I don't see a problem with doing that, and suspect your
>   going through your service-types and checking them against
>   the Nice list might be intellectually interesting (and that
>   it would ultimately provide value to your customers).

Service type is not industry code (see my previous email). We do not use
industry codes at this time.

> [...] reasons.  So permitting only an exact-match {common name,
> country, language, service type} mechanism to be exposed may
> make much more sense than doing something more general.  But
> the general _model_ is important, if only because I can prove
> it is scalable while I believe that, in the last analysis,
> your system is still subject to what we have come to call the
> "Joe's Pizza" problem -- far less quickly than "" or
> "" would be, but the potential is there.

I totally agree, and do want that outcome too. Having the general model work
is also a very good way to ensure that as users and applications become more
sophisticated, new facets that open the namespace can be added without
trouble, after having been bootstrapped through their use as descriptive