Essence prototype announcement

Mike Schwartz <schwartz@latour.cs.colorado.edu> Wed, 13 January 1993 21:35 UTC

Date: Wed, 13 Jan 1993 10:46:56 -0700
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: Mike Schwartz <schwartz@latour.cs.colorado.edu>
Message-Id: <199301131746.AA25471@latour.cs.colorado.edu>
To: Essence-announcement-list@latour.cs.colorado.edu
Subject: Essence prototype announcement

Essence is a resource discovery system that exploits file semantics to
index both textual and binary files.  Essence generates space-efficient
indexes, as well as summaries that can be used to browse files before
retrieving them across slow network links.  Essence understands nested
file structures (such as uuencoded, compressed "tar" files) and
recursively unravels such files to generate summaries for them.  These
features make Essence useful in a number of settings, such as
anonymous FTP archives.  The prototype generates WAIS-compatible
indexes, allowing WAIS users to take advantage of the Essence indexing
methods.
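
As a rough illustration of what "recursively unravels" can mean, the
sketch below peels container layers off a byte string until only leaf
files remain.  It is not the Essence implementation: the function name
unravel, the "!" path separator, and the depth limit are assumptions
made for this example, and gzip stands in for the uuencode and compress
layers, which the sketch does not handle.

```python
import gzip
import io
import tarfile


def unravel(name, data, depth=0, max_depth=8):
    """Yield (path, bytes) for every leaf file hidden inside `data`."""
    if depth < max_depth:
        # Layer 1: gzip compression (a stand-in for compress/uuencode).
        if data[:2] == b"\x1f\x8b":
            inner = name[:-3] if name.endswith(".gz") else name
            yield from unravel(inner, gzip.decompress(data),
                               depth + 1, max_depth)
            return
        # Layer 2: tar archive; collect each regular member.
        try:
            with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
                members = [(m.name, tar.extractfile(m).read())
                           for m in tar.getmembers() if m.isfile()]
        except tarfile.TarError:
            members = None  # not a tar archive
        if members is not None:
            for member_name, member_data in members:
                yield from unravel(name + "!" + member_name, member_data,
                                   depth + 1, max_depth)
            return
    # Not a recognized container: a leaf that a summarizer would process.
    yield name, data
```

Each yielded pair names a leaf by the chain of containers it was found
in, which is the information a summary of a nested file would need.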

WAIS users can try Essence using the ".src" file enclosed below.  This file
also describes where to get the prototype source code and a paper about this
system.

	Darren Hardy and
	Michael Schwartz
	Dept. of Computer Science
	Univ. of Colorado - Boulder

-------------------------------------------------------------------------------

(:source
   :version		3
   :ip-address		"128.138.243.151"
   :ip-name		"ftp.cs.colorado.edu"
   :tcp-port		8000
   :database-name 	"aftp-cs-colorado-edu"
   :cost 		0.00
   :cost-unit 		:free
   :maintainer 		"hardy@cs.colorado.edu"
   :description 	
"You can use this WAIS server to search and retrieve files from the
anonymous ftp archive on ftp.cs.colorado.edu [128.138.243.151].  We
used Essence, a resource discovery system based on semantic file
indexing, to build the WAIS index for this server.  As explained below,
Essence currently only allows the retrieval of file summaries through
WAIS.  To retrieve entire files, use anonymous ftp on ftp.cs.colorado.edu.

Essence exploits file semantics to index both textual and binary
files.  By exploiting semantics, Essence extracts keywords that
summarize a file, and generates a compact yet representative index.
Essence understands nested file structures (such as uuencoded,
compressed, ``tar'' files), and recursively unravels such files to
generate summaries for them.  Essence generates indexes that are ten
times smaller than WAIS indexes, but retain the fine-grained
information access that WAIS's full-text indexes provide.

Furthermore, Essence generates WAIS-compatible indexes, allowing WAIS
users to make use of Essence's indexing capabilities.  This is one of
the ways that the Networked Resource Discovery Project at the
University of Colorado has extended the types of information that
WAIS can handle.

If you would like to learn more about Essence, you can obtain the
source to the Essence prototype and a paper that appears in the
proceedings of the 1993 Winter USENIX Technical Conference, San Diego,
CA, January 1993,
pp. 361-374.  Both the paper and the prototype are available via 
anonymous ftp from ftp.cs.colorado.edu in /pub/cs/distribs/essence.  
Or search for the keyword 'Essence' using this WAIS server to find all 
of the files on ftp.cs.colorado.edu that are related to Essence; you 
will find the files for both the paper and the prototype.

This WAIS server was created in December 1992 by Darren R. Hardy and
Michael F. Schwartz as part of the Networked Resource Discovery
Project.  You may reach them at the Department of Computer Science,
University of Colorado, Boulder, CO  80309-0430, or via email at
hardy@cs.colorado.edu and schwartz@cs.colorado.edu.

Below is some more information about the WAIS interface to Essence.

	Essence exports its indexes through WAIS's search and
	retrieval interface, allowing users to use tools such as
	waissearch and xwais, a graphical user interface for the X
	Window System.  In order to generate WAIS-compatible indexes,
	Essence uses WAIS's indexing software to index the Essence
	summary files.  This mechanism generates full-text WAIS
	indexes from the Essence summary files.

	We modified the WAIS indexing mechanism to understand the
	format of the Essence summary files, so that it generates
	meaningful WAIS headlines.  These headlines provide users
	with a short description of a single file, usually a
	filename.  With Essence, headlines represent a file's core
	filename, its actual filename, and its file type.

	To support additional file types, WAIS must be recompiled
	with new procedures that understand these file types.  With
	Essence, one need only write a new summarizer, add its name
	to a configuration file, and add new heuristics for
	identifying the file type; no recompilation is necessary.
	In this sense, Essence modularizes the typed-file indexing
	extensions that WAIS can use, because it removes the
	keyword extraction process from WAIS and places it instead
	in Essence.  Essence is therefore better suited than WAIS to
	incorporating new file types, and can be quickly adapted to
	become a comprehensive indexing system.
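
	As a rough illustration of this modular design, the sketch
	below keeps the type heuristics and the summarizer names in
	two tables.  Every name in it (SUMMARIZERS, HEURISTICS,
	identify, summarizer_for, and the summarizer names themselves)
	is hypothetical; the actual Essence configuration format is
	not described here.

```python
import os

# Hypothetical configuration: file type -> summarizer command name.
# In a real deployment this table would be read from a configuration file.
SUMMARIZERS = {
    'PostScript': 'ps_summarizer',
    'C': 'c_summarizer',
    'ManPage': 'man_summarizer',
    'README': 'text_summarizer',
}

# Hypothetical heuristics: (test on the filename, file type), tried in order.
HEURISTICS = [
    (lambda name: os.path.basename(name).startswith('README'), 'README'),
    (lambda name: name.endswith('.ps'), 'PostScript'),
    (lambda name: name.endswith('.c'), 'C'),
    (lambda name: name[-2:] in ('.1', '.8'), 'ManPage'),
]


def identify(name):
    # Return the first file type whose heuristic accepts the name.
    for test, file_type in HEURISTICS:
        if test(name):
            return file_type
    return None


def summarizer_for(name):
    # Look up the summarizer configured for the identified type, if any.
    return SUMMARIZERS.get(identify(name))


# Adding a new file type touches only the two tables, not the lookup code.
SUMMARIZERS['TeX'] = 'tex_summarizer'
HEURISTICS.append((lambda name: name.endswith('.tex'), 'TeX'))
```

	In this sketch, supporting a new type means adding one table
	entry and one heuristic; the lookup code itself is unchanged.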

	The following waissearch output shows an example search of
	an index generated by Essence of the ftp.cs.colorado.edu
	anonymous FTP file system.  It shows an ordered list of the
	ten files that best match the keyword netfind.  Netfind is
	an Internet user directory service.  The headlines have up
	to three fields representing the matching file: the core
	filename, the filename (if different from the core
	filename), and the file type.
 
------------------------------------------------------------

csh% waissearch netfind
   1:  /cs/ftp/techreports/schwartz/PostScript/Techniques.Wide.Area.ps.Z 
       Techniques.Wide.Area.ps PostScript

   2:  /cs/ftp/techreports/schwartz/PostScript/ALL.PS.tar.Z 
       PostScript/Techniques.Wide.Area.ps PostScript

   3:  /cs/ftp/distribs/netfind/netfind3.10.tar.Z ServerShell/nsh.c C

   4:  /cs/ftp/distribs/netfind/README  README

   5:  /cs/ftp/distribs/netfind/netfind3.10.tar.Z README README

   6:  /cs/ftp/distribs/netfind/netfind3.10.tar.Z Doc/netfind.1 ManPage

   7:  /cs/ftp/techreports/schwartz/PostScript/Proj.Overview.ps.Z 
       Proj.Overview.ps PostScript

   8:  /cs/ftp/techreports/schwartz/PostScript/RD.Comparison.ps.Z 
       RD.Comparison.ps PostScript

   9:  /cs/ftp/techreports/schwartz/PostScript/ALL.PS.tar.Z 
       PostScript/Proj.Overview.ps PostScript

   10: /cs/ftp/techreports/schwartz/PostScript/ALL.PS.tar.Z 
       PostScript/RD.Comparison.ps PostScript
csh%

------------------------------------------------------------

	Consider the effectiveness of the example search shown
	above.  The best match is a PostScript paper that discusses
	a number of techniques for distributed information systems,
	with particular emphasis on techniques demonstrated by
	Netfind; the second match is the same file, but found in
	the compressed tar distribution ALL.PS.tar.Z.  The third
	match is the C source code for the interactive user
	interface to Netfind.  The fourth match is the README file
	found in the Netfind distribution directory; the fifth
	match is the same file, but found in the compressed tar
	distribution netfind3.10.tar.Z.  The sixth match is the
	UNIX manual page for Netfind.  The remaining matches are
	PostScript papers in which Netfind is discussed.
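
	The headline layout used in the example can be read back
	mechanically.  The sketch below is one possible parser, not
	part of Essence or WAIS; the names Headline and parse_headline
	are made up for this illustration, and it assumes that no
	field contains whitespace.

```python
from collections import namedtuple

Headline = namedtuple('Headline', ['core', 'filename', 'file_type'])


def parse_headline(text):
    # Split on whitespace; this also joins headlines wrapped over two lines.
    fields = text.split()
    # Drop a leading rank such as '3:' if present.
    if fields and fields[0].endswith(':') and fields[0][:-1].isdigit():
        fields = fields[1:]
    if len(fields) == 3:
        return Headline(*fields)
    if len(fields) == 2:
        # The filename is omitted when it equals the core filename.
        core, file_type = fields
        return Headline(core, core, file_type)
    raise ValueError('unrecognized headline: ' + text)
```

	Applied to the third headline above, this yields the core
	filename of the tar distribution, the filename
	ServerShell/nsh.c, and the file type C.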

	In WAIS, a user retrieves files by selecting a matching
	headline.  With Essence, if the headline represents a file
	hidden within a nested file (such as the first headline in the
	example), the summary file is retrieved, instead of retrieving
	the hidden file itself.  If the headline represents a plain
	file (such as the fourth headline in the example), the summary
	file is also retrieved.  This functionality requires allocating
	storage for both the required summary files and the index.
	However, it allows users to browse through remote file systems
	by retrieving and viewing small summary files without having to
	retrieve complete files.  This is useful when trying to decide
	whether to transfer large files across a slow network.  
" 
)
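
The source description above is a parenthesized list of :keyword value
pairs.  For readers who want to pull the connection details out of such
a file, here is a minimal sketch.  It is not the WAIS parser: the names
SAMPLE, FIELD, and parse_source are assumptions made for this example,
SAMPLE is an abbreviated stand-in for the file above, and the sketch
assumes that string values contain no embedded double quotes.

```python
import re

# Abbreviated stand-in for a WAIS source description.
SAMPLE = '''(:source
   :version 3
   :ip-name "ftp.cs.colorado.edu"
   :tcp-port 8000
   :database-name "aftp-cs-colorado-edu"
   :cost-unit :free
   :description "A multi-line
description string."
)'''

# A field is a :keyword followed by a quoted string, a number, or a :symbol.
# The lookbehind skips the opening "(:source", which has no value of its own.
FIELD = re.compile(r'(?<!\S):([a-z-]+)\s+("[^"]*"|[0-9.]+|:[a-z-]+)')


def parse_source(text):
    """Return the fields of a source description as a dict."""
    fields = {}
    for key, raw in FIELD.findall(text):
        if raw.startswith('"'):
            fields[key] = raw[1:-1]          # strip the surrounding quotes
        elif raw.startswith(':'):
            fields[key] = raw                # keep symbols such as :free
        else:
            fields[key] = float(raw) if '.' in raw else int(raw)
    return fields
```

A client could then read the host, port, and database name from the
resulting dictionary under the keys ip-name, tcp-port, and
database-name.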