Electronic librarians, intelligent network agents, and information catalogues

Draft paper by Edward A. Fox

1. Rationale for Approach

People need information. It is one of the most basic requirements for survival, especially for those in the Information Age. How can we use computers, the most powerful devices yet devised, to enhance human intelligence, in particular to help us organize, find, get, and utilize relevant information? How can we specify such a computer solution to the universal problem of information need? How can we make intelligent digital libraries (Fox, 1994)?

Scenario-based design (Carroll & Rosson, 1992) is one of the best ways to develop a solution. Since we seek an optimal design, we posit a set of oracles that can undertake the desired services, with the support of the best possible description of the available collection of information. Thus, our oracles can serve as ideal librarians and intelligent agents, and can work with a perfect information catalog.

By considering upper-bound or best-of-all-possible-worlds scenarios, we should be able to develop requirements for an ideal computerized system to help with today's networked information systems. Those requirements are assured to be relevant to our problem, by construction, if we assume that time-honored efforts at cataloging by library scientists, on the one hand, and modern work to build intelligent intermediaries and agents by computer scientists, on the other hand, are sensible.

2. From Centralized to Distributed

Discourse analysis techniques have been used to study what happens when an expert intermediary assists an end-user (Belkin et al., 1987a). The basic scenario is that a user approaches a librarian, who helps that user find relevant information. In the optimal case, the user works with an oracle (i.e., expert librarian) who: helps elaborate the problem (or anomalous state of knowledge), describes the user and user background, understands the topic or subject, constructs queries or other requests for desired information, knows how to access and present the relevant information, and provides explanations as appropriate. Scenarios (Belkin et al., 1983) and systems (Belkin et al., 1987b) have been devised for this situation. However, even though the computer solution calls for a distributed expert-based information system, the overall architecture of (user)-(intermediary)-(information-base) is still fundamentally centralized.

To build a Global Information Infrastructure we must deal instead with a distributed architecture, possibly like that adopted by the Hyper-G group (Maurer, 1996). In the conventional situation this involves the following chains of communication:

(user)-(intermediary)-(catalogs)

and

(user)-(collection-of-intermediaries)-(collection-of-information-bases).

In the electronic version of this situation, "electronic librarians" will replace the human search intermediaries who assist end-users, "intelligent network agents" will replace the collection of intermediaries and other librarians who work with catalogs and indexes, and "information catalogs" will help with resource discovery and information retrieval. Then we can serve: patrons of libraries scattered around the globe, Internet clients dealing with the set of thousands or millions of disparate information servers, or, in the most general case, users of the emerging worldwide (virtual) digital library.

3. Functional Requirements

For these services to be provided, we must specify the new functions to be carried out. They deal with: constructing catalogs, searching in catalogs, discovering desirable resources through the use of catalogs, effectively utilizing the information network, searching in parallel or in a serial sequence through the distributed collection of information resources (as in the Networked Computer Science Technical Report Library, at http://www.ncstrl.org), and fusing or combining results from distributed searches (Belkin et al., 1995; Viles & French, 1995). Because agents are involved there are additional functions, such as building of knowledge representations by agents for record keeping and continuation of sessions, interchanging knowledge between agents to facilitate their cooperative pursuit of user goals (Prasad et al., 1996), and communicating directly or indirectly between users and agents. In the context of digital libraries, there are additional demands from information providers regarding: market analysis, economic modeling, quality control, intellectual property rights management, copyright protection, subscription handling, personalization for users, version control, usage analysis, and tuning of services offered. In the context of collaboration among users there are further requirements regarding group access control, tailored information sharing, communication between user agents, and support for distributed problem solving (Prasad et al., 1996). For further details on the requirements for digital libraries, see (Gladney et al., 1996).
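The function of "fusing or combining results from distributed searches" can be illustrated with a minimal Python sketch. It assumes one simple rule: min-max normalize the scores from each source, then sum them per document (a CombSUM-style combination). The function names, document identifiers, and the choice of normalization are illustrative assumptions, not a method prescribed by this paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize one source's scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(result_sets: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sum normalized scores per document and rank by the total."""
    totals: Dict[str, float] = defaultdict(float)
    for scores in result_sets:
        for doc, s in normalize(scores).items():
            totals[doc] += s
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical result sets from two distributed servers with
# incomparable raw score scales.
server_a = {"doc1": 12.0, "doc2": 8.0, "doc3": 4.0}
server_b = {"doc2": 0.9, "doc4": 0.5, "doc1": 0.1}
ranking = fuse([server_a, server_b])
```

Here doc2 ranks first because both servers score it well, even though neither raw scale is comparable to the other. A weighted sum could give priority to sources known to be more productive.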

4. Scenarios for the Humanities

Designing the ideal digital library system (IDLS) to support the humanities can be accomplished in part by providing detailed scenarios involving: expert electronic librarians, intelligent network agents, and information catalogs. The scenarios below deal with a sampling of representative situations, focusing on the handling of textual information.

First, consider the work of constructing a new dictionary. Picture a talented lexicographer working with a collection of oracles to construct a new general purpose dictionary of English. One oracle takes the job of collecting works to make up a representative corpus. Using the IDLS's catalog of entries, it samples them to provide a suitable mixture of genres and to balance according to economic, educational, geographic, political, religious, and social distributions of the target population. The catalog's agent helps with the sampling and balancing tasks, calling on agents that work with the various aspects of user models (and so keep track of statistics about the target population). Each significant work selected for inclusion in the corpus has its own "work agent." Each work agent has collected data on the most commonly accessed parts of its work. The work agent can use that data along with a word-sense level inverted file to find the best quotations and usage examples available to accompany the definition for any given word. A clustering agent works with the word-sense disambiguated concordance-type data extracted from the full set of works, to assemble the raw material for constructing dictionary entries. A browsing agent coordinates with the clustering agent and an entry agent to allow the lexicographer to experiment with groupings, orderings, annotation and hypertext-style linking back to the contexts of appearance in the corpus. An assembly agent identifies all the parts for the dictionary entry, generates suitable SGML according to a Document Type Definition (DTD) based on the TEI guidelines (Ide & Veronis, 1995), and produces links to other dictionary entries as well as to the original works in the corpus. Finally, an editor agent communicates with the lexicographers and the dictionary editor, ensuring consistency and high quality in both the electronic and print versions of the dictionary.

Second, consider the task of assembling and analyzing the works of a great writer. A humanities scholar is the world's expert on Miss X, and wishes to assemble and use an electronic archive of the works of X along with all the relevant interpretive commentary, background (cultural, historic, and social) information, correspondence, and biographical documentation. That archive must be defined by an electronic catalog of suitable works, which can be derived from the IDLS catalog by an agent that constructs subsidiary catalogs for specific areas of interest. The subsidiary agent works not only with the IDLS catalog but also with search agents whose expertise is to handle the short, semantically rich, highly structured catalog entries. The scholar cooperates with an electronic librarian through a number of sessions of searching and browsing to provide various result sets, which, after suitable culling, provide the basis for the subsidiary catalog for Miss X. Through a type of scatter-gather operation (Cutting et al., 1992), the subsidiary catalog is repeatedly expanded and refined to give optimal coherence and coverage. A citation network agent uses the transitive closure operation to chase down citation chains, and the literary clustering agent uses co-citation and bibliographic coupling measures to organize the many levels of interpretation. (This agent was previously "trained" on the Responsa collection, and so can easily sort out the various layers of interpretation and their dependencies.) A timeline agent prepares an historical organization for the corpus, and a variety of special purpose indexing agents prepare views for each genre found in the collection. 
Our humanities scholar can employ agents to help look for patterns (e.g., those found by Ide in studying Blake (Ide, 1989), or those related to repeating sequences (Siochi & Ehrich, 1991)), work with raw frequency data or higher level statistical analyses, browse at high speed and with great flexibility, construct a wide variety of hypertext trails, visualize new organizations, or prepare courseware (e.g., reading assignments, exploratory games, and test questions). A frequently asked questions (FAQ) agent monitors accesses to the archive, noting users' interests and what they find, as well as the additional material recommended by the humanities scholar, and prepares an optimally structured collection of questions/needs and replies/pointers. A collaboration agent helps groups of students, historians, writers and others to work together to discover new perspectives that can in turn be added to the emerging, dynamic digital library for X.
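The operations attributed above to the citation network agent and the literary clustering agent can be sketched in a few lines of Python. The citation graph, the work identifiers, and the function names are hypothetical. The measures are plain counts, so any weighting or normalization a real agent would apply is left out.

```python
from typing import Dict, Set

# Hypothetical citation graph: each work maps to the works it cites.
cites: Dict[str, Set[str]] = {
    "A": {"B", "C"},
    "B": {"C", "D"},
    "C": {"D"},
    "D": set(),
    "E": {"B", "C"},
}


def citation_closure(work: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """All works reachable from `work` by following citation chains."""
    seen: Set[str] = set()
    stack = list(graph.get(work, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def bibliographic_coupling(a: str, b: str, graph: Dict[str, Set[str]]) -> int:
    """Number of references that works `a` and `b` share."""
    return len(graph.get(a, set()) & graph.get(b, set()))


def co_citation(a: str, b: str, graph: Dict[str, Set[str]]) -> int:
    """Number of works that cite both `a` and `b`."""
    return sum(1 for refs in graph.values() if a in refs and b in refs)
```

In this toy graph, works A and E share two references, so they are coupled with strength 2, and works B and C are cited together by both A and E, giving a co-citation count of 2.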

Finally, consider developing a digital library from the research of graduate students. ETD International is a hypothetical non-profit corporation seeking to facilitate graduate education as well as technology and knowledge transfer for the benefit of society. Its focus is on electronic theses and dissertations (ETDs). It maintains a comprehensive electronic catalog of all graduate student publications and an electronic archive containing them all. It constructs agents to help with ETD preparation, review, abstracting, cataloging, archiving, searching, browsing, and reuse. There are agents to aid the authors, agents to help members of graduate committees, agents to help graduate schools enforce local requirements, agents to transform chapters to journal articles or vice versa, agents to assist in the preparation of talks and presentations of the main findings of theses, agents to help with the production of reference sections as well as the merger of those into a union bibliography (with automatic annotations, extracted from "related works" chapters and then suitably merged together), and agents to help instructors reuse parts of ETDs (e.g., tables, figures) for standard courses or advanced seminars. Translation agents help authors convert typewritten ETDs or those prepared with a standard word processor into a suitably marked up SGML form, using ETD International's family of DTDs that are based on the TEI guidelines (Sperberg-McQueen & Burnard, 1994). Specialized translation agents are called in to assist with difficult cases, like those of: complex tables, long mathematical proofs, computer algorithms in any of a number of programming languages (possibly enhanced with algorithm visualizations), virtual reality explorations, flexible simulations, densely linked hyperbases, long animations, or interwoven literary analyses. 
Database agents manage the survey results, other raw data, statistical analysis conclusions, derived (2- and 3-D) graphic summaries, and spreadsheets that provide the support for many ETDs. Hypermedia agents manage the various multimedia objects, their interrelationships (e.g., time based, link based, or semantic network based), and their coordinated presentation on a variety of output devices with a broad range of size and quality demands. Transfer agents, tailored to various types of users and applications, support technology transfer to scientific or engineering companies as well as knowledge transfer to "think tanks", policy boards, professional associations and other scholarly groups. Many of the agents developed by ETD International inherit methods of search from a generic agent that looks for an "object" in an ETD, according to its structure, content, link pattern, and context of appearance, specified in a general purpose HyTime/SGML-oriented query language. Those interested in the ETD archive can purchase and personalize an agent to ensure notification at time of publication of any new ETD which, in whole or part, is likely to be considered relevant. Many journal publishers employ such agents to alert them regarding prospective submissions, as do book publishers interested in the latest findings. Well known researchers employ slightly modified agents of this type so they can find new scholars to help liven up panel discussions at conferences with exciting new results, to recruit for their newest funded projects, or to hire into post-doc positions or tenure-track faculty slots.
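The idea of specialized agents inheriting search methods from a generic agent can be sketched as a small Python class hierarchy. Everything here is hypothetical: the `EtdObject` fields, the class names, and the keyword arguments, which stand in for the HyTime/SGML-oriented query language and do not model it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EtdObject:
    """A hypothetical unit inside an ETD, such as a table or a figure."""
    kind: str                                        # structure
    text: str                                        # content
    links: List[str] = field(default_factory=list)   # outgoing link targets
    section: str = ""                                # context of appearance


class GenericAgent:
    """Finds objects by structure, content, link pattern, and context."""

    def find(self, objects: List[EtdObject], kind: str = "",
             contains: str = "", links_to: str = "",
             section: str = "") -> List[EtdObject]:
        def matches(obj: EtdObject) -> bool:
            # An empty criterion means "do not filter on this aspect".
            return ((not kind or obj.kind == kind)
                    and (not contains or contains.lower() in obj.text.lower())
                    and (not links_to or links_to in obj.links)
                    and (not section or obj.section == section))
        return [obj for obj in objects if matches(obj)]


class ReuseAgent(GenericAgent):
    """Specialized agent that helps instructors reuse tables and figures."""

    def reusable(self, objects: List[EtdObject], topic: str) -> List[EtdObject]:
        # The search itself is inherited; only the task-specific policy is new.
        return [obj for kind in ("table", "figure")
                for obj in self.find(objects, kind=kind, contains=topic)]
```

A notification agent could be built the same way, by storing a user's criteria and running the inherited `find` over each newly published ETD.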

5. Examples of "Intelligent" Behavior

The ideal digital library system should behave "intelligently". We consider "intelligence" to be of value in a number of ways. Two of these seem of greatest importance. First, there should be no obvious mistakes. In large part, this means that an extensive set of cases should be stored, accompanied by the proper action or handling. For example, "stemming" agents used in information retrieval should be replaced with linguistically correct routines to avoid losses in recall due to understemming (e.g., not conflating "woman" and "women") and losses in precision due to overstemming (e.g., conflating "analytical" and "analyses"). Similarly, stop word removal and tokenization should be properly coordinated so that surface forms like "AT&T" are connected to the correct corporate entity instead of discarded after reduction to the word "at" and the character "t". When a user desires to expand a search, and an agent offers a set of new terms to consider adding to the query, it should ensure that those terms share a relevant semantic field with what has appeared before (e.g., not suggest "deed" when relevant documents are about "intellectual property rights"). If a cluster agent makes use of citation linkages, it should be aware of the anomalies caused by self-citation, and only include the closest works of a given author, as opposed to all those with strong bibliographic coupling. Likewise, if a search agent is aware that a user can only read a few languages and has no interest in publications written in other tongues (that have no translations), it should avoid returning results in those other languages, which will almost certainly be discarded.
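The "AT&T" case can be made concrete with a short Python sketch that contrasts uncoordinated tokenization with a version that recognizes protected surface forms before stop words are removed. The stop list (which includes the single character "t" to mimic the discarding of stray characters) and the table of protected forms are tiny illustrative assumptions. A real system would draw both from curated resources.

```python
import re
from typing import List, Set

# Illustrative assumptions: a tiny stop list and one protected surface form.
STOP_WORDS: Set[str] = {"a", "an", "at", "the", "of", "and", "t"}
PROTECTED = {"at&t": "AT&T"}

# Match protected forms first, then ordinary runs of letters and digits.
_TOKEN_RE = re.compile(
    "|".join(re.escape(form) for form in PROTECTED) + r"|[a-z0-9]+"
)


def naive_tokens(text: str) -> List[str]:
    """Split on non-alphanumerics, then drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def coordinated_tokens(text: str) -> List[str]:
    """Recognize protected forms before stop-word removal can touch them."""
    tokens = []
    for match in _TOKEN_RE.findall(text.lower()):
        if match in PROTECTED:
            tokens.append(PROTECTED[match])
        elif match not in STOP_WORDS:
            tokens.append(match)
    return tokens
```

For the input "The AT&T labs at Murray Hill", the naive version loses the company name entirely, while the coordinated version keeps "AT&T" as a single token and still drops the ordinary stop word "at".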

Second, "intelligence" is manifested by tools that learn about users and their actions. Thus, at the interface level, expert users should be rewarded for their devoted use of the system by having an agent suggest completion not only of commands but also of command sequences, as well as offer to repeat complex analyses on new samples of data. Similarly, database management and visualization agents should offer to help with presentation and exploration of data sets similar to those previously studied by the user. Search agents should become familiar with which of the myriad information resources seem to be of greatest value for particular types of explorations, focus their attention on working with those collections, and give priority in "fusing" result sets to those "sources" that are likely to be most productive. Cluster agents should determine which collections seem best to satisfy the "cluster hypothesis" (El-Hamdouchi & Willett, 1987), and return a group of nearest neighbors from those, while returning single matches from other collections that seem too heterogeneous to yield good clusters. Caching agents should tune the contents of personal and work group disk caches to maximize the likelihood that works that are repeatedly used by an individual, or which will be accessed by a number of people in the same department, are saved. At the same time, caching agents must learn from author agents when a work is likely to be changed, so that old versions are not inadvertently reused.
View agents should discover and present search results considering the types and levels of work of interest to each (class of) user, e.g., students who prefer survey articles, scholars who prefer leading-edge findings, educators who prefer overviews or works with numerous examples or graphic aids, historians who like timelines or analyses of important events, mathematicians who prefer formal treatments, human factors experts who focus on experimental investigations, social scientists who prefer case studies, or engineers who focus on failure analysis. Publisher agents should report back to information providers any trends regarding how their publications are perceived, what areas new publications should be started in, what old publications should be eliminated or re-oriented, and what user communities might like their services. These agents are representative of the broader class of "trend agents" that should observe patterns of usage over time, reflected in: changes in citation linkages, numbers of publications in various topical areas, selection preferences of users from among search result sets, transition probability shifts among hypertext links going out from popular browsing "nodes" (e.g., WWW pages), and what works are the targets of new links being added to the digital library. Special agents of this type might look for violations of intellectual property rights, unlawful copying, inappropriate derivative works, and instances of plagiarism. Coaching agents might advise users of better ways to carry out their tasks, to improve efficiency and/or effectiveness. In some cases these might play the role of tutor, pointing out hypertext trails of use to others who have investigated the same works. In other cases these might help a user construct a focused bibliography and usage-optimized hyperbase with the right links in the right places, essentially building an adaptive interface. 
For various groups of users a similar agent might construct one of the following: an annotated bibliography, an electronic newspaper or magazine, a searchable collection integrated with a multi-level table of contents, or an electronic version of a short or semester-long course.
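One of the signals listed above for trend agents, transition probability shifts among the hypertext links leaving a popular node, can be estimated from traversal logs. The following Python sketch assumes that each log entry is a (source, target) pair and compares two observation periods. The node names and log contents are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Link = Tuple[str, str]  # (source node, target node)


def transition_probs(clicks: Iterable[Link]) -> Dict[Link, float]:
    """Estimate P(target | source) from observed link traversals."""
    clicks = list(clicks)
    pair_counts = Counter(clicks)
    source_counts = Counter(src for src, _ in clicks)
    return {
        (src, dst): n / source_counts[src]
        for (src, dst), n in pair_counts.items()
    }


def shifts(old: Dict[Link, float], new: Dict[Link, float]) -> Dict[Link, float]:
    """Change in probability for every link seen in either period."""
    links = set(old) | set(new)
    return {link: new.get(link, 0.0) - old.get(link, 0.0) for link in links}


# Hypothetical traversal logs for two periods from one popular page.
last_month = [("home", "poems")] * 3 + [("home", "letters")] * 1
this_month = [("home", "poems")] * 1 + [("home", "letters")] * 3
delta = shifts(transition_probs(last_month), transition_probs(this_month))
```

In this example the link to "letters" gains half of the outgoing traffic that "poems" loses. A trend agent could report a shift like this to a publisher agent.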

6. Realistic Digital Library Support for the Humanities

While many of the above-mentioned brief scenarios are of general applicability, some are of particular value for the humanities. Further, some of the scenarios could occur today with existing support tools, others might be possible with minor development, and some will not be possible unless major breakthroughs occur in research labs. Let us consider how to sort out these various cases.

First, there are scenarios related to representation, a key aspect of computing, and an important area of study in artificial intelligence. Information cataloging and catalog construction are clearly well advanced and, as new initiatives dealing with "handles" and metadata progress, are likely to be quickly integrated into emerging digital libraries. More to the point, though, we are moving rapidly into a period where SGML will be widely understood, but we still lack many of the important tools needed for authors, those converting old works, those specifying new DTDs, and those managing large corpora. Humanists have an important role to play in cases like development of ETD archives, since the next generation of scholars could quickly become knowledgeable contributors if required to prepare an ETD instead of or in addition to a paper thesis. Unfortunately, there are strong trends toward lowest common denominator solutions, like page representations (e.g., bitmaps or PDF) or presentation-oriented markup (e.g., WYSIWYG or even some applications of HTML). Luckily, movements like "writing across the curriculum" have an exciting opportunity to teach students about descriptive markup and hypertext authoring, so that more flexible and powerful representations of works are created directly by authors. This will save the expense and errors that come from downstream efforts to translate or convert old works into a digital library. Getting back to our scenarios, it is important that tools (e.g., agents) be developed to support authoring. The ETD scenario is one that deserves funding and widespread implementation, perhaps first in the humanities areas.

Second, there are scenarios related to searching. Many of these would be possible in the near term if a moderate amount of development were to take place. In particular, retrieval systems must be redesigned in more modular fashion and made to operate in a distributed environment. This will call for in-depth design work, adaptation of formats and languages like KIF and KQML (Genesereth & Ketchpel, 1994), and close coordination between research groups and providers of commercially available digital library and search systems. Some simple electronic librarians will therefore be built in the next several years, but ones that begin to provide the support of a good librarian are still years away.

Third, there are scenarios related to interfaces. While simple improvements are possible in the near term, most of the scenarios discussed will only be possible after a number of years of research and development. Good interface development methods and tools will be required, and careful formative and summative evaluation studies will be essential.

Fourth, there are scenarios related to tool support. Some of these are possible in the next three years, at least at the level of rough coupling of existing tools with emerging digital libraries. Better integration is needed, however, before significant impact is felt on large user communities. It is likely that this will require a moderate amount of research and development, and really only achieve full fruition when coupled with better interfaces.

Fifth, there are scenarios related to user modeling and learning. Some of the obvious improvements will no doubt take place in the near term. Others will take a few more years, being tied to interface development and testing. A great deal of research is required for the more sophisticated agents that work with views, personalization, special presentations, and support of various kinds of exploration and analysis.

Finally, there are very difficult efforts required in all scenarios that deal with: natural language understanding, word sense disambiguation of large corpora, or complex interaction among diverse communities of agents. We must make a start in these areas, and may achieve breakthroughs within a decade, but these efforts will require either large new initiatives or some novel type of collaborative effort among groups that currently are exploring a variety of different approaches.

7. Future Prospects

In the case of small traditional libraries, expert librarians can be very helpful in serving homogeneous user communities. In the case of large traditional libraries --- those with significant holdings, many electronic resources, and connections to a wide variety of outside services --- providing good support for knowledge workers is very expensive and very difficult.

With digital libraries today, we are not even at the point of providing good support for small collections and homogeneous user groups. The first complete commercial digital library systems are just emerging. If they have flexible architectures, and if they are carefully designed to eventually support the scenarios discussed above, they will evolve within a decade to provide good support for interested user communities, and provide new capabilities in terms of: hypermedia; integration of a variety of tools for search / visualization / analysis / reuse; limited personalization; and more usable interfaces.

Efforts toward digital libraries for the humanities should be coupled with broader digital library initiatives, such as the development of national libraries, the support of efforts toward electronic theses and dissertations, and the construction of general purpose tools for searching, browsing, authoring, markup, and interface development. In this way, the significant contributions from humanities computing (e.g., construction of corpora, improvements in lexicography, preparation of TEI guidelines) will continue to have a major impact on the unfolding of the worldwide (virtual) digital library of the future.

REFERENCES

(Belkin et al., 1983) Nicholas J. Belkin, T. Seeger, and G. Wersig. "Distributed Expert Problem Treatment as a Model for Information System Analysis and Design." Journal of Information Science, 5:153-167, 1983.

(Belkin et al., 1987a) N. J. Belkin, H. M. Brooks, and P. J. Daniels. "Knowledge Elicitation Using Discourse Analysis." International Journal of Man-Machine Studies, 25, 1987.

(Belkin et al., 1987b) N. J. Belkin et al. "Distributed Expert-Based Information Systems: An Interdisciplinary Approach." Information Processing & Management, 23(5):395-409, 1987.

(Belkin et al., 1995) N. J. Belkin, P. Kantor, E. A. Fox and J. A. Shaw. "Combining the Evidence of Multiple Query Representations for Information Retrieval." Information Processing & Management, 31(3):431-448, May-June 1995.

(Carroll & Rosson, 1992) J. M. Carroll and M. B. Rosson. "Getting Around the Task-Artifact Cycle: How to Make Claims and Design by Scenario." ACM Trans. on Information Systems, 10(2):181-212, 1992.

(Cutting et al., 1992) Douglas R. Cutting et al. "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections." Proc. of the 15th Annual Int. Conf. on R&D in IR, SIGIR'92, Copenhagen, 318-329, 1992.

(El-Hamdouchi & Willett, 1987) A. El-Hamdouchi and P. Willett. "Techniques for the Measurement of Clustering Tendency in Document Retrieval Systems." J. Information Science, 13:361-365, 1987.

(Fox, 1994) E. Fox. "How to make intelligent digital libraries." In Methodologies for Intelligent Systems, Proceedings of the 8th International Symposium, ISMIS'94, Charlotte, NC, Oct. 1994. Lecture Notes in Artificial Intelligence 869, Springer-Verlag, Berlin, 27-38.

(Genesereth & Ketchpel, 1994) Michael R. Genesereth and Steven P. Ketchpel. "Software Agents." Communications of the ACM, 37(7):48-53, July 1994.

(Gladney et al., 1996) H. Gladney, Z. Ahmed, R. Ashany, N. Belkin, E. Fox and M. Zemankova. "Digital Library: Gross Structure and Requirements (Report from a Workshop)." In press for Electronic Publishing - Origination, Dissemination and Design Journal, 1996.

(Ide, 1989) Nancy Ide. "Meaning and Method: Computer-assisted Analysis of Blake." In Literary Computing and Literary Criticism: Theoretical and Practical Essays, ed. R. Potter, Univ. of Penn. Press, 123-144, 1989.

(Ide & Veronis, 1995) Nancy Ide and Jean Veronis. "Encoding Dictionaries." Computers and the Humanities, 29(2): 167-179, 1995.

(Maurer, 1996) Hermann Maurer, ed. Power to the Web! The Official Guide to Hyper-G. Addison-Wesley, 1996.

(Prasad et al., 1996) M. V. Nagendra Prasad, Victor R. Lesser, and Susan Lander. "Retrieval and Reasoning in Distributed Case Bases." Accepted for publication in J. of Visual Communication and Image Representation, 1996.

(Siochi & Ehrich, 1991) Antonio C. Siochi and Roger W. Ehrich. "Computer Analysis of User Interfaces based on Repetition in Transcripts of User Sessions." ACM Trans. on Information Systems, 9(4): 309-335, 1991.

(Sperberg-McQueen & Burnard, 1994) C. M. Sperberg-McQueen and Lou Burnard, eds. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Text Encoding Initiative, Chicago, 1994.

(Viles & French, 1995) Charles L. Viles and James C. French. "Dissemination of Collection Wide Information in a Distributed Information Retrieval System." Proc. of the 18th Annual Int. ACM SIGIR Conf. on R&D in IR, SIGIR'95, Seattle, 12-20, 1995.