|
COPYRIGHT 2001 Institute for Operations Research and the Management Sciences
Retrieving information from heterogeneous database systems involves a complex process and remains a challenging research area. We propose a cognitively guided approach for developing an information-retrieval agent that takes the user's information request, identifies relevant information sources, and generates a multidatabase access plan. Our work is distinctive in that the agent design is based on an empirical study of how human experts retrieve information from multiple, heterogeneous database systems. To improve on empirically observed information-retrieval capabilities, the design incorporates mathematical models and algorithmic components. These components optimize the set of information sources that need to be considered to respond to a user query and are used to develop efficient multidatabase-access plans. This agent design, which integrates cognitive and mathematical models, has been implemented using Soar, a knowledge-based architecture.
(Global Query Optimization; Heterogeneous Database Systems; Information Integration; Maximal Objects; Soar-Based AI Systems; Universal Relational Model)
1. Introduction
The Internet has made the multidatabase problem (i.e., retrieving data from multiple databases) more relevant and challenging. Retrieving information from multiple databases generally involves complexities that do not exist when retrieving information from a single database (Kim et al. 1991). First, query formulation requires knowledge of multiple databases to identify the database sources required to answer the query. Second, data-access planning (also referred to as data-access path selection in the database literature) in a multidatabase environment is more complex than in a single database because the order in which information is accessed from multiple component databases can have critical performance implications.
Building on the mediator-based approach (Wiederhold 1992), we propose novel solutions at the logical design level to these two key problems encountered in the development of information-retrieval agents. (1) In contrast to previous work on this topic, our proposals are based on an empirical model of how human experts retrieve information from multiple heterogeneous databases. However, for a variety of well-known reasons, such as bounded rationality, human experts do not display "optimal" problem-solving behavior (Simon 1955). To deal with this limitation, we develop mathematical optimization algorithms and incorporate them into the human process model of information retrieval. To set the context for the rest of the paper, we begin with an example that illustrates the capabilities we desire in our information retrieval agent, InfoB (short for Information Broker).
Consider a set of autonomous relational databases, MillionDollarDirectory, Manufacturers, Distributors, and Bank, whose schemata are shown in Figure 1. For convenience, we refer to this set of component databases as the "commerce-component databases" as they can be used to support commercial transactions.
To find all computer manufacturers whose gross sales are greater than 100 million, the user formulates a query that retrieves "Large Computer Manufacturers" as shown in Figure 2. This query is submitted to InfoB.
Note that the SQL query does not contain the FROM clause. The user, therefore, is not required to be aware of the logical structure of the component databases. InfoB processes the query and solves the database source identification problem--namely identification of the databases and the specific data objects that need to be accessed to answer the query. For the query in Figure 2, InfoB identifies the MillionDollarDirectory and Manufacturers databases and the tables in these databases as relevant, and formulates the query shown in Figure 3.
We refer to this query as a global query because it references data objects in more than one database. For this reason, it cannot be evaluated directly in a single conventional database. To evaluate the global query, InfoB develops an access plan. It decomposes the global query into subqueries that can be evaluated on each of the relevant databases. It then determines the order in which they should be evaluated and how their results should be combined to respond to the global query. These subqueries are shown in Figure 4. InfoB returns the intersection of the set of company names and manufacturer names determined by these queries as the response to the user query in Figure 2.
Several questions need to be answered to endow InfoB with the capabilities illustrated in this example.
1. What knowledge should InfoB have about the component databases for which InfoB serves as a retrieval mediator, and how should this knowledge be structured? How should the different types of knowledge be organized to architect and implement InfoB?
2. Given this knowledge, how should InfoB determine the minimal set of data sources required to answer a given user query correctly?
3. How should InfoB evaluate global queries on these identified data sources to answer the user query?
We use a combination of methods to answer these questions:
* (Question 1) We use the results of an empirical study of information retrieval by human experts to identify the different types of knowledge required by InfoB. This includes metaknowledge about the component databases as well as process knowledge about the key steps involved in answering a user query. The representation scheme used to encode metaknowledge about component databases is derived from the universal relational model (Ullman 1989, Zhao et al. 1995). The process knowledge is encoded using the problem space model provided in Soar (Laird et al. 1983, Newell 1990).
* (Question 2) To improve the efficiency of data-source identification, we develop mathematical models and algorithmic components to determine the minimal set of data sources that are required to answer a user query correctly and efficiently. The models employ constructs derived from the Universal Relation Model, such as the maximal object and global hypergraph, and take into account the cost differences between interdatabase and intradatabase queries (Lu et al. 1992).
* (Question 3) The evaluation of a global query involves the formulation of an access plan. We propose a knowledge-based approach to query evaluation, based on observed human behavior, and we implement it in the Soar development environment.
The rest of the paper is structured as follows: [section] 2 provides an overview of several preliminary concepts including database-integration methodologies, the Universal Relation Model, our protocol analysis of human experts, and the InfoB problem space model. Section 3 discusses the knowledge architecture underlying InfoB. Section 4 describes the global universal relation used to represent the data sources in InfoB. Section 5 presents the mathematical models for the data-source identification problem. Section 6 overviews the state-space approach to access planning for global queries. Section 7 presents our work on validating InfoB through a comparison of its results with the results observed of human experts. Section 8 discusses the contributions and limitations of our study and outlines future research directions.
2. Preliminaries
2.1. A Brief Literature Review
There has been considerable interest, in both the practitioner and research communities, in addressing the problems encountered in retrieving information from multiple heterogeneous databases. Recent work in this area has focused primarily on problems related to schematic database heterogeneity (Singh 1998) and can be characterized into two broad streams: tight coupling and loose coupling approaches.
The tight coupling approach relies on integrated-database schemata. When the multidatabase system is created, a global/integrated schema or a federation of shared schemata are generated by the systems integrator (Sheth and Larson 1990, Zhu and Larson 1996). Examples of such systems include Multibase, Mermaid, Adds, Dataplex, Ingres/Star, Pegasus, WIND, and MDM (Ahmed et al. 1991, Bright et al. 1992, Mostardi and Siciliano 1994, Atzeni and Torlone 1997).
The advantage of the tight coupling approach is that it allows users to transparently access heterogeneous, autonomous, and distributed databases as they would a single database. However, implementing such an approach can be costly. The integrated schema must be redone each time there is a change in the schema of an existing component database or when a new database is added to the accessible pool. This makes the maintenance of the global schema very difficult. In an environment as dynamic as the Internet, it is virtually impossible to maintain such a global schema (Duschka and Genesereth 1997).
In contrast to the tight coupling approach, the loose coupling approach leaves the task of integration to the multidatabase users. Systems adopting this approach aim to provide tools to facilitate the integration task to be carried out by the user. Examples of such tools include the MultiDatabase manipulation language MDSL (Litwin and Abdellatif 1987), ALCHEMIST, an object-oriented tool to build transformations among heterogeneous-database representations (Henry and Lind 1994), and the InfoSleuth project (Bayardo et al. 1997), a system for finding and integrating information in corporate and external networks. Although this approach is very flexible in updating the component databases, it is essentially impossible to find users with all the knowledge required to perform the query formulation task when the number of component databases is large and varying.
The mediation/knowledge-based approach (Jeff and Tenenbaum 1991, Wiederhold 1992) combines elements of both the tight and loose coupling approaches. The basic idea is to replace the global schema with a knowledgeable mediator, primarily to support information retrieval. A mediator or agent in this context is defined as "a software module that exploits encoded knowledge about a certain set of databases to create information for a higher layer of applications" (Wiederhold 1992). It relieves the user from having to deal with database heterogeneity. The user interacts with an agent and expresses information requests in the agent's language (e.g., modified version of SQL). As illustrated in our motivating example, this only requires domain knowledge (Arens et al. 1993, Kashyap and Sheth 1994). Recently, the mediator-based approach has been applied to address the problem of accessing information sources that contain structured as well as unstructured data (Garcia-Molina et al. 1995, Hammer et al. 1997, Naacke et al. 1998, Vidal et al. 1998).
The applicability of the three approaches described above is determined by the size of the database pool, the extent of heterogeneity, the stability of the databases, and the level of sophistication of the users. In general, the tight coupling approach can be used in an environment where...
Read the full article for free courtesy of your local library.
|