AccessMyLibrary provides FREE access to millions of articles from top publications available through your library.
Create a link to this page
Copy and paste this link tag into your Web page or blog:
1. INTRODUCTION
1.1 Background and Motivation
Researchers and practitioners have been interested in distributed database systems since the 1970s. At that time, the main focus was on supporting distributed data management for large corporations and organizations that kept their data at different offices or subsidiaries. Although there was a clear need and many good ideas and prototypes (e.g., System R* [Williams et al. 1981], SDD-1 [Bernstein et al. 1981], and Distributed Ingres [Stonebraker 1985]), the early efforts in building distributed database systems were never commercially successful [Stonebraker 1994]. In some aspects, the early distributed database systems were ahead of their time. First, communication technology was not stable enough to ship megabytes of data as required for these systems. Second, large businesses somehow managed to survive without sophisticated distributed database technology by sending tapes, diskettes, or just paper to exchange data between their offices.
Today, the situation has changed dramatically. Distributed data processing is both feasible and needed. Almost all major database system vendors offer products to support distributed data processing (e.g., IBM, Informix, Microsoft, Oracle, Sybase), and large database application systems have a distributed architecture (e.g., business application systems such as Baan IV, Oracle Finance, Peoplesoft 7.5, and SAP R/3). Distributed data processing is feasible because of recent technological advances (e.g., hardware, software protocols, standards). Distributed data processing is needed because of changing business requirements, which have made distributed data processing cost-effective and in certain situations the only viable option. Specifically, businesses are beginning to rely on distributed rather than centralized databases for the following reasons:
1. Cost and scalability. Today, one thousand PC processors are cheaper and significantly more powerful than one big mainframe computer So, it makes economic sense to replace a mainframe by a network of small, off-the-shelf processors. Furthermore, it is very difficult to "up-size" a mainframe computer if a company grows, while new PCs can be added to the network at any time in order to meet a company's new requirements. High availability can be achieved by mirroring (replicating) data.
2. Integration of different software modules. It has become clear that no single software package can meet all the requirements of a company. Companies must, therefore, install several different packages, each potentially with its own database, and the result is a distributed database system. Even single software packages offered by one vendor have a distributed, component-based architecture so that the vendor can market and offer upgrades for every component individually.
3. Integration of legacy systems. The integration of legacy systems is one particular example that demonstrates how some companies are forced to rely on distributed data processing in which their old legacy systems need to coexist with new modern systems.
4. New applications. There are a number of new emerging applications that rely heavily on distributed database technology; examples are workflow management, computer-supported collaborative work, tele-conferencing, and electronic commerce.
5. Market forces. Many companies are forced to reorganize their businesses and use state-of-the-art distributed information technology in order to remain competitive. As an example, people will probably not eat more Pizza because of the Internet, but a Pizza delivery service is definitely going to lose some of its market share if it does not allow people to order Pizza on the Web.
This list shows that there are many different reasons to rely on distributed architectures and correspondingly many different kinds of distributed systems exist. Sometimes it is only the software and not the hardware that is distributed. The purpose of this paper is to give a comprehensive overview of what query processing techniques are needed to implement any kind of distributed database and information system. It is assumed that users and application programs issue queries using a declarative query language such as SQL [Melton and Simon 1993] or OQL [Cattell et al. 1997] and without knowing where and in which format the data is stored in the distributed system. The goal is to execute such queries as efficiently as possible in order to minimize the time that users must wait for answers or the time application programs are delayed. To this end, we will discuss a series of techniques that are particularly effective to execute queries in today's distributed systems. For example, we will describe the design of a query optimizer that compiles a query for execution and determines the best possible way among many alternative ways to execute a query. We will also show how techniques such as caching and replication can be used to improve the performance of queries in a distributed environment. Furthermore, we will cover specific query processing techniques for client-server, middleware (multitier), and heterogeneous database and information systems, which represent architectures that are frequently found in practice.
1.2 Scope of this Paper and Related Surveys
A very large body of work in the general area of database systems exists. All this work can be roughly classified into work on architectures and techniques for transaction processing (i.e., quickly processing small update operations), work on query processing (i.e., mostly read operations that explore large amounts of data), and work on data models, languages, and user interfaces for advanced applications. In this paper, we will focus primarily on query processing. A discussion of transaction processing and of alternative data models is beyond the scope of this paper. Transaction processing has been thoroughly investigated in, for example, Gray and Reuter [1993]. Work on data models (relational, deductive, object-oriented, and semistructured) is described in Ullman [1988], Cattell et al. [1997], Abiteboul [1997], and Buneman [1997]. Also, we will assume that the reader is familiar with basic database system concepts, SQL, and the relational data model. Good introductory textbooks are Silberschatz et al. [1997] and Ramakrishnan [1997].
This paper will not even be able to give a full coverage of all query processing techniques used today; in particular, a number of query processing techniques for the World Wide Web are not discussed. For instance, we will not present the architecture of search engines such as AltaVista. Furthermore, there have been several proposals to manage Web sites and query a network of Web pages; see Florescu et al. [1998] for a survey. In addition, several proposals to manage and query XML data exist (e.g., McHugh and Widom [1999], Abiteboul et al. [1999], and Florescu et al. [1999]). Instead of going into the details of all these techniques, the focus of this paper is on fundamental mechanisms to process queries that involve data from several sites. We will, therefore, concentrate on structured data (such as that found in relational or object-oriented databases) and on query languages for structured data (such as SQL or OQL). Nevertheless, the techniques described in this paper are also relevant to process other kinds of data in a distributed environment.
A parallel database system is a particular type of distributed system. Distributed and parallel database systems share several properties and goals--in particular, if the parallel system has a so-called "shared-nothing" architecture [Stonebraker 1986]. The purpose of a parallel database system is to improve transaction and query response times, and the availability of the system for centralized applications. Parallel systems, therefore, emphasize the cost/scalability arguments described above, while the distributed systems discussed in this paper often address issues such as the heterogeneity of components. While some query processing techniques are useful for both kinds of systems, researchers in both areas have developed special-purpose techniques for their particular environment. In this paper, we will concentrate on the techniques that are of interest for distributed database systems, and will not discuss techniques which are specifically used in parallel database systems (e.g., special parallel join methods, repartitioning of data during query execution, etc.). An excellent overview on parallel database systems is given in DeWitt and Gray [1992].
In terms of related work, there have been several surveys on distributed query processing; for example, a paper by Yu and Chang [1984] and parts of the books by Ceri and Pelagatti [1984], Ozsu and Valduriez [1999], and Yu and Meng [1997] are devoted to distributed query processing. These surveys, however, are mostly focused on the presentation of the techniques used in the early prototypes of the 1970 and 1980. While there is some overlap, most of the material presented in this paper is not covered in those articles and books simply because the underlying technology and business requirements have significantly changed in the last few years.
1.3 Organization of this Paper
This paper is organized as follows:
* Section 2. presents the textbook architecture for query processing and a series of basic query execution techniques that are useful for all kinds of distributed database systems
* Section 3. takes a closer look at query processing for one particular and very important class of distributed database systems: client-server database systems
* Section 4. deals with the query processing issues that arise in heterogeneous database systems, that is, systems that are composed of several autonomous component databases with different schemas, varying query processing capabilities, and application programming interfaces (APIs)
* Section 5. shows how data placement (i.e., replication and caching) and query processing interact and shows how data can dynamically and automatically be distributed in a system in order to achieve good performance
* Section 6. describes other emerging and promising architectures for distributed data processing; specifically, this section gives an overview of economic models for distributed query processing and dissemination-based information systems
* Section 7. contains conclusions and summarizes open problems for future research.
2. DISTRIBUTED QUERY PROCESSING: BASIC APPROACH AND TECHNIQUES
In this section, we will describe the "textbook" architecture for query processing and present a series of specific query processing techniques for distributed database and information systems. These techniques include alternative ways to ship data from one site to one or several other sites, implement joins, and carry out certain kinds of queries in a distributed environment. The purpose of this section is to give an overview of basic mechanisms that can be used in any kind of distributed database system. In Sections 3. and 4., we will discuss the techniques that are particularly useful for certain classes of distributed database systems (i.e., client-server and heterogeneous database systems).
2.1 Architecture of a Query Processor
Figure 1 shows the classic "textbook" architecture for query processing. This architecture was used, for example, in IBM's Starburst project [Haas et al. 1989]. This architecture can be used for any kind of database system including centralized, distributed, or parallel systems. The query processor receives an SQL (or OQL) query as input, translates and optimizes this query in several phases into an executable query plan, and executes the plan in order to obtain the results of the query. If the query is an interactive ad hoc query (dynamic SQL), the plan is directly executed by the query execution engine and the results are presented to the user. If the query is a canned query that is part of an application program (embedded SQL), the plan is stored in the database and executed by the query execution engine every time the application program is executed [Chamberlin et al. 1981]. Below is a brief description of each component of the query processor.
[ILLUSTRATION OMITTED]
Parser. In the first phase, the query is parsed and translated into an internal representation (e.g., a query graph [Jenq et al. 1990; Pirahesh et al. 1992]) that can be easily processed by the later phases. The development of parsers is well understood [Aho et al. 1987], and tools like flex and bison can be used for the construction of SQL or OQL parsers just as for most other programming languages. The same parser can be used for a centralized and distributed database system.
Query Rewrite. Query rewrite transforms a query in order to carry out optimizations that are good regardless of the physical state of the system (e.g., the size of tables, presence of indices, locations of copies of tables, speed of machines, etc.) [Pirahesh et al. 1992]. Typical transformations are the elimination of redundant predicates, simplification of expressions, and unnesting of subqueries and views. In a distributed system, query rewrite also selects the partitions of a table that must be considered to answer a query [Ceri and Pelagatti 1984; Ozsu and Valduriez 1999]. Query rewrite is carried out by a sophisticated rule engine [Pirahesh et al. 1992].
Query Optimizer. This component carries out optimizations that depend on the physical state of the system. The optimizer decides which indices to use to execute a query, which methods (e.g., hashing or sorting) to use to execute the operations of a query (e.g., joins and group-bys), and in which order to execute the operations of a query. The query optimizer also decides how much main memory to allocate for the execution of each operation. In a distributed system, the optimizer must also decide at which site each operation is to be executed. To make these decisions, the optimizer enumerates alternative plans (described below) and chooses the best plan using a cost estimation model. Almost all commercial query optimizers are based on dynamic programming in order to enumerate plans efficiently. Dynamic programming and considerations for cost estimation in a distributed system are described in more detail in Section 2.2.
Plan. A plan specifies precisely how the query is to be executed. Probably every database system represents plans in the same way: as trees. The nodes of a plan are operators, and every operator carries out one particular operation (e.g., join, group-by, sort, scan, etc.). The nodes of a plan are annotated, indicating, for instance, where the operator is to be carried out. The edges of a plan represent consumer-producer relationships of operators. Figure 2 shows an example plan for a query that involves Tables A and B. The plan specifies that Table A is read at Site 1 using an index (the idxscan(A) operator), B is read at Site 2 without an index (the scan(B) operator), A and B are shipped to Site 0 (the send and receive operators), B is materialized and reread at Site 0 (the temp and scan operators), and finally, A and B are joined at Site 0 using a nested-loop join method (the NLJ operator). The send and receive operators encapsulate all the communication activity so that all other operators (e.g., NLJ or scan) can be implemented and used in the same way as in a centralized database system.
[ILLUSTRATION OMITTED]
Plan Refinement / Code Generation. This component transforms the plan produced by the optimizer into an executable plan. In System R, for example, this transformation involves the generation of an assembler-like code to evaluate expressions and predicates efficiently [Lorie and Wade 1979]. In some systems, plan refinement also involves carrying out simple optimizations which are not carried out by the query optimizer in order to simplify the implementation of the query optimizer.
Query Execution Engine. This component provides generic implementations for every operator (e.g., send, scan, or NLJ). All state-of-the-art query execution engines are based on an iterator model [Graefe 1993]. In such a model, operators are implemented as iterators and all iterators have the same interface. As a result, any two iterators can be plugged together (as specified by the consumer-producer relationship of a plan), and thus, any plan can be executed. Another advantage of the iterator model is that it supports the pipelining of results from one operator to another in order to achieve good performance.
Catalog. The catalog stores all the information needed in order to parse, rewrite, and optimize a query. It maintains the schema of the database (i.e., definitions of tables, views, user-defined types and functions, integrity constraints, etc.), the partitioning schema (i.e., information about what global tables have been partitioned and how they can be reconstructed), and physical information such as the location of copies of partitions of tables, information about indices, and statistics that are used to estimate the cost of a plan. In most relational database systems, the catalog information is stored like all other data in tables. In a distributed database system, the question of where to store the catalog arises. The simplest approach is to store the catalog at one central site. In wide-area networks, it makes sense to replicate the catalog at several sites in order to reduce communication costs. It is also possible to cache catalog information at sites in a wide-area network [Williams et al. 1981]. Both replication and caching of catalog information are very effective because catalogs are usually quite small (hundreds of kilobytes rather than gigabytes) and catalog information is rarely updated in most environments. In certain environments, however, the catalog can become very large and be frequently updated. In such environments, it makes sense to partition the catalog data and store catalog data where it is most needed. For example, catalogs of distributed object databases need to know where copies of all the objects (potentially millions) are stored, and they need to update this information every time an object is migrated or replicated. Such catalogs can be implemented in a hierarchical way as described in Eickler et al. [1997].
It should be noted that the architecture shown in Figure 1 and described in this subsection is not the only possible way to process queries. There is no such thing as a perfect query processor. An alternative architecture has, for example, been developed by Graefe and others as part of the Exodus, Volcano, and Cascades projects [Graefe 1995; Graefe and McKenna 1993; Graefe and DeWitt 1987], and is used in several commercial database products (e.g., Microsoft SQL Server 7.0). In that architecture, query rewrite and query optimization are carried out in one phase. Furthermore, there have been proposals to optimize a set of queries rather than individual queries [Sellis 1988]. The advantage of such an approach is that common subexpressions (e.g., joins) that are part of several queries need only be carried out once for the whole set of queries.
2.2 Query Optimization
We now turn to a description of techniques that can be used to implement the query optimizer of a distributed database system. We will first describe the most popular enumeration algorithm for query optimization. After that, we will describe two cost models that can be used to estimate the cost of a plan.
2.2.1 Plan Enumeration with Dynamic Programming. A large number of alternative enumeration algorithms have been proposed in the literature; Steinbrunn et al. [1997] contains a good overview, and Kossmann and Stocker [2000] evaluate the most important algorithms for distributed database systems. In the following, dynamic programming is described. This algorithm is used in almost all commercial database products, and it was pioneered in IBM's System R project [Selinger et al. 1979]. The advantage of dynamic programming is that it produces the best possible plans if the cost model is sufficiently accurate. The disadvantage of this algorithm is that it has exponential time and space complexity so that it is not viable for complex queries; in particular, in a distributed system, the complexity of dynamic programming is prohibitive for many queries. An extension of the dynamic programming algorithm is known as iterative dynamic programming. This extended algorithm is adaptive and produces as good plans as basic dynamic programming for simple queries and "as good as possible plans" for complex queries for which dynamic programming is not viable. We do not describe this extended algorithm in this paper and refer the interested reader to Kossmann and Stocker [2000].
The basic dynamic programming algorithm for query optimization is shown in Figure 3. It works in a bottom-up way by building more complex (sub-) plans from simpler (sub-) plans. In the first step, the algorithm builds an access plan for every table involved in the query (Lines 1 to 4 of Figure 3). If Table A, for instance, is replicated at sites [S.sub.1] and [S.sub.2], the algorithm would enumerate scan(A, [S.sub.1]) and scan(A, [S.sub.2]) as alternative access plans for Table A. Then, the algorithm enumerates all two-way join plans using the access plans as building blocks (Lines 5 to 13). Again, the algorithm would enumerate alternative join plans for all relevant sites, that is, consider carrying out joins with A at [S.sub.1] and [S.sub.2]. Next, the algorithm builds three-way join plans, using access-plans and two-way join plans as building blocks. The algorithm continues in this way until it has enumerated all n-way join plans which are complete plans for the query, if the query involves n tables.
Fig. 3. Dynamic programming algorithm for query optimization.
Input: SPJ query q on relations [R.sub.1], ..., [R.sub.n]
Output: A query plan for q
1: for i= 1 to n do {
2: optPlan({[R.sub.i]}) = accessPlans([R.sub.i])
3: prunePlans(optPlan({[R.sub.i]}))
4: }
5: for i - 2 to n do {
6: for all S [subset or equal to] {[R.sub.1], ..., [R.sub.n]}
such that |S| = i do {
7: optPlan(S) = 0
8: for all O [subset] S do {
9: optPlan(S) = optPlan(S) [union] joinPlans(optPlan(O),
optPlan(S = 0))
10: prunePlans(optPlan(S))
11: }
12: }
13: }
14: return optPlan({[R.sub.1], ..., [R.sub.n]})
The beauty of the dynamic programming algorithm is that inferior plans are discarded (i.e., pruned) as early as possible (Lines 3 and 10). A plan can be pruned if an alternative plan exists that does the same or more work at a lower cost. Dynamic programming, for example, would enumerate A ?? B and B ?? A as two alternative plans to execute this join, but only the cheaper of the two plans would be kept in the optPlan(A, B) structure after pruning. Pruning significantly reduces the complexity of query optimization; the earlier inferior plans are pruned, the better because more complex plans are not constructed from such inferior plans.
In a distributed system, neither scan(A, [S.sub.1]) nor scan(A, [S.sub.2]) may be immediately pruned in order to guarantee that the optimizer finds a good plan. Both plans do the same work, but they produce their results at different sites. Even if scan(A, [S.sub.1]) is cheaper than scan(A, [S.sub.2]), scan(A, [S.sub.2]) must be kept because it might be a building block of the overall best plan if, for instance, the query results are to be presented at [S.sub.2]. Only if the cost of scan(A, [S.sub.1]) plus the cost of shipping A from [S.sub.1] to [S.sub.2] is lower than the cost of scan(A, [S.sub.2]), scan(A, [S.sub.2]) is pruned. In general, a plan [P.sub.1] may be pruned if there exists a plan [P.sub.2] that does the same or more work and the following criterion holds:
(1) [inverted]A [element of] interesting_sites([P.sub.1]):cost (ship( [P.sub.1] , i )) [is greater than or equal to] cost (ship([P.sub.2], i))
Here, interesting_site denotes the set of sites that are potentially involved in processing the query; the concept is formally defined in Kossmann and Stocker [2000], who also show how this expression can be evaluated efficiently during query optimization under certain conditions. Ganguly et al. [1992] describes further adaptions to the pruning logic that need to be considered if a response time cost model is used (Section 2.2.2).
In the literature, there has been a great deal of discussion concerning bushy or (left-) deep join plan enumeration [Ioannidis and Kang 1991; Lanzelotte et al. 1993; Schneider and DeWitt 1990]. Deep plans are plans in which every join involves at least one base table. Bushy plans are more general; in a bushy plan, a join could involve one or two base tables or the result of one or two other join operations (for instance, the plans of Figure 4 are bushy). The algorithm shown in Figure 3 enumerates all bushy plans, and taking all bushy plans into account is also the approach taken in most commercial database systems. The best plan to execute a query is often bushy and not deep; in particular in a distributed system [Franklin et al. 1996].
[ILLUSTRATION OMITTED]
2.2.2 Cost Estimation for Plans. The Classic Cost Model The classic way to estimate the cost of a plan is to estimate the cost of every individual operator of the plan and then sum up these costs [Mackert and Lohman 1986]. In this model, the cost of a plan is defined as the total resource consumption of the plan. In a centralized system, the cost of an operator is composed of CPU costs plus disk FO costs. The disk I/O costs, in turn, are composed of seek, latency, and transfer costs. In a distributed system, communication costs must also be considered; these costs are composed of fixed costs per message, per-byte costs to transfer data, and CPU costs to pack and unpack messages at the sending and receiving sites. The costs can be weighted in order to model the impact of slow and fast machines and communication links; for example, it is more expensive to ship data from Passau (Germany) to Washington (USA) than from Passau to Munich (Germany). Also, high weights are assigned to the CPU instructions and disk I/O operations that are carried out by heavily loaded machines. As a result, the optimizer will favor plans that carry out operators at fast and unloaded machines and avoid expensive communication links, wherever possible.
Response Time Models. The classic cost model that estimates the total resource consumption of a query is useful to optimize the overall throughput of a system: if all queries consume as few resources as possible and avoid heavily loaded machines, then as many queries as possible can be executed in parallel. The classic cost model, however, does not consider intraquery parallelism, so an optimizer based on this cost model will not necessarily find the plan with the lowest response time for a query in cases in which machines are lightly loaded and communication is fast.
To give an example that demonstrates the difference between the total resource consumption and the response time of a plan, consider the two plans of Figure 4. Assuming that the costs of join processing are …