|
COPYRIGHT 2000 Association for Computing Machinery, Inc.
1. INTRODUCTION
This article describes the concept of tapes [Keleher 1999]: a new high-level abstraction that unifies the expression and implementation of a number of techniques for improving the performance of software distributed shared memory (SDSM) protocols. SDSM protocols support the abstraction of shared memory to parallel applications running on networks of workstations. The SDSM abstraction provides an intuitive programming model and allows applications to become portable across a broad range of environments. These environments can include clusters of inexpensive PCs and workstations, allowing a much better trade-off between price and performance to be achieved than with most hardware-supported shared-memory machines. While SDSM systems have primarily been used as an effective way to obtain cheap cycles, they are also useful in integrating machines with important resources (access to a sensor or a database, for example) into computations running on other machines. Finally, SDSM provides a uniform shared-memory abstraction over the small-scale multiprocessors that are becoming common in labs and on desktops. However, this level of abstraction prevents the application from improving performance by explicitly directing data movement. While it is relatively easy to get parallel applications working on current DSMs, it can be very difficult to achieve high performance.
Tapes can make this task easier by allowing the data movement to be directed by the application at a high level of abstraction. A tape is essentially an object that encapsulates an arbitrary number of updates to shared data. Tapes are created through calls to the tape library that start and stop recording of updates to shared data made by the local process. Once created, a tape provides a convenient way to manipulate the updates. The data referenced by a tape can be sent to another process. Tapes can be reshaped by changing the set of data to which they refer. Tapes can also be added and subtracted, allowing a single tape to describe any arbitrary set of updates.
As a quick example, Figure 1 shows a simple use of the tape mechanism. We defer detailed description of this example until the next section. Essentially, however, the example shows process [P.sub.1] modifying three shared pages while holding lock [L.sub.1], followed by [P.sub.2] acquiring the same lock and reading the same three pages.
[Figure 1 ILLUSTRATION OMITTED]
In a traditional invalidate protocol, [P.sub.1]'s modifications would cause all three pages to be invalidated at [P.sub.2]. The subsequent reads by [P.sub.2] would each cause remote page faults. Each fault is satisfied by retrieving a current copy of the faulting page from a remote processor, and hence implies at least one network RPC. After the data are returned and copied to the correct location, page protections are changed to allow the page to be accessed normally.
By including the code in italics, however, [P.sub.1] can record the accesses automatically, append the modified data to the lock grant message, and update, rather than invalidate, [P.sub.2]'s copy of the page. For each page fault thereby avoided, the system eliminates both local fault-handling overhead and network RPCs.
The key points of this example are the following. First, tapes allow sharing behavior to be captured at runtime. The system needs neither compiler cooperation nor extensive user interaction in order to determine exactly which pieces of shared data are accessed by [P.sub.1]. This is important because we do not assume any explicit associations between synchronization and shared data, just as no such associations are assumed in a typical multithreaded environment like Pthreads.
Second, moving the data with the lock is only a performance optimization; it cannot cause correctness to be violated. No damage is done if [P.sub.2] does not access either x, y, or z. Any additional pages accessed by [P.sub.2] will be demand-paged across the network when the pages are accessed.
While tapes could be used directly by applications, they are probably more useful when folded into specialized synchronization libraries. Such libraries can reduce the total application involvement to just the replacement of calls to generic synchronization primitives with calls to the corresponding routines in the new libraries. This indirection allows the synchronization implementation to be quite simple, without losing any generality. On the other hand, sophisticated middleware or application programmers can use tapes abstractions to directly improve performance.
The primary claimed advantage of SDSM systems over message-passing programming models is ease of use. By abstracting away any need to specify data locations, SDSM systems allow parallel and distributed applications to be more simply created. Requiring applications to contain additional annotations would seem to run counter to this goal. However, synchronization libraries can hide the mechanism from programmer view. The only change needed to use tape mechanisms in these cases is linking with a different library. Moreover, tape mechanisms can be added to applications incrementally. Applications can be developed and tested without tapes. Since tape mechanisms do not affect correctness, adding tape calls cannot break any application that has already been debugged.
We used tapes to implement Tapeworm, a new synchronization library that is layered on top of existing consistency and synchronization protocols in CVM [Keleher 1996], a software distributed shared memory system. The use of tapes allowed us to write Tapeworm in fewer than 400 lines of C++ code. At the same time, Tapeworm is able to track and use very sophisticated data-movement patterns. Specifically, Tapeworm augments ordinary locks to include data-movement semantics in addition to synchronization semantics. Tapeworm also supports producer-consumer regions and record-replay barriers. Record-replay barriers use recordings of data accesses from one iteration of an application to anticipate accesses during future iterations.
Overall, Tapeworm eliminates an average of 85% of data misses on our suite of applications. The reduction in misses translates into a reduction in message traffic of 63%, and an average improvement in overall performance of approximately 29%.
The rest of the article is as follows. Section 2 discusses the high-level semantics of tapes in a protocol-independent fashion. Section 3 describes the protocol-independent interface to Tapeworm, a high-performance synchronization library built using tapes. Section 4 describes the requirements that the tapes abstraction, and Tapeworm in particular, makes on the underlying consistency protocols, and Section 5 describes Tapeworm's performance. Section 6 describes the use of tapes in emulating update-based protocols such as home-based LRC [Zhou et al. 1996] and scope consistency [Iftode et al. 1996]. Section 7 describes related work, and Section 8 concludes.
2. TAPE SEMANTICS
A tape is a recording of shared accesses. Clearly, a tape containing a record
of all accesses to shared memory could not be implemented efficiently in software. Hence, accesses must be manipulated in coarser units. We assume that the underlying constancy protocol aggregates accesses by taking advantage of both spatial and temporal locality in the application. Spatial locality is exploited by grouping all accesses to the same object or page into a single unit. Hereafter, we will refer to this unit as a page, but it could refer to any systematic grouping of consecutive addresses. Temporal locality is exploited by dividing each process' execution into distinct intervals, each of which is labeled with a system-unique interval ID. The exact method by which intervals are defined is not important, although most protocols will probably delimit intervals by synchronization events. For example, each of the processes in Figure 1 has two intervals, delimited by synchronization accesses to lock [L.sub.1]. These optimizations allow a tape to be constructed from lists of modified pages during distinct intervals, instead of addresses and cycle counts.
More specifically, a tape consists of a set of events, each of which is a three-tuple (x, y, z), where x is an interval ID; y is a set of page IDs; and z is a processor ID. The processor ID is not shown in the text below where it can be derived from context. We assume here that such events only correspond to write operations, but we extend the discussion to reads and requests below. Hence, [tape.sub.1] in Figure 1 consists of the three events {(1,1), (1,2), (1,3)). Note that the event (and the tape) consists only of the tuple; it does not contain the actual modifications. The actual modifications are tracked by the underlying protocol.
Tapes can be created in several different ways, but the primary method is that shown in Figure 1, e.g., recording accesses over a period of time. This method of creating tapes enables synchronization protocols to capture dynamic access patterns at runtime, rather than relying on the programmer or compiler to derive complete information statically.
A second method of creating tapes is for them to be generated by hooks into the underlying consistency protocol. While we defer full discussion of the interface to the underlying protocol until Section 3, hole_tape (Extent *) is fundamental to some, of the interfaces discussed in the next section. Its function is to create and return a tape that describes all updates needed to validate the region of memory described by an extent. A shared page is validated by applying all updates necessary to bring the page up-to-date.
Extent is short for "data extent." An extent is merely a list of pages. Extents are useful when the full information encoded in a tape is not needed. For example, if a synchronization interface needs to know the set of pages modified by a process while a lock is held, a tape is created by recording accesses during the synchronized period. The tape is then projected into an extent listing the pages accessed by the tape's events by removing all information from the tape's tuples except page IDs.
There are three tape variations. The canonical form is a write tape, created primarily by recording write accesses. Read tapes are created by recording read accesses. Finally, request tapes can be created by recording data requests received by the local node. Request tapes can be used to locally obtain information about the data accessed by other nodes. Unlike read and write tapes, request tapes are not complete. They do not describe all accesses made by a node at any specific time. Nonetheless, they can be a cheap and useful way of obtaining information about remote accesses without explicitly requesting it.
Once a tape has been created, it can be transmitted to remote sites, projected into an extent, pruned to contain only notices that pertain to a given extent, or added to another tape. Most importantly, the tape can be used to request a set of pages or updates that will soon be needed locally, before the data are needed.
The approach shown in this example has several advantages over other approaches described in the literature. Simple update protocols push modified data to existing replicas to update them, rather than to invalidate them. The advantage of such protocols is that subsequent page faults are avoided, but the lack of any selectivity often causes update protocols to move far more data than invalidate protocols. Several researchers have described more selective update protocol variants [Amza et al. 1996; Tseng and Keleher 1997; Wittie et al. 1992] that might also suffice in this example. However, these protocols effectively encode expected sharing behavior into the underlying protocol. By making such expectations part of the programmable protocol interface, the tape mechanism has far more flexibility.
2.1 Operations
As mentioned above, tapes can be added, subtracted, and projected to extents. We discuss allowable operations in more detail in this section. Recall that a tape, [T.sub.a], is an unordered set of three-tuples, each of which contains an interval ID, a page ID, and a process ID: ([v.sub.a], [m.sub.b], [p.sub.c]). Addition is a simple set operation:
(1) [T.sub.a] + [T.sub.b] = {x | x [element of] [T.sub.a] [disjunction] x [element of] [T.sub.b]}
Subtraction is similar:
(2) [T.sub.a] - [T.sub.b] = {x | x [element of] [T.sub.a] [conjunction] x [is not element of] [T.sub.b]}
Projection is used to extract information from a single tape dimension, such as the set of pages accessed. Projection of a single three-tuple consists of extracting the appropriate index. We denote extracting the second index of a three-tuple as follows:
(3) [[Pi].sub.2](a, b, c) = b
Projection of a an entire tape into a single dimension (either interval indices, page ranges, or processor lists) can be defined as follows:
[[Pi].sub.2]([T.sub.a]) = {b | [exists]a, c where (a, b, c) [element of] [T.sub.a]}
Such a projection defines an extent, either temporal, spatial, or processor.
The main use of extents is in pruning other tapes. For example, consider a static, iterative, three-process application where producer [P.sub.0] repeatedly modifies two chunks of data. One piece of the data is read during the iteration after it is produced by [P.sub.2], and the other by [P.sub.2]. Assume that [P.sub.0] creates a write tape, [T.sub.0], and assume [P.sub.1] creates a read tape, [T.sub.1], during iteration i. The set of data that will be accessed in iteration i + 1 by [P.sub.1] is a subset of the data named by [P.sub.0]'s iteration i write tape. More specifically, the data that will be needed by [P.sub.1] consists of all the data named in [T.sub.0] that pertains to the pages mentioned by [T.sub.1]. We can describe this data formally as follows:
(5) [Pi].sub.2]([T.sub.0] / [T.sub.1]) = [T.sub.0] - {(a, b, c) | b [is not element of] [Pi].sub.2]([T.sub.1])}
The resulting tape can be used to request precisely the page needed by [P.sub.1] in...
Read the full article for free courtesy of your local library.
|