|
COPYRIGHT 2000 Association for Computing Machinery, Inc.
1. INTRODUCTION
Compilers for higher-order languages (e.g., Scheme, ML, Haskell) take great efforts to optimize function calls and returns because they are fundamental control structures. Before a function call occurs, context information is saved from registers into a "frame." In a compiler based on Continuation-Passing Style (CPS), this frame is the closure of a continuation function [Steele 1978].
In a CPS-based compiler, a closure environment is constructed at each function-definition site; it provides runtime access to bindings of free variables in the function body. Each function call is then implemented as first installing the corresponding closure environment, setting up the arguments (normally in registers), and then passing the control to the target by a "jump" instruction. Function returns are implemented in the same way because they are essentially calls to continuation functions if represented in CPS. A non-CPS-based compiler does not introduce explicit continuations, but it requires a similar mechanism to manage all the closures.
The goal of closure conversion is to choose closure representations that minimize stores (for closure creation), fetches (to access free variables), and memory use (for reachable data). Depending on the runtime behavior of each function, closures can be represented as data structures of virtually any shape--allocated in the heap, on the stack (if any), or in registers. Clearly, the decisions of where and how to represent closures at runtime can greatly affect the quality of generated code. For example, Kranz's Orbit [Kranz et al. 1986; Kranz 1987] compiler generates very fast code by having six different closure allocation strategies for different kinds of functions; the MIT scheme compiler [Rozas 1984; Hanson 1990] implements tail-recursive procedure calls as efficiently as most explicit iteration constructs by using special closure representations and calling conventions; the Standard ML of New Jersey compiler (SML/NJ) [Appel and Macqueen 1991; Appel and Shao 1992] represents the return-continuation closure using a set of callee-save registers to achieve closure sharing and fast access to free variable bindings. In our current work, we have integrated all of these techniques into a unified framework to achieve excellent code quality.
We have developed a new algorithm for choosing good closure representations. As far as we know, our new closure allocation scheme is the first to satisfy all of the following important properties:
--Unlike stack allocation and traditional linked closures, our shared closure representations are safe for space complexity (see Section 2); at the same time, they still allow extensive closure sharing.
--Our closure allocation scheme exploits extensive use of compile-time control and data flow information to determine the closure representations.
--Source-language functions that make several sequential function calls can build one shared closure for use by all the continuations, taking advantage of callee-save registers.
--Because activation records are also allocated in the heap, they can be shared with other heap-allocated closures. This is impossible under stack allocation because stack frames often have shorter lifetime than heap-allocated closures.
--Tail recursive calls--which are often quite troublesome to implement correctly on a stack [Hanson 1990]--can be implemented very easily.
--All of our closure analysis and optimizations can be cleanly represented using continuation-passing and closure-passing style [Appel and Jim 1989] as the intermediate language.
--Once a closure is created, no later writes are made to it; this makes generational garbage collection and call/cc efficient, and also reduces the need for alias analysis in the compiler.
--Because all closures are either allocated in the heap or in registers, first-class continuations (i.e., call/cc) are very easy to support efficiently.
Our new closure-allocation scheme does not use any runtime stack. Instead, all closure environments are either allocated in the heap or in registers. This decision may seem controversial, because stack allocation is widely believed to have better locality of reference, and deallocation of stack frames can be cheaper than garbage collection. Moreover, because heap allocated closures are not contiguous in memory, an extra memory write and read (of the frame pointer) are necessary at each function call. These assumptions, while true, are not as significant as one might think, and are offset by the disadvantages of stack allocation:
--As we will show in Section 4, because most parts of continuation closures are allocated in callee-save registers [Appel and Shao 1992], the extra memory write and read at each call can often be avoided. With the help of compile-time control and data flow information, the combination of shared closures and callee-save registers can often be comparable to or even better than stack allocation [Appel and Shao 1996].
--In a companion paper [Appel and Shao 1996], we show that stacks do not have a significantly better locality of reference than heap-allocated activation records, even in a modern cache memory hierarchy. Stacks do have a much better write-miss ratio, but not a much better read-miss ratio. But on many modern machines, the write-miss penalty is approximately zero [Jouppi 1993; Diwan et al. 1994].
--The amortized cost of collection can be very low [Appel 1987; Appel and Shao 1996], especially with modern generational garbage-collection techniques [Ungar 1986; Reppy 1993].
The main contribution of this paper is an efficient and safe-for-space closure conversion algorithm that integrates and improves most previous closure-analysis technique [Kranz 1987; Appel and Shao 1992; Steele 1978; Rozas 1984; Hanson 1990; Johnsson 1985] using a simple and general framework expressed in continuation-passing and closure-passing style [Appel and Jim 1989; Appel and Shao 1992]. Our new algorithm extensively exploits the use of compile-time control and data-flow information to optimize closure allocation strategies and representations. Our measurements show that the new algorithm reduces heap allocation by 36% and memory fetches for local/global variables by 43%; and improves the already-efficient code generated by an earlier version of the SML/NJ compiler by about 17% on a DECstation 5000.
2. EFFICIENCY AND SPACE SAFETY
It is difficult to design good closure-allocation schemes that are both time-efficient and space-efficient. In fact, optimization of closure representations is sometimes dangerous and even unsafe for space usage (i.e., the maximum size of simultaneously live data during execution). Chase [1988] observed that certain storage-allocation optimizations may convert a program that runs robustly into one that does not, due to the requirement of a larger fraction of memory than the program actually needs. Appel [1992] also noticed that programs using linked closures,(1) or stack-allocated activation records, may cause a compiled program to use much more memory.
Consider the program in Figure 2 written in Standard ML (SML) [Milner et al. 1990]. With flat closures(2) (see Figure 1), each evaluation of f (...)() yields a closure H for function h that contains just a few integers u, w, x, y, and z; the final result (i.e., result) contains N copies of the closure for h, thus it uses at most O(N) space. With linked closures (also see Figure 1), each closure for h contains a pointer to the closure for g, which contains a list v of size N. Since the final result keeps N closures for h simultaneously, it uses O([N.sup.2]) space instead of O(N). Apparently, this space leak is caused by inappropriately retaining some "dead" objects (i.e., v) that should be reclaimed earlier by the garbage collector.
[Figures 1-2 ILLUSTRATION OMITTED]
In 1992, we found several instances of real programs whose live data size (and therefore memory use) was unnecessarily large (with factors of 2 to 80) when compiled by early versions of our compiler that introduced this kind of space leak. All recent versions of SML/NJ have obeyed the "safe for space complexity" (SSC) rule, and users really did notice the improvement. The SSC rule is stated as follows: any local variable binding must be unreachable after its last use within its scope. Please see Appel [1992] for a more formal definition.
The SSC definition of reachability is often viewed as an optimization to eliminate environment-related space leaks, but we believe that it is a crucial property that any industrial-strength compiler for functional languages must satisfy. Consider the following ML function simulating N phases of a typical compiler:
fun compile (codeInSrc) = let val codeInIL1 = phase1(codeInSrc) val codeInIL2 = phase2(codeInIL1) ...... val codeInILN = phaseN(codeInILNm1) in codeInILN end
If we rewrite this code in C, we would manually use the free operations to deallocate all intermediate data structures (e.g., after a compilation phase is done). In a garbage-collected language, however, we no longer have the capability to manually free memory, since memory management is now done implicitly by the garbage collector. Naturally, we would hope that the compiler would be responsible for getting everything right.
Unfortunately, most compilers treat the compile function just like a simple C-like subroutine. Local variables such as codeInSrc, codeInIL1, ..., and codeInILN--assuming they are all pointers to some heap-allocated objects--are all placed in the same stack frame, and they are considered to be live until after we return from the call to phaseN and when we pop off the stack frame for compile. This is clearly unacceptable because it keeps all intermediate representations simultaneously live, thus requires much more memory than really necessary. A C programmer would have manually freed the intermediate representations from the earlier phases before moving onto the late phases.
We thus argue that garbage-collected languages must satisfy the new SSC scoping rule if they use functions pervasively. Each local variable should be considered "dead" after its last use in the current function body. By dead, we really mean that it is not contributing to the liveness of the data structure that it points to. Supporting the SSC rule is important because functions such as compile are common in large software. The SSC rule is not as important for C-like languages because we can manually deallocate intermediate data structures in the program source.
Traditional stack-allocation schemes and linked closures obviously violate the SSC rule because local variable bindings will stay on the stack until they exit their scope, so they may remain live even after their last use. Flat closures do satisfy the SSC rule, but they require that variables be copied many times from one closure to another. Many of the closure strategies described by Appel and Jim [1988] also violate the SSC rule.
Obeying the SSC rule can require extra copying of pointer values from an old closure that contains them (but also contains values not needed in a new context) into a new closure. One cannot simply "zap" the unneeded values in the old closure, since it is not clear whether there are other references to the old closure. The challenge is to find efficient closure strategies that obey the SSC rule while minimizing copying.
Our new closure-conversion algorithm uses safely linked closures (the 3rd column in Figure 1) that contain only variables actually needed in the function but avoid closure copying by grouping variables with the same lifetime into a sharable record.
In Figure 1, we use G, H, and I to denote the closure, and g, h, and i for code pointers. With flat closures, variables w, x, y, and z must be copied from the closure of g into the closure of h, and then into the closure of i, this is very expensive. With traditional linked closures, closures for h and i are unsafely reusing the closure for g, retaining the variable v that is not free in h or i; moreover, accessing variables w, x, y, and z inside I is quite expensive because at least two links needs to be traversed. By noticing that w, x, y, and z have same lifetime, the safely linked closure for g puts them into a separate record, which is later shared by closures for h and i. Unlike linked closures, the nesting level of safely linked closures never exceeds more than two (one layer for the closure itself; another for records of different life time), so they still enjoy very fast variable access time.
To support the SSC scoping rule for cases like the compile function, we treat all activation records as closure environments for continuation functions. As for flat closures, we insist that each such environment contain only variables used in the continuation. To avoid high copying overhead, we use a combination of safely linked closures and callee-save registers to represent such environment; the resulting code not only satisfies the SSC rule (thus minimize the memory use) but also supports very fast closure creation and free-variable access.
An alternative way is to stick to the standard stack-based scheme, which violates the SSC rule because dead local variables remain in the stack frame until the function returns. This can be partially fixed by associating a descriptor with each return address, showing which variables are live in the continuation [Appel and Shao 1996; Augustsson 1989]; the garbage collector then finds and interprets the descriptor during each collection to ignore tracing those "dead" local variables. In addition, free variables inside a function closure must be copied from the heap to the stack in order to fully obey the SSC rule. For example, in the following ML program:
fun f(h, u, v, w) = let fun g() = let val x = h(u, v) val y = h(x, v) val z = h(y, w) in z end in g end
The closure for function g contains free variables h, u, v, and w. Under the standard stack implementation, g's frame would contain a pointer to h's closure, and all four free variables of g are kept live until g returns. But variable u reaches its last use at the first call to h; keeping it live with the rest of the closure is clearly not space-safe.
In a real compiler, the stack-based scheme must also incorporate a wide range of other extensions to...
Read the full article for free courtesy of your local library.
|