|
COPYRIGHT 2003 American Statistical Association
Advances in computing power are allowing researchers to use Bayesian hierarchical models (BHM) on problems previously considered computationally infeasible. This article discusses the procedure of migrating a BHM from a workstation-class implementation to a massively parallel architecture, indicative of the current direction of advances in computing hardware. The parallel implementation is nearly 500 times larger than the workstation-class implementation from the data perspective. The BHM in question combines the information from a scatterometer on board a polar-orbiting satellite and the result of a numerical weather prediction model and produces an ensemble of high-resolution tropical surface wind fields with physically realistic variability at all spatial scales.
Key Words: Fortran 90; Gibbs sampling; Parallel computing; Spatio-temporal processes.
1. INTRODUCTION
Recent advances in Bayesian methodology and hierarchical models have created a new era in statistical methods for the geosciences. However, geophysical data of scientific import are often large, complex, and pose difficulties for a statistical scientist who is interested in transferring nontrivial methods to a substantial problem. This article traces through the computational process of combining two very different and large wind field datasets to estimate a surface wind field at six-hour intervals for a three-year period for a region of the tropical ocean. The result is a "data product" that is more suitable for scientific research than the original observations and can be distributed in a compressed and convenient format. We use this focused project to illustrate some general concepts that are useful for other statistical groups tackling large geophysical problems or spatio-temporal datasets.
One motivation for these ideas comes from the interdisciplinary workshop on Massive Data Streams, convened by the Committee of Applied and Theoretical Statistics of the National Academy of Sciences December 2002. The workshop identified a range of massive data stream applications from such disparate fields as experimental particle physics, Internet communications, and Earth science. Applications based on the massive data streams implied by Earth-observing satellite systems differ from applications in particle physics or Internet communications in a fundamental way that we will explore here. In the case of the Earth-observing satellite data stream, data are rarely discarded and indeed, value-added datasets of the kind to be described here can increase the volume of data to be archived and reused. From a practical perspective, this project combines a large spatio-temporal input data stream with a computationally intense statistical model that results in a very large output data stream. Because we have chosen a simple format of drawing realizations from the posterior, the output data stream may be many times larger than the comparable input streams.
Another salient difference between geophysical data streams and other examples from the workshop is the spatio-temporal structure of the streams. Because geophysical data are often observations of a continuous process, the statistical models, at least formally, must have global extent and do not readily break into independent pieces. We have found it useful to characterize a massive dataset based on its context. Although our datasets are smaller than those of some other problems, the process models for the ocean surface wind require that many parts of the data are related to each other and these interactions add much more computational burden. This overhead is typical in spatial data problems where objects such as covariance matrices nominally have sizes that grow by the square of the number of observations.
Understanding how physical, chemical, and biological processes, as well as human activities, influence the complete Earth system is a grand scientific challenge for this century. Coupled with this integrated study is a rapid increase in Earth observations from spaceborne platforms. This includes new sensor systems, new types of target observations and finer resolutions and will result in exponentially increasing data volumes. A fruitful area for statistical science is to work with increasing observation dataset sizes without compromising the integrity of the statistical analyses. For example, often problems of "data fusion"--combining information from several sources or instruments--can be solved by hierarchical statistical models that relate the different forms of data to the geophysical quantities of interest. With large datasets and limited resources, however, it may not be possible to implement a hierarchical model in a straightforward manner.
1.1 COMBINING DATA
We propose little new statistical methodology. We believe that Bayesian hierarchical models (BHM) are rapidly becoming a mature and flexible tool for handling complex geophysical processes. In contrast to the richness of the statistical methodology, we feel there is a deficit in principles for adapting the methods for large and interesting problems. The goal of this article is to outline the many ways in which running jobs on a workstation is fundamentally different from using a shared, high-performance computing environment. We illustrate some general ideas by tracing through the production of a value-added surface wind dataset for the tropical Indian and western Pacific ocean, based on combining scatterometer observations from the SeaWinds system on the NASA QuikSCAT satellite (SWS on QSCAT) and a more conventional numerical weather prediction (NWP) data source from the National Centers for Environmental Prediction (NCEP). The work was done at the supercomputer facility at the National Center for Atmospheric Research (NCAR) and made use of a massively parallel system (called blackforest) coupled to a mass storage device. To our knowledge this project is larger, by perhaps two orders of magnitude, than other BHM applications.
Our basic statistical task is to blend or fuse two complementary datasets to produce a product that is superior to either one alone. The scatterometer data provide observations with unprecedented density over the global oceans, but with irregularly spaced gaps in space and time (see Figure 1). The numerical weather prediction (NWP) winds are on regular spatio-temporal grid but lack sufficient detail (e.g., Liu, Tang, and Polito 1998 and Figure 5, p. 800) to fully investigate important physical phenomenon, nor do they provide any estimate of uncertainty in the winds at the espoused locations and times. Our goal is to produce wind fields that blend the high-resolution observations from QSCAT with the NWP-estimated state of the atmosphere at large scales. The final format is a regular spatio-temporal grid of sufficient resolution to address interesting scientific questions. Unlike more standard analyses of geophysical data, we sample the posterior of the space-time wind process to generate realizations of the wind fields that are physically consistent. For the geophysical community this is a novel product because we explicitly produce an ensemble (i.e., a random sample) of 50 realized wind fields whose variation quantifies the uncertainty in the estimates. The realizations also provide excellent input fields for perturbation studies. We will come back to this and other uses in Section 6.
[FIGURE 1 OMITTED]
1.2 ALGORITHM MIGRATION
The experience we gained from this project can be categorized in three ways: adapting to hardware constraints, managing input and output streams, and leveraging the advantages of several programming environments. The first involves balancing software design with available computing resources and storage. For a massively parallel but shared resource, this turns out to be a nontrivial effort and essentially dictates the scope of the domain that can be accommodated in an individual run. The second category involves the coordination of data flow from the input data stream to the computational core and the subsequent handling of the output data stream. We should note that throughout this article we use the term "data" in a cavalier way that may be confusing to some readers. Quite simply, we view any quantitative information as "data" and, for example, this will include the highly processed QSCAT observations described in the next section. Finally, the development of the computational core was simplified by the flexible combination of several software environments. The development of an efficient FORTRAN 90 (F90) computational engine is facilitated by incremental comparisons to the pilot version coded in Matlab[R], an interactive, high-level language. The control of the computation in a batch environment is achieved by UNIX shell scripts and provides the ability to pass run-time information to a precompiled F90 executable. Although these problems might be solved in different ways, given the configuration of high-performance computing facilities and the nature of geophysical data, we feel that similar issues will need to be addressed for any analysis of a BHM with a large dataset.
The remainder of the article is organized to describe the details of these concepts. Section 2 gives an overview of the data streams and a brief review of the BHM for this project. The BHM and an initial application were developed in detail by Wikle, Milliff, Nychka, and Berliner (2001) (hereinafter WMNB) and served as the pilot project referred to below. Section 3 describes the NCAR computing resources and how they dictate our design, including the management of the data streams. The development of the software computational core is described in Section 4. Section 5 discusses production Gibbs sampling and Section...
Read the full article for free courtesy of your local library.
|