ABSTRACT
The Persistent Archive Testbed and National Archives and Records Administration (NARA) research prototype persistent archive are examples of preservation environments. Both projects use data grids to implement data management infrastructure that can manage technology evolution. Data grids are software systems that provide persistent names to digital entities, manage data that are distributed across multiple types of storage systems, and provide support for preservation metadata. A persistent archive federates multiple data grids to provide the fault tolerance and disaster recovery mechanisms essential for long-term preservation. This article presents the capabilities of the prototype persistent archives, along with examples of how those capabilities are used to support the preservation of email, Web crawls, office products, image collections, and electronic records.
PROTOTYPE PRESERVATION ENVIRONMENTS
The San Diego Supercomputer Center (SDSC) collaborates with the National Archives and Records Administration (NARA) on research into the development of a prototype persistent archive. The collaboration examines how advanced data management systems can be used to support the long-term preservation of data. The original goal included assessing mechanisms for managing technology obsolescence. The ability to migrate electronic records to new storage systems was called "infrastructure independence." The preservation system should be extensible and able to use more cost-effective storage technologies as they become available. A second goal was the assessment of scalability mechanisms that would enable support for archives holding hundreds of millions of files and hundreds of terabytes of data. The data management technology that meets these goals is called a "data grid." This article examines how data grids support preservation requirements.
Preservation is the process of migrating a digital entity forward in time while preserving its authenticity and integrity. (1) Authenticity is an assertion that a specific digital entity can be identified relative to the context in which it was created. The context includes provenance information such as the creator of the digital entity, procedural information such as the processes that were used to create the digital entity, and administrative information such as the institution that authorized the digital entity creation. The integrity of a digital entity is an assertion that its information content has not been modified, that the chain of custody can be verified, and that transformations on its encoding format were performed by identified archival procedures.
A digital entity can be an electronic record, a data file created by a scientific application, a text file created by a word processing system, an image taken by a remote sensor, or any string of bits that can be named. The preservation process requires the extraction of the digital entity from the environment in which it was created and its import into the preservation environment. Once the digital entity is under the control of the archivist, the authenticity and integrity properties can be implemented with assurance that continued access is sustainable. This article looks at the challenges that must be overcome when extracting a digital entity from its creation environment, the technologies that can be used to manage authenticity and integrity, and some examples of preservation environments.
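One way to make the integrity assertion concrete is to record a checksum and a custody event when an entity is brought under archival control, and to recompute the checksum later. The sketch below is illustrative only: the article does not name a checksum algorithm or a record layout, so the use of SHA-256 and the names `PreservedEntity`, `ingest`, and `is_intact` are assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PreservedEntity:
    """Hypothetical record: a digital entity plus the metadata needed to
    assert its authenticity and integrity."""
    name: str
    content: bytes
    creator: str                 # provenance information
    checksum: str = ""           # fixity value recorded at ingest (assumed SHA-256)
    custody_log: list = field(default_factory=list)  # (action, actor) events


def ingest(name: str, content: bytes, creator: str, archivist: str) -> PreservedEntity:
    """Bring an entity under archival control and record its checksum."""
    entity = PreservedEntity(name, content, creator)
    entity.checksum = hashlib.sha256(content).hexdigest()
    entity.custody_log.append(("ingest", archivist))
    return entity


def is_intact(entity: PreservedEntity) -> bool:
    """Return True if the content still matches the checksum recorded at ingest."""
    return hashlib.sha256(entity.content).hexdigest() == entity.checksum


record = ingest("report.txt", b"annual report", "agency-a", "archivist-1")
print(is_intact(record))   # content unchanged since ingest
record.content = b"annual report, edited"
print(is_intact(record))   # content no longer matches the recorded checksum
```

A checksum covers only the "has not been modified" part of the assertion; the custody log stands in for the chain of custody, and format transformations would need their own recorded events.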
PRESERVATION CHALLENGES
The idea that a digital entity can be extracted from its creation environment is called infrastructure independence (Moore et al., 2000). A digital entity depends upon both software and hardware infrastructure to ensure its support and management. Thus, a file resides in a file system that provides a storage location, a name for the file, management of file properties, names for the persons who are allowed to manipulate the file, and controls on the type of permitted operations. The file properties typically include the size of the file, the owner of the file, the date the file was created, and the date the file was last modified. The extraction of the digital entity from this supporting environment requires the ability to impose
* storage of the digital entity at a location specified by the archivist
* a persistent naming convention for the digital entity that remains invariant as the digital entity is moved between storage systems
* management of file properties that are needed to assert authenticity and integrity
* persistent identifiers for the archivists who are managing the preservation environment
* persistent management of the access controls for allowed operations.
Infrastructure independence means that no matter where the digital entity is stored, the archivist retains the ability to control each of the support properties, independently of the mechanisms provided by a particular choice of storage system. Ideally, an archivist would be able to import a digital entity into a preservation environment that guarantees that the naming conventions will persist through all future choices of technology. One way to implement infrastructure independence is to insert a data management layer between a digital entity and the underlying storage environment. The archivist controls the persistent naming conventions through the data management layer. This approach is illustrated in figure 1.
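The idea of a data management layer can be sketched in a few lines. This is a toy model under stated assumptions, not the software used by the projects described here: the class `DataManagementLayer`, its methods, and the storage names "tape" and "disk" are all hypothetical, and in-memory dictionaries stand in for real storage systems.

```python
class DataManagementLayer:
    """Hypothetical layer that maps persistent logical names to physical locations."""

    def __init__(self):
        self.storage = {}   # storage system name -> {physical path: bytes}
        self.catalog = {}   # logical name -> (storage system name, physical path)
        self._next_id = 0   # counter used to build unique physical paths

    def add_storage(self, system):
        """Register a storage system (modeled here as an in-memory dict)."""
        self.storage[system] = {}

    def put(self, logical_name, data, system):
        """Store data on a system and point the logical name at it."""
        physical_path = f"/{system}/obj{self._next_id}"
        self._next_id += 1
        self.storage[system][physical_path] = data
        self.catalog[logical_name] = (system, physical_path)

    def get(self, logical_name):
        """Resolve the logical name and return the stored bytes."""
        system, path = self.catalog[logical_name]
        return self.storage[system][path]

    def migrate(self, logical_name, new_system):
        """Move the entity to another system; the logical name stays the same."""
        old_system, old_path = self.catalog[logical_name]
        data = self.storage[old_system][old_path]
        self.put(logical_name, data, new_system)
        del self.storage[old_system][old_path]


layer = DataManagementLayer()
layer.add_storage("tape")
layer.add_storage("disk")
layer.put("archive/report.txt", b"annual report", "tape")
layer.migrate("archive/report.txt", "disk")
print(layer.get("archive/report.txt"))
```

Applications only ever see the logical name, so replacing "tape" with a newer storage technology changes the catalog entry but not the name the archivist assigned.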
[FIGURE 1 OMITTED]
In the original creation environment, the application that created the digital entity interacted directly with the storage system (shown by the dashed arrow). In the preservation environment, the applications that are used for display and manipulation now interact with a storage system through a data grid, in which the digital entities have been organized as a collection (Rajasekar, Marciano, & Moore, 1999). The data collection is used to assign metadata attributes to each digital entity to manage the authenticity and integrity properties.
The data grid provides its own naming conventions to describe the logical storage location, the logical file name, the metadata attributes, the distinguished names for the archivists, and the control and consistency mechanisms. Each logical name space that is managed by the data grid is essential for implementing infrastructure independence. The logical name spaces can be used to manage digital entities that are distributed across multiple storage systems and located at multiple sites around the country. The logical name spaces make it possible to use global identifiers that do not change when a digital entity is moved to another storage system. We can illustrate this by considering examples of how each logical name space would be used by a preservation environment.
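As a rough illustration of how several logical name spaces fit together, the sketch below keeps one catalog entry per digital entity, with logical resource names for replicas, metadata attributes, and access controls keyed by distinguished names. The structure and every name in it (`CatalogEntry`, `can`, `move_replica`, the resource names, paths, and distinguished-name strings) are hypothetical and are not taken from any particular data grid.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry tying the logical name spaces together."""
    logical_name: str                              # global identifier for the entity
    replicas: dict = field(default_factory=dict)   # logical resource name -> physical path
    metadata: dict = field(default_factory=dict)   # authenticity and integrity attributes
    acl: dict = field(default_factory=dict)        # distinguished name -> allowed operations


def can(entry: CatalogEntry, user_dn: str, operation: str) -> bool:
    """Check an operation against the access controls held in the catalog."""
    return operation in entry.acl.get(user_dn, set())


def move_replica(entry: CatalogEntry, old_resource: str,
                 new_resource: str, new_path: str) -> None:
    """Re-home one replica; the logical name, metadata, and ACL are untouched."""
    entry.replicas.pop(old_resource)
    entry.replicas[new_resource] = new_path


entry = CatalogEntry(
    logical_name="/archive/images/scan-001",
    replicas={"site-a-disk": "/vol1/x91", "site-b-tape": "T0042:17"},
    metadata={"creator": "agency-a"},
    acl={"/O=Archive/CN=archivist-1": {"read", "replicate"}},
)
move_replica(entry, "site-a-disk", "site-a-newdisk", "/vol9/k3")
print(entry.logical_name, sorted(entry.replicas))
print(can(entry, "/O=Archive/CN=archivist-1", "read"))
```

Because the access controls and metadata hang off the logical name rather than off a file system at one site, they survive the move of a replica to a different storage system.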
DATA GRIDS
The software that implements collection-based data management for distributed data is called a data grid (Foster & Kesselman, 1999). It runs as an application (or server) on each computer platform that manages a storage system. The data grid servers talk to each other in a federated environment. Messages can be sent between servers to move files, replicate files, and access files. The digital entity properties managed by the data grid are stored in a database as…