ABSTRACT
Detailed knowledge of the internal properties of digital representation formats is necessary to interpret properly the full information content of otherwise opaque digital objects. These properties form an important component of the representation information needed by repository workflows regardless of local preservation strategy and infrastructure decisions. The Digital Library Federation (DLF) has sponsored preliminary investigations toward establishing a Global Digital Format Registry (GDFR) that will function as a sustainable utility for maintaining the bindings between public identifiers for digital formats and the significant syntactic and semantic properties of those formats. A sustainable GDFR should prove to be of great utility to archives, libraries, digital repositories, and other organizations and individuals interested in the long-term viability of digital assets.
DIGITAL FORMATS
It has become commonplace for digital objects to be acceptable and valued assets under the collection development policies of many libraries, archives, museums, and other scientific and cultural heritage repositories with long-term preservation mandates. In general, a digital object can be considered as the encapsulation in digital form of some piece of abstract intellectual content. More specifically, a digital object is the aggregation of one or more formatted content streams representing the primary content of the object as well as associated descriptive, administrative, technical, and structural metadata. Without a thorough understanding of the format of those content streams, the ability to recover the original intellectual content from which those streams were derived is severely compromised, if not made impossible. Furthermore, common agreement on the syntax and semantics associated with an object's formatted content streams is necessary for the effective interchange of that object, whether between institutions implementing different technological infrastructures or between the various processing steps applied to the object as it passes through its intra-institutional life cycle. In essence, a format is the property associated with a content stream that provides the typing information necessary for its proper interpretation.
More formally, a format is a reversible, byte-serialized encoding of an abstract information model, which is itself a formal expression of exchangeable knowledge (International Organization for Standardization, 2003). A format defines the syntactic and semantic rules for the mapping from an information model to a byte stream and the inverse mapping from that byte stream back to the original information model. Historically, discussions of formats have been couched in terms of "file formats." However, as there are many contexts, such as the network transport of formatted content streams or consideration of content streams at a level of granularity finer than that of an entire file, where specific reference to "file" is inappropriate, the more general term "digital formats" will be used in this article.
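The definition above can be made concrete with a toy example. The following is a minimal sketch, not drawn from any real format specification: the `Record` information model, the `TOY1` signature, and the byte layout are all hypothetical, invented only to show a mapping from an information model to a byte stream and its inverse.

```python
import struct
from dataclasses import dataclass

MAGIC = b"TOY1"  # hypothetical signature, invented for this sketch


@dataclass(frozen=True)
class Record:
    """Abstract information model (hypothetical): a title and a year."""
    title: str
    year: int


def encode(record: Record) -> bytes:
    """Map the information model to a byte stream."""
    title_bytes = record.title.encode("utf-8")
    # Header: 4-byte signature, unsigned 16-bit year,
    # unsigned 16-bit title length, all big-endian.
    header = struct.pack(">4sHH", MAGIC, record.year, len(title_bytes))
    return header + title_bytes


def decode(stream: bytes) -> Record:
    """Inverse mapping: recover the information model from the byte stream."""
    magic, year, length = struct.unpack(">4sHH", stream[:8])
    if magic != MAGIC:
        raise ValueError("stream does not carry the TOY1 signature")
    title_bytes = stream[8:8 + length]
    if len(title_bytes) != length:
        raise ValueError("stream is truncated")
    return Record(title=title_bytes.decode("utf-8"), year=year)
```

Reversibility here means that `decode(encode(r)) == r` for any record the layout can represent. Without knowledge of the layout rules (signature, field order, byte order, character encoding), the same bytes would be opaque, which is the sense in which format knowledge is a precondition for recovering content.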
FORMAT DEPENDENCIES IN REPOSITORY OPERATION
Digital repository operations can be divided into two broad categories: (1) those that are performed independently of the internal properties of the repository's digital objects; and (2) those that depend upon the internal characteristics of the objects or, in other words, their format. With regard to the latter category, format dependencies exist in many, if not most, phases of repository operation. Figure 1 presents an idealized repository workflow based on the Open Archival Information System (OAIS) reference model (International Organization for Standardization, 2003). Although originally developed by the space science community, the OAIS model defines a general approach that is broadly applicable to repositories operating in nonscientific domains. It has been widely adopted as the conceptual framework for repository architecture and operation and has become part of the lingua franca within the digital preservation community.
[FIGURE 1 OMITTED]
Ingest Dependencies
In OAIS terms, digital objects are delivered to an archive or repository in the form of a Submission Information Package (SIP), a conceptual data structure that encapsulates both primary content and representation information about that content. Representation information is information that is necessary to map object content into more meaningful constructs relative to some designated community--in other words, metadata (Holdsworth & Sergeant, 2000). The specific format of an object content stream within a SIP is an important technical component of SIP metadata.
The OAIS Ingest function is responsible for Quality Assurance (QA) validation of SIP content. Some repositories may operate under local policies or statutory regimes that mandate an obligation to accept all SIPs regardless of validation status, while others may implement more stringent policies that reject SIPs that are not well formed or well characterized. Regardless, it is a reasonable repository best practice to validate incoming SIP content streams relative to the stated or inferred formats of those streams. Even for repositories that do not use validation status as an acceptance criterion, that status is nevertheless an important preservation metadata property that characterizes the state of a digital object at the point of ingest. Validation is performed with respect to the specific syntactic and semantic rules established by the format to which a content stream purportedly conforms. The Ingest function is the most effective point at which to detect and remediate errors occurring in archival materials (National Archives and Records Administration et al., 1999). Once digital objects are accepted into a repository, where they may not be accessed for significant periods of time, effective channels of communication with the original creators to ascertain their authorial intent with respect to those objects may become difficult, if not impossible.
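As an illustration of checking a content stream against its stated or inferred format, the sketch below compares a stream's leading bytes with a few well-known format signatures and records the outcome as a metadata property. It is only a first step under stated assumptions: full validation would test all of a format's syntactic and semantic rules, and the function names, status values, and `reject_invalid` policy flag are hypothetical, not part of OAIS or any particular tool.

```python
from typing import Optional

# Well-known leading byte signatures, keyed by MIME type.
SIGNATURES = {
    "application/pdf": (b"%PDF-",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/gif": (b"GIF87a", b"GIF89a"),
}


def infer_format(stream: bytes) -> Optional[str]:
    """Infer a format from the stream's leading bytes, or return None."""
    for fmt, signatures in SIGNATURES.items():
        if stream.startswith(signatures):
            return fmt
    return None


def check_on_ingest(stream: bytes,
                    stated_format: Optional[str],
                    reject_invalid: bool = False) -> dict:
    """Compare stated and inferred formats and record the result.

    reject_invalid models a stricter local policy (hypothetical flag);
    with the default, every stream is accepted but its status is kept.
    """
    inferred = infer_format(stream)
    if inferred is None:
        status = "unknown"
    elif stated_format is None or inferred == stated_format:
        status = "consistent"
    else:
        status = "mismatch"
    accepted = not (reject_invalid and status != "consistent")
    return {
        "stated_format": stated_format,
        "inferred_format": inferred,
        "status": status,
        "accepted": accepted,
    }
```

Under this sketch, a repository obliged to accept all submissions would call `check_on_ingest(stream, stated)` and store the returned status with the object, while a stricter repository would pass `reject_invalid=True` and decline any stream whose status is not `"consistent"`.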
The Ingest function is also responsible for disaggregating a SIP, passing the descriptive metadata to the archive Data Management function, and transforming the SIP into an Archival Information Package (AIP) encapsulating primary content and administrative and technical metadata. It is not necessary for object content streams within an AIP to have the same formats as the corresponding content streams in the SIP. In the interest of data homogeneity and its concomitant impact on operational efficiencies, many repositories may choose to define a restricted set of canonical AIP formats to which SIP content streams are transformed during the SIP-to-AIP conversion process. Quality assurance checks must be applied subsequent to all content stream…