Text digitization reveals, better than image digitization, the spectrums of cost, complexity, quality, and effort involved in digitization. The number of input and output formats to manage in this category is large; the number and variety of source items deserving of digitization is staggering.
Used here, text refers to any source item composed of written pages. The support medium for each page may be paper, parchment, vellum, photostat, or any photographic format. All types of writing or printing are included in this category, including handwriting. Text sources may be in any language; some may be printed in multiple languages.
Books and other text sources may be bound or unbound. Finally, they also may contain images and other nontext components (such as covers, endpapers). These multimedia sources also fall into the text digitization category.
Digitized text refers to three major genres of machine-readable data, each with spectrums of quality to achieve in production:
* Page images--digital images of each page (not searchable)
* Text or hidden text--plain text (ASCII), either keyed (transcribed) from each page, or generated from page images via optical character recognition (OCR) to yield alphanumeric data for indexing and searching, and sometimes for display
* Encoded text--text with descriptive markup (SGML, XML, HTML) to support multiple uses across applications (including navigation among different parts or features of multipage documents); transcribed or generated via OCR, frequently with correction, since encoded text is often displayed
In many library digitization workflows, both page images and either hidden or encoded text are produced as a cost-effective approach to preservation and access. Page images convey original layout and appearance; text facilitates keyword searching.
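The pairing of page images with hidden text can be illustrated with a minimal Python sketch. The names `Page`, `image_file`, `ocr_text`, and `search_pages`, along with the sample file names and text, are hypothetical and appear here only for illustration. The sketch assumes the simplest possible arrangement: the text layer is searched, while the page image is what the reader is shown.

```python
from dataclasses import dataclass


@dataclass
class Page:
    sequence: int    # position of the page within the work
    image_file: str  # page image shown to the reader (hypothetical file name)
    ocr_text: str    # hidden text, used only for indexing and searching


def search_pages(pages, keyword):
    """Return the image files of pages whose hidden text contains the keyword."""
    needle = keyword.lower()
    return [p.image_file for p in pages if needle in p.ocr_text.lower()]


# Illustrative data only; a real project would load OCR output from disk.
pages = [
    Page(1, "0001.tif", "Annual report of the library"),
    Page(2, "0002.tif", "Circulation statistics for the year"),
]
hits = search_pages(pages, "circulation")
```

A production system would normally hand the hidden text to a full-text indexing engine rather than scan it with a substring match; the sketch only shows why both products are kept side by side.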
This chapter describes the program components necessary to produce discoverable, sustainable, and usable collections of digitized text for legacy collections of books, serials, manuscripts, archives, and other multipage works, as well as single-page source material--from small printed ephemera to oversize maps--whose meaningful content is primarily text, or text and line art.
Assuming that a library has already made appropriate program investments in digital library technology (Chapter 1) and digitization program management (Chapter 2), the baseline level of service for text digitization encompasses the staff, systems, and procedures necessary to manage all production tasks--from selection to delivery--for digitization projects.
Baseline text digitization services have the capacity to create page image or text products, with associated descriptive, structural, and administrative metadata that meet the following criteria:
* The digital reproduction is appropriately cataloged and discoverable, and the descriptive metadata are stored in a well-supported system
* The digital reproduction work can be opened and rendered as a properly sequenced, navigable multipage object
* The digital reproduction is appropriately named to be identified by some type of inventory control mechanism (ranging from a printed list to a complex database)
* All files corresponding to each page are appropriately structured, stored, identified, and documented (with administrative metadata) for ongoing management
* The copy can be reliably delivered by the library's (or partner organization's) designated applications for networked access
Fulfillment of these minimum criteria--whether measured against local or community definitions of what is appropriate or good (see, for example, NISO's Framework of Guidance for Building Good Digital Collections)--is presumed to offer the potential for sustainability.
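As a rough illustration, the five baseline criteria can be treated as a checklist applied to each digitized work. The Python sketch below is one possible reading, not a prescribed procedure: the `DigitalObject` record, its field names, and the `unmet_criteria` function are all hypothetical, and each check is a deliberately crude stand-in for a judgment that local or community standards would define more fully.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    # All field names are assumptions made for this sketch.
    identifier: str = ""                                # inventory control name
    catalog_record: dict = field(default_factory=dict)  # descriptive metadata
    page_files: list = field(default_factory=list)      # files in reading order
    admin_metadata: dict = field(default_factory=dict)  # keyed by file name
    delivery_url: str = ""                              # networked access point


def unmet_criteria(obj):
    """Return a list of baseline criteria the object does not yet satisfy."""
    problems = []
    if not obj.catalog_record:
        problems.append("no descriptive metadata")
    if not obj.page_files:
        problems.append("no page files to sequence")
    if not obj.identifier:
        problems.append("no identifier for inventory control")
    missing = [f for f in obj.page_files if f not in obj.admin_metadata]
    if missing:
        problems.append(
            "files lacking administrative metadata: " + ", ".join(missing)
        )
    if not obj.delivery_url:
        problems.append("no delivery location")
    return problems


# Illustrative object: cataloged and named, but only partly documented.
obj = DigitalObject(
    identifier="pam-0001",
    catalog_record={"title": "Sample pamphlet"},
    page_files=["0001.tif", "0002.tif"],
    admin_metadata={"0001.tif": {"device": "flatbed scanner"}},
)
report = unmet_criteria(obj)
```

An empty list from `unmet_criteria` would indicate only that the minimum is met; it says nothing about image or text quality, which belongs to the levels of service above the baseline.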
Levels of service above the baseline are required for any project that explicitly states requirements for pictorial quality in page images or that mandates production of text of high enough quality to be displayed.
The above-the-baseline services would also be instituted to support search and discovery with controlled vocabularies, and to increase the likelihood of sustainability through production of standards-compliant administrative metadata.
Baseline production services
Provided that downstream (post-digitization) systems are in place to store cataloging data and digitized text, and to make the catalog(s) and digital objects Internet accessible, key text digitization tasks (and their attendant standards, specifications, and best practices) requiring infrastructure are:
* Production of descriptive metadata (cataloging)
* Production of page images or encoded text (or both)
* Production of structural metadata
As with image digitization, baseline production infrastructure for text does not necessarily need to be extensive or costly to create an Internet-accessible collection of digitized text.
Provided that cataloging procedures are well-defined and implemented and that all digital products are adequately stored, librarians can--as several libraries have demonstrated--digitize newspapers (of small dimensions), sheet music, or pamphlets with a single flatbed scanner, and then deliver digital reproductions as portable document format (PDF) objects accessible via the Web.
Baseline production services for text digitization implicitly recognize the requirements to digitize pages and to digitize works. Particularly when production tasks are distributed among different specialists in assembly-line fashion, everyone engaged in the digitization project must understand and assimilate these two units of work.
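The distinction between the two units of work, pages and works, can be sketched in Python as a simple form of structural metadata. The sketch assumes a hypothetical file-naming convention of `<work id>_<sequence number>.<extension>`; neither the convention nor the function name `group_pages_into_works` comes from the chapter, and real projects often record sequence in a separate metadata file instead.

```python
from collections import defaultdict


def group_pages_into_works(file_names):
    """Group page files by work and order each work's pages by sequence number.

    Assumes names of the form '<work id>_<sequence>.<ext>' (a hypothetical
    convention used only for this sketch).
    """
    works = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]           # drop the file extension
        work_id, sequence = stem.rsplit("_", 1)  # split off the page number
        works[work_id].append((int(sequence), name))
    # Sort each work's pages numerically so that page 10 follows page 9.
    return {w: [n for _, n in sorted(pages)] for w, pages in works.items()}


# Scans often arrive out of order when tasks are split among specialists.
scans = ["songs_0002.tif", "news_0001.tif", "songs_0001.tif"]
structure = group_pages_into_works(scans)
```

Each scanned file is a page-level product, while each key in `structure` stands for a work that must open as a properly sequenced whole.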
When methodically planning the baseline text digitization infrastructure, the manager will account for tradeoffs in costs, quality, and sustainability in eight production arenas of text digitization, as presented below. The challenge is to configure systems, staffing, and procedures (services) that reflect and support …