The current study attempts to measure the extent to which "full view" volumes contained in Google Books constitute a viable generic research collection for works in the public domain, using as a reference collection the catalog of a major nineteenth-century research library and using as control collections--against which the reference catalog also would be searched--the online catalogs of two other major research libraries: one that was actively collecting during the same period and one that began actively collecting at a later date. A random sample of 398 entries was drawn from the Catalogue of the Library of the Boston Athenaeum, 1807-1871, and searched against Google Books and the online catalogs of the two control collections to determine whether Google Books constituted such a viable general research collection.
"There's an east wind coming, Watson."
"I think not, Holmes. It is very warm."
"Good old Watson! You are the one fixed point in a changing age. There's an east wind coming all the same, such a wind as never blew on England yet. It will be cold and bitter, Watson, and a good many of us may wither before its blast. But it's God's own wind none the less, and a cleaner, better, stronger land will lie in the sunshine when the storm has cleared."
--Arthur Conan Doyle, His Last Bow
On December 14, 2004, Google announced that it had concluded agreements with five major research libraries to begin what is now known as the Google Books Library Project. (1) The libraries--the so-called Google 5--were the New York Public Library and the libraries of Harvard, Michigan, Oxford, and Stanford universities. These libraries agreed to let Google digitize volumes from their printed book and serial collections in exchange for institutional copies of the digitized volumes. (2) While the agreements set broad parameters for cooperation, Google gave the libraries sole discretion in determining the volumes to be digitized.
The Library Project--and the discretion given the libraries in determining which volumes would be digitized--raises an interesting question: To what extent is Google creating a research collection? Coyle has suggested that the manner in which collections are being selected for inclusion in the Library Project--many being taken en bloc from low-use remote storage facilities--makes it difficult to characterize Google Books as a "collection" in the accepted sense, though for better or worse "it will become a de facto collection because people will begin using it for research." (3) Is this true? Is this testable? Can sheer volume, in fact, render moot the role of selection in this case? The current study attempted to answer these questions.
While the focus of this study was on content digitized by Google through 2008, one should keep in mind that the volume of available digitized content continues to grow. Since the initial Google 5 cooperative agreements at the end of 2004, Google has entered into agreements with an increasing number of research libraries, both in the United States and abroad, while the European Union has begun funding a digitization program of its own centered on the collections of European cultural heritage institutions (libraries, archives, and museums). (4) Initially, there also was competition from elsewhere in the commercial arena, but this proved to be comparatively short-lived. Within a year of the Google announcement, Microsoft, in cooperation with the Internet Archive, began to digitize print content from several libraries under the rubric of Live Search Books. In May 2008 this effort was abandoned, though content already digitized under that program--some 750,000 volumes--remained available via the Internet Archive. (5)
In terms of scope, several of the Library Project partnerships cover both older public domain materials and more recent publications still subject to copyright protection. To this extent they complement Google's partnerships with publishers to provide access to a continuity of content across time periods.
This continuity of content is important from Google's perspective. In Google's December 2004 press release, cofounder Larry Page set the Library Project in the context of his firm's stated mission "to organize the world's information and make it universally accessible and useful." (6) As a search engine, Google's principal interest in digitizing printed materials is in indexing the content, both structured and unstructured, to enhance search results. In its business model, Google uses search terms and results as triggers for the online display of related advertising. By providing additional indexed content from Google Books (and the Library Project), Google both increases the usefulness of its flagship search engine (by incorporating results from Google Books as well as other sources) and makes it more appealing to advertisers (by increasing the potential customer base to include researchers and other interested parties).
As has been noted frequently, Google is digitizing on an industrial scale, indeed on a scale unlike anything seen before. (7) The process is easy to describe. Books are removed from the shelves, barcodes are scanned--to change the volume's circulation status and to extract the related metadata from the catalog--and the volumes are removed to a Google facility for digitization. Google digitizes the individual pages, subjects the digitized images to sophisticated (if not foolproof) optical character recognition (OCR), and finally indexes the OCR-extracted text. The digitized page images may be freely available for public viewing (if determined to be in the public domain), or viewing may be restricted in some way, depending on the copyright status of the digitized work (or one or more of its components) and the nature of Google's agreements with the publisher. (8)
What Is a Research Collection?
While other digitization programs on various scales also are under way (as noted above), none approaches the scale or ambition (or potential for market dominance) of Google Books. For this reason the volumes digitized by Google seemed the most appropriate objects of which to ask: Do these digitized volumes in themselves now constitute a viable general research collection? This may seem a fairly straightforward question, but it raises an antecedent question: What is a research collection?
In the abstract, a research collection is a collection of materials used primarily to support research (as opposed to one that supports teaching and learning or one that is used primarily for recreational purposes). Unfortunately, this definition does not lend itself to objective measurement, and it says nothing about the content of such a collection, since, in theory, any collection can support research of some kind.
Indeed, most research collections are developed to address the needs of a particular research community, a community that will typically reflect a variety of research interests and intensities. This variety will itself change over time. Research collections are by their nature complex. Such complexity underpins the design of the Conspectus model developed in the early 1980s by the Research Libraries Group for cooperative collection development. The Conspectus asked participants to characterize their collections according to a variety of parameters, including research area (defined by ranges within a bibliographic classification scheme or by subject descriptors), language, geographical scope, chronological periods, formats, and collection depth (this on a five-point scale, with 4 indicating "research level"). (9) This produced a nice matrix for describing the variety of possible research collections, but it also made clear that the idea of a "generic" research collection was an oxymoron.
Ideally, in addition to being designed for a particular research community, a research collection also satisfies the needs of that community. But while research collections can come in a variety of shapes and sizes, data suggest that whatever their shape and size, local researchers will always feel that their own library's collection falls short--this despite years of earnest collection development by librarians. At member institutions of the Association of Research Libraries, for example, respondents to successive LibQual+ library service quality surveys routinely report that their libraries provide inadequate support for their research needs. On three LibQual+ items measuring collection support for research, the "adequacy gap"--the degree to which an item exceeds (or not) a user's minimum requirements--has typically been a negative number. (10) To put as generous a spin as possible on the meaning of these LibQual+ responses, one could say that research collections always must be works in progress.
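The arithmetic behind the adequacy gap can be illustrated with a minimal sketch, assuming the gap is computed per respondent as the perceived level of service minus the minimum acceptable level and then averaged. The function name, the rating scale, and the sample ratings below are hypothetical illustrations, not survey data.

```python
from statistics import mean


def adequacy_gap(perceived, minimum):
    """Return the perceived score minus the minimum acceptable score."""
    return perceived - minimum


# Hypothetical ratings for one survey item, one pair per respondent:
# (minimum acceptable level, perceived level). Invented for illustration.
responses = [(7, 6), (6, 6), (8, 6), (7, 7)]

# A negative gap means perceived service falls below the respondent's minimum.
gaps = [adequacy_gap(perceived, minimum) for minimum, perceived in responses]
mean_gap = mean(gaps)

print(gaps)
print(mean_gap)
```

In this invented example two of the four respondents rate the collection below their own minimum, so the mean gap is negative, which is the pattern the survey results described above would produce.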
Although there may be no such thing as a typical research collection, at major research universities--where the research communities are larger and more multidisciplinary--a certain amount of homogeneity can be expected to develop across the associated research collections. It is not unreasonable then to treat one of these collections as approximating a "generic" large university research collection.
Having posited that the collections of large research libraries approximate to a generic research collection, a question remains about how much unique content is found in a typical research library. No one knows for sure, but overlap studies suggest it is more than one might expect. (11) In a 2005 study (shortly after the announcement of the Google Books Library Project) Lavoie, Connaway, and Dempsey determined that the Google 5 libraries collectively held about one-third of the resources cataloged as books in the OCLC WorldCat database, and most of these resources--61 percent--were unique to just one of the five. This percentage of uniquely held resources increased with the age of the resources involved, with 74 percent of resources published between 1801 and 1825 being held by just one Google 5 library. (12) However, there is unique and then there is unique. In subsequent research, Lavoie and Schonfeld examined a random sample of one hundred WorldCat records for such "uniquely held" resources and found that "many of the English language materials appear to be locally-produced ephemera rather than traditional published books." (13) This …