ABSTRACT
The Web is a virtually infinite information space, and archiving it in its entirety, in all its aspects, is a utopian goal. The volume of information presents a challenge, but it is neither the only nor the most limiting factor, given the continuous drop in storage device costs. Significant challenges lie in the management and technical issues of locating and collecting Web sites. Consequently, archiving the Web is a task that no single institution can carry out alone. This article presents various approaches undertaken today by different institutions; it discusses their focuses, strengths, and limits, as well as a model for appraising them and identifying potential complementary aspects among them. It then compares the discovery accuracy of the snapshot approach taken by the Internet Archive (IA) with that of the event-based collection carried out by the Bibliothèque Nationale de France (BNF) in 2002 for the presidential and parliamentary elections. The balanced conclusion of this comparison allows for the identification of future directions for improving the former approach.
A VIRTUALLY INFINITE INFORMATION SPACE
Assessing the size of the Web is a difficult task, and the many attempts made so far to provide a reliable estimate have met with limited success. We will not review these attempts here but will instead outline the major changes the Web has introduced and discuss their impact on Web archiving.
Authorship Revolution
The blog phenomenon is the most recent illustration of this revolution, but the idea dates back to the Web's origins: the first Web browser, designed and coded by Tim Berners-Lee, included an authoring tool, which he considered to be an essential piece of the new system (Gillies & Cailliau, 2000; Berners-Lee & Fischetti, 2000). Despite the subsequent omission of authoring tools from Web browsers, the Web has continued to offer an open publishing platform with global accessibility and continuous updating capacity.
This has dramatically changed the setting for publication, allowing almost anyone to bypass the traditional publishing actors and gain direct access to a potentially unlimited audience. The eventual impact of this change remains to be seen, but several consequences for archiving are already tangible.
The first important change is the end of an object's stability, with obvious impacts for archiving, an activity that in essence consists of capturing the state of an object at a point in time. The Web offers the ability to update content at any moment without notification (unless additional notification mechanisms such as Really Simple Syndication [RSS] feeds are in place), which poses a great challenge for archivists. Revisiting pages consumes resources, even if heuristics can be found to alleviate this process (Clausen, 2004). Choosing an appropriate capture frequency can be problematic because, to be efficient, the choice should in most cases be made at the page level. It amounts to assessing the probability of losing intermediary updates between two captures.
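The trade-off between capture frequency and lost updates can be made concrete with a small sketch. It assumes, purely for illustration, that a page's updates follow a Poisson process with a known average rate; this model is not taken from the article, and the function names `prob_missed_update` and `max_interval_days` are hypothetical. Under that assumption, an intermediary version is lost whenever two or more updates fall between consecutive captures, since only the last one is seen.

```python
import math


def prob_missed_update(rate_per_day: float, interval_days: float) -> float:
    """Probability that at least one intermediary version is lost.

    Assumption (not from the article): updates arrive as a Poisson
    process with `rate_per_day` updates per day on average.
    """
    mean = rate_per_day * interval_days
    # P(N >= 2) = 1 - P(N = 0) - P(N = 1) for a Poisson count N.
    return 1.0 - math.exp(-mean) * (1.0 + mean)


def max_interval_days(rate_per_day: float, max_loss: float) -> float:
    """Longest capture interval whose loss probability stays within max_loss."""
    if not 0.0 < max_loss < 1.0:
        raise ValueError("max_loss must be strictly between 0 and 1")
    if rate_per_day <= 0.0:
        return math.inf  # a page that never changes can be captured once
    lo, hi = 0.0, 1.0
    # Grow the upper bound until it exceeds the tolerated loss.
    while prob_missed_update(rate_per_day, hi) < max_loss:
        hi *= 2.0
    # Bisection: the loss probability increases with the interval.
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if prob_missed_update(rate_per_day, mid) <= max_loss:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # A page updated about once a day, tolerating a 5% chance of loss.
    print(max_interval_days(rate_per_day=1.0, max_loss=0.05))
```

Because update rates differ widely from page to page, a per-page rate leads to a per-page interval, which is why a single site-wide frequency tends to be either wasteful or lossy.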
Content Shaping
In addition to the change in the publication process, an important shift has occurred in the nature of documents themselves. The proliferation of citations that the hypertext environment allows induces a tremendous tendency toward dispersal of content, which archivists have to take into account in their approach. Web documents at the page level (but also the site level) hardly ever make sense alone. They are mingled in a larger document network that forms what Nelson named a "docuverse" (Nelson, 1992). From this perspective, archiving means extracting slices of the Web that constitute a whole metadocument (Landow, 1997); that is, spatially sampling the Web and making decisions each time regarding the exact perimeter of what to include, being aware that with time noninclusion means loss. For example, does archiving a site mean leaving out any document linked outside of its domain? If not, to what depth should external links be followed? There is no general answer to these questions, only specific ones based on the ultimate goal driving the archiving.
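One way to express such a perimeter decision is as an explicit scope rule. The sketch below is a hypothetical policy, not one described by any of the institutions discussed: pages on the seed's host are always in scope, and external links are followed only up to a configurable number of hops away from the site. The names `same_site`, `offsite_hops`, and `in_scope` are illustrative.

```python
from urllib.parse import urlsplit


def same_site(url: str, seed_url: str) -> bool:
    """True when both URLs share the same host name (case-insensitive)."""
    return urlsplit(url).hostname == urlsplit(seed_url).hostname


def offsite_hops(parent_hops: int, url: str, seed_url: str) -> int:
    """Number of consecutive off-site links followed to reach `url`.

    `parent_hops` is the value for the page on which the link was found.
    Returning to the seed site resets the count to zero.
    """
    return 0 if same_site(url, seed_url) else parent_hops + 1


def in_scope(url: str, seed_url: str, parent_hops: int, max_offsite_hops: int) -> bool:
    """Decide whether a discovered link belongs to the archived perimeter."""
    return offsite_hops(parent_hops, url, seed_url) <= max_offsite_hops


if __name__ == "__main__":
    seed = "http://example.org/"
    # With max_offsite_hops=0 the archive stops at the site boundary.
    print(in_scope("http://other.example/page", seed, 0, 0))
    # With max_offsite_hops=1 directly cited external pages are kept.
    print(in_scope("http://other.example/page", seed, 0, 1))
```

Setting `max_offsite_hops` is exactly the kind of choice that depends on the goal of the archive: zero preserves a site in isolation, while larger values keep more of the surrounding document network at a higher collection cost.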
Choices also have to be made concerning what characteristics or functionalities are to be preserved. When a site is not primarily a collection of static pages, an archivist may focus on the interaction of functionalities (not only for navigation) and more generally the experience the site provides (1) in the archival context.
Convergence
It is worth noting that the Web is not only a platform that absorbs previously existing Internet applications (mail, FTP, news) as well as non-Internet-based applications (databases, document repositories, and various information systems); it also tends to be an entry point for almost everything today. This is a clear consequence of the design adopted for Uniform Resource Identifiers (URI), which Tim Berners-Lee insists is the most important standard of the Web (Berners-Lee & Fischetti, 2000). The prefix, the use of the Domain Name System (DNS) for host naming, and the flexibility offered to Webmasters regarding the right-hand part of the URI together make the URI a powerful unifying standard. But for archivists this means almost everything can end up in their nets. If they want to focus on published material in the traditional sense of the word, they might want to filter out online forums, for instance, or avoid diving into huge databases. Clues can be used to limit the archiving, for instance URI pattern detection (search engines have long used it to avoid dynamically generated content, which they identify through URI-embedded queries). This can extend to filtering content on the fly or during post-processing.
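URI pattern detection of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the excluded path patterns (`/forum/`, `/cgi-bin/`) and the function name `should_archive` are hypothetical choices, not rules used by any particular archive, and a real filter would be tuned to the collection policy.

```python
import re
from urllib.parse import urlsplit

# Hypothetical path patterns for material outside the collection policy.
EXCLUDED_PATH_PATTERNS = [
    re.compile(r"/forums?/", re.IGNORECASE),
    re.compile(r"/cgi-bin/", re.IGNORECASE),
]


def should_archive(uri: str, allow_queries: bool = False) -> bool:
    """Return True when the URI passes the pattern-based filter."""
    parts = urlsplit(uri)
    # Only Web resources are considered here.
    if parts.scheme not in ("http", "https"):
        return False
    # A query string usually signals dynamically generated content.
    if parts.query and not allow_queries:
        return False
    # Reject paths that match any excluded pattern.
    return not any(p.search(parts.path) for p in EXCLUDED_PATH_PATTERNS)


if __name__ == "__main__":
    for uri in (
        "http://example.org/index.html",
        "http://example.org/search?q=archive",
        "http://example.org/forum/thread1",
    ):
        print(uri, should_archive(uri))
```

The same predicate can be applied on the fly, before a URI is fetched, or during post-processing, when already collected documents are pruned.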
Technique
Even when the target is clearly identified and delimited, content acquisition can be an issue. Automatic tools for content gathering such as crawlers (also called spiders) (2) allow massive content acquisition at relatively low cost. With standard desktop computers and a Digital Subscriber Line (DSL) connection, it is possible today to retrieve millions of documents per week, even per day. Crawlers are also powerful and systematic tools for exploring the Web and discovering new sites through links even when starting from a very small set of seed sites.
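The discovery behavior described here is essentially a breadth-first traversal of the link graph. The following sketch illustrates it on a small in-memory graph rather than the live Web; `TOY_WEB`, `discover`, and the `get_links` callback are hypothetical names, and a real crawler would also need fetching, parsing, politeness delays, and error handling.

```python
from collections import deque
from typing import Callable, Iterable, List

# A toy link graph standing in for the Web (page -> outgoing links).
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def discover(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,
) -> List[str]:
    """Visit pages breadth-first from the seeds, up to max_pages pages."""
    start = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(start)
    queue = deque(start)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in get_links(page):
            if target not in seen:  # schedule each page only once
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    # Starting from a single seed, all three sites are discovered.
    print(discover(["a.example/"], lambda page: TOY_WEB.get(page, [])))
```

Starting from one seed, the traversal reaches every site connected to it by links, which is what makes crawlers effective for discovery and also why pages with no inbound path remain invisible to them.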
There are severe crawler limitations, however, when it comes to finding a path to certain types of documents. First, access to sites or parts of sites can be restricted (with password or Internet Protocol [IP]…