
Web archiving methods and approaches: a comparative study.

Library Trends

June 22, 2005 | Masanes, Julien | COPYRIGHT 2008 Johns Hopkins University Press. This material is published under license from the publisher through the Gale Group, Farmington Hills, Michigan. All inquiries regarding rights should be directed to the Gale Group.

ABSTRACT

The Web is a virtually infinite information space, and archiving it in its entirety, in all its aspects, is utopian. The volume of information presents a challenge, but given the continuous drop in storage device costs it is neither the only limiting factor nor the most severe one. Significant challenges lie in the managerial and technical issues of locating and collecting Web sites. Consequently, archiving the Web is a task that no single institution can carry out alone. This article presents the approaches undertaken today by different institutions; it discusses their focuses, strengths, and limits, and offers a model for appraising them and identifying where they may complement one another. It then compares the discovery accuracy of the snapshot approach taken by the Internet Archive (IA) with that of the event-based collection carried out by the Bibliothèque Nationale de France (BNF) in 2002 for the presidential and parliamentary elections. The balanced conclusion of this comparison points to future directions for improving the former approach.

A VIRTUALLY INFINITE INFORMATION SPACE

Assessing the size of the Web is a difficult task, and many attempts to provide a reliable estimate of it have been made so far with limited success. We will not review these attempts here but instead outline major changes the Web has introduced and discuss their impact for Web archiving.

Authorship Revolution

The blog phenomenon is only the most recent illustration of this revolution. The first Web browser, designed and coded by Tim Berners-Lee, already included an authoring tool, which he considered to be an essential piece of the new system (Gillies & Cailliau, 2000; Berners-Lee & Fischetti, 2000). Despite the subsequent omission of authoring tools from Web browsers, the Web has continued to offer an open publishing platform with global accessibility and continuous updating capacity.

This has dramatically changed the setting for publication, allowing almost anyone to bypass the traditional publishing actors and gain direct access to a potentially unlimited audience. The eventual impact of this change remains to be seen, but several consequences for archiving are already tangible.

The first important change is the end of an object's stability, with obvious impacts for archiving--an activity that in essence consists of capturing the state of an object at a point in time. The Web offers the ability to update content at any moment without notification (unless additional notification mechanisms such as Really Simple Syndication [RSS] feeds are in place), which poses a great challenge for archivists. Revisiting pages consumes resources, even if heuristics can be found to lighten this process (Clausen, 2004). Choosing an appropriate capture frequency can be problematic because, to be efficient, it should be done at the page level in most cases. Making that choice is in effect equivalent to assessing the probability of losing some intermediary updates between two captures.
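The link between capture frequency and the probability of losing intermediary updates can be made concrete with a small sketch. It assumes, purely for illustration, that a page changes according to a Poisson process with a known average rate; this model and the function name `prob_missed_version` are assumptions of the example, not something the approaches discussed here prescribe. Under that assumption, an intermediate version is lost whenever two or more changes occur between consecutive captures.

```python
import math


def prob_missed_version(change_rate: float, interval: float) -> float:
    """Probability that at least one intermediate version is never captured.

    Assumes (illustratively) that changes follow a Poisson process.
    change_rate: average number of changes per unit of time.
    interval:    time between two consecutive captures, in the same unit.
    A version is lost when two or more changes fall inside one interval,
    so the result is 1 - P(0 changes) - P(1 change).
    """
    expected_changes = change_rate * interval
    return 1.0 - math.exp(-expected_changes) * (1.0 + expected_changes)


# A hypothetical page that changes on average once a week,
# captured either weekly or daily (time unit: days).
weekly = prob_missed_version(change_rate=1 / 7, interval=7)
daily = prob_missed_version(change_rate=1 / 7, interval=1)
print(f"weekly capture: {weekly:.3f}, daily capture: {daily:.3f}")
```

Because change rates differ widely from page to page, a calculation of this kind would have to be repeated per page, which is one way to read the remark that frequency should be chosen at the page level.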

Content Shaping

In addition to the change in the publication process, an important shift has occurred in the nature of documents themselves. The proliferation of citations that the hypertext environment allows induces a tremendous tendency toward dispersal of content, which archivists have to take into account in their approach. Web documents at the page level (but also the site level) hardly ever make sense alone. They are mingled in a larger document network that forms what Nelson named a "docuverse" (Nelson, 1992). From this perspective, archiving means extracting slices of the Web that constitute a whole metadocument (Landow, 1997); that is, spatially sampling the Web and making decisions each time regarding the exact perimeter of what to include, being aware that with time noninclusion means loss. For example, does archiving a site mean leaving out any document linked outside of its domain? If not, to what depth should external links be followed? There is no general answer to these questions, only specific ones based on the ultimate goal driving the archiving.
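The perimeter questions raised above (whether to leave the domain, and to what depth) can be expressed as an explicit scoping rule. The following is a minimal sketch of one possible rule, not a recommendation: the function `in_scope` and its parameter `max_external_hops` are hypothetical names introduced here, and the right setting depends on the goal driving the archiving.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed_host: str, external_hops: int,
             max_external_hops: int = 1) -> bool:
    """Decide whether a discovered URL belongs to the archived slice.

    external_hops is the number of links followed outside seed_host to
    reach this URL (1 for a document linked directly from the site).
    Documents on the seed host are always kept; external documents are
    kept only up to max_external_hops (0 means "stay on the site").
    """
    host = (urlsplit(url).hostname or "").lower()
    if host == seed_host.lower():
        return True
    return external_hops <= max_external_hops


# Archiving example.org together with documents it links to directly:
print(in_scope("http://example.org/about", "example.org", external_hops=0))
print(in_scope("http://other.example/cited.html", "example.org", external_hops=1))
print(in_scope("http://third.example/far.html", "example.org", external_hops=2))
```

Making the rule a single function keeps the archival decision visible and adjustable per collection, rather than buried in crawler configuration.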

Choices also have to be made concerning what characteristics or functionalities are to be preserved. When a site is not primarily a collection of static pages, an archivist may focus on the interaction of functionalities (not only for navigation) and more generally the experience the site provides (1) in the archival context.

Convergence

It is worth noting that the Web not only absorbs previously existing Internet applications (mail, FTP, news) as well as non-Internet-based applications (database, document repository, and various information systems), but also tends to be an entry point for almost everything today. This is a clear consequence of the design adopted for Uniform Resource Identifiers (URI), which Tim Berners-Lee insists is the most important standard of the Web (Berners-Lee & Fischetti, 2000). The scheme prefix, the use of the Domain Name System (DNS) for host naming, and the flexibility offered to Webmasters regarding the right-hand part of the URI together make the URI a powerful unifying standard. But for archivists this means almost everything can end up in their nets. If they want to focus on published material in the traditional sense of the word, they might want to filter out online forums, for instance, or avoid diving into huge databases. Clues can be used to limit the archiving, for instance, URI pattern detection (this has long been the case with search engines avoiding any dynamically generated content based on URI-embedded queries). This can extend to filtering content on the fly or during post-processing.
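URI pattern detection of the kind mentioned above can be sketched in a few lines. The example below is illustrative only: the function `should_skip` and the list `EXCLUDED_PATH_PATTERNS` are hypothetical, and the forum pattern is a guess at how such sections might be named on a given site, not a rule that holds across the Web.

```python
import re
from urllib.parse import urlsplit

# Hypothetical path patterns an archivist might choose to exclude.
EXCLUDED_PATH_PATTERNS = [
    re.compile(r"/(forum|forums|board)(/|$)", re.IGNORECASE),
]


def should_skip(url: str) -> bool:
    """Return True when the URI alone suggests the content is out of scope."""
    parts = urlsplit(url)
    # A URI-embedded query usually signals dynamically generated content.
    if parts.query:
        return True
    # Otherwise fall back to path-based clues such as forum sections.
    return any(p.search(parts.path) for p in EXCLUDED_PATH_PATTERNS)


for candidate in (
    "http://example.org/articles/2005/archiving.html",
    "http://example.org/search?q=web+archive",
    "http://example.org/forum/thread-12",
):
    print(candidate, "->", "skip" if should_skip(candidate) else "keep")
```

A filter like this acts before any content is fetched; filtering on the fly or in post-processing would instead inspect the retrieved document itself.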

Technique

Even when the target is clearly identified and delimited, content acquisition can be an issue. Automatic tools for content gathering such as crawlers (also called spiders) (2) allow massive content acquisition at relatively low cost. With standard desktop computers and a Digital Subscriber Line (DSL) connection, it is possible today to retrieve millions of documents per week, even per day. Crawlers are also powerful and systematic tools for exploring the Web and discovering new sites through links even when starting from a very small set of seed sites.
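How a crawler discovers new sites from a very small set of seeds can be illustrated with a breadth-first traversal. To stay self-contained, the sketch below walks a toy in-memory link graph instead of fetching pages; `discover_hosts`, `TOY_LINKS`, and the `.example` hosts are all invented for the illustration, and a real crawler would add fetching, link extraction, politeness delays, and robots exclusion handling.

```python
from collections import deque
from urllib.parse import urlsplit


def discover_hosts(seeds, links, max_pages=1000):
    """Breadth-first traversal of a link graph; returns hosts in discovery order.

    links maps a URL to the URLs it points to, standing in for the
    fetch-and-parse step of a real crawler.
    """
    seeds = list(seeds)
    queue = deque(seeds)
    seen = set(seeds)
    hosts = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        host = urlsplit(url).hostname
        if host and host not in hosts:
            hosts.append(host)
        for target in links.get(url, []):
            if target not in seen:  # never queue the same URL twice
                seen.add(target)
                queue.append(target)
    return hosts


# Toy link graph: one seed site leads to three further hosts.
TOY_LINKS = {
    "http://a.example/": ["http://a.example/page", "http://b.example/"],
    "http://a.example/page": ["http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
}

print(discover_hosts(["http://a.example/"], TOY_LINKS))
```

The `max_pages` bound is a reminder that, on the real Web, such a traversal has no natural end and must be cut off by a budget or a scoping decision.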

There are severe crawler limitations, however, when it comes to finding a path to certain types of documents. First, access to sites or parts of sites can be restricted (with password or Internet Protocol [IP]…
