We review some notable studies on the growth of the Internet and on technologies useful for information search and retrieval on the Web. Writing about the Web is a challenging task for several reasons, of which we mention three. First, its dynamic nature guarantees that at least some portions of any manuscript on the subject will be out of date before it reaches the intended audience, particularly URLs that are referenced. Second, comprehensive coverage of all of the important topics is impossible, because so many new ideas are constantly being proposed and are either quickly accepted into the Internet mainstream or rejected. Finally, as with any review paper, there is a strong bias toward presenting topics closely related to the authors' background, with only cursory treatment given to those of which they are relatively ignorant. In an attempt to compensate for oversights and biases, references to relevant works that describe or review concepts in depth will be given whenever possible. This being said, we begin with references to several excellent books that cover a variety of topics in information management and retrieval. They include Information Retrieval and Hypertext [Agosti and Smeaton 1996]; Modern Information Retrieval [Baeza-Yates and Ribeiro-Neto 1999]; Text Retrieval and Filtering: Analytic Models of Performance [Losee 1998]; Natural Language Information Retrieval [Strzalkowski 1999]; and Managing Gigabytes [Witten et al. 1994]. Some older, classic texts, which are slightly outdated, include Information Retrieval [Frakes and Baeza-Yates 1992]; Information Storage and Retrieval [Korfhage 1997]; Intelligent Multimedia Information Retrieval [Maybury 1997]; Introduction to Modern Information Retrieval [Salton and McGill 1983]; and Readings in Information Retrieval [Jones and Willett 1997].
Additional references are to special journal issues on search engines on the Internet [Scientific American 1997]; digital libraries [CACM 1998]; digital libraries, representation and retrieval [IEEE 1996b]; next-generation graphical user interfaces (GUIs) [CACM 1994]; Internet technologies [CACM 1994; IEEE 1999]; and knowledge discovery [CACM 1999]. Some notable survey papers are those by Chakrabarti and Rajagopalan; Faloutsos and Oard; Feldman; Gudivada et al.; Leighton and Srivastava; Lawrence and Giles [1998b; 1999b]; and Raghavan. Extensive, up-to-date coverage of topics in Web-based information retrieval and knowledge management can be found in the proceedings of several conferences, such as the International World Wide Web Conferences [WWW Conferences 2000] and the Association for Computing Machinery's Special Interest Group on Computer-Human Interaction [ACM SIGCHI] and Special Interest Group on Information Retrieval [ACM SIGIR] conferences.
This paper is organized as follows. In the remainder of this section, we discuss and point to references on ratings of search engines and their features, the growth of information available on the Internet, and the growth in the number of users. In the second section we present tools for Web-based information retrieval. These include classical retrieval tools (which can be used as is or with enhancements specifically geared toward Web-based applications), as well as a new generation of tools that have developed alongside the Internet. Challenges that must be overcome in developing and refining new and existing technologies for the Web environment are also discussed. In the concluding section, we speculate on future directions in research related to Web-based information retrieval that may prove fruitful.
1.1 Ratings of Search Engines and Their Features
About 85% of Web users surveyed claim to be using search engines or some kind of search tool to find specific information of interest. The list of publicly accessible search engines has grown enormously in the past few years (see, e.g., blueangels.net), and there are now lists of top-ranked query terms available online (see, e.g.,
One of the keys to becoming a popular and successful search engine lies in the development of new algorithms specifically designed for fast and accurate retrieval of valuable information. Other features that make a search or portal site highly competitive are unusually attractive interfaces, free email addresses, and free access time [Chandrasekaran 1998]. Quite often, these advantages last at most a few weeks, since competitors keep track of new developments (see, e.g.,
"Lycos, one of the biggest and most popular search engines, is legendary for its unavailability during work hours." [Webster and Paul 1996]
There are many publicly available search engines, but users are not necessarily satisfied with the different formats for inputting queries, speeds of retrieval, presentation formats of the retrieval results, and quality of retrieved information [Lawrence and Giles 1998b]. In particular, speed (i.e., search engine search and retrieval time plus communication delays) has consistently been cited as "the most commonly experienced problem with the Web" in the biannual WWW surveys conducted at the Graphics, Visualization, and Usability Center (GVU) of the Georgia Institute of Technology.(1) In the past three surveys, spanning a period of a year and a half, 63% to 66% of Web users were dissatisfied with the speed of retrieval and communication delay, and the problem appears to be growing worse. Even though 48% of the respondents in the April 1998 survey had upgraded their modems in the past year, 53% of the respondents left a Web site while searching for product information because of "slow access." "Broken links" registered as the second most frequent problem in the same survey. Other studies also cite the number one and number two reasons for dissatisfaction as "slow access" and "the inability to find relevant information," respectively [Huberman and Lukose 1997; Huberman et al. 1998]. In this paper we elaborate on some of the causes of these problems and outline some promising new approaches being developed to resolve them.
It is important to remember that problems related to speed and access time may not be resolved by considering Web-based information access and retrieval as an isolated scientific problem. An August 1998 survey by Alexa Internet
The volume of published information about search engines has exploded in the past year. Some valuable resources are cited below. The University of California at Berkeley has extensive Web pages on "how to choose the search tools you need"
The work of Lidsky and Kwon is an opinionated but informative resource on search engines. It describes 36 different search engines and rates them on specific details of their search capabilities. For instance, in one study, searches are divided into five categories: (1) simple searches; (2) custom searches; (3) directory searches; (4) current news searches; and (5) Web content. The five categories of search are evaluated in terms of power and ease of use; the ratings for a given search engine sometimes differ substantially across categories. Similarly, query tests are conducted according to five criteria: (1) simple queries; (2) customized queries; (3) news queries; (4) duplicate elimination; and (5) dead link elimination. Once again, the ratings for a given search engine sometimes differ substantially across criteria. In addition to ratings, the authors give charts on the search indexes and directories associated with twelve of the search engines, and rate them in terms of specific features for complex searches and content. The data indicate that as the number of people using the Internet and Web has grown, user types have diversified, and search engine providers have begun to target more specific types of users and queries with specialized and tailored search tools.
The Web site Search Engine Watch is another valuable resource.
A note of caution: published data on the Internet and the Web are very difficult to measure and verify, so the figures in the paragraphs above and below should be digested with care. GVU offers a solid piece of advice on the matter:
"We suggest that those interested in these (i.e., Internet / WWW statistics and demographics) statistics should consult several sources; these numbers can be difficult to measure and results may vary between different sources." [GVU's WWW user survey]
Although the details reported by different popular sources vary, the overall trends are documented fairly consistently. We present survey results from several of these sources below.
1.2 Growth of the Internet and the Web
Schatz of the National Center for Supercomputing Applications (NCSA) estimates that the number of Internet users increased from 1 million to 25 million in the five years leading up to January of 1997. Strategy Alley gives a number of statistics on Internet users: Matrix Information and Directory Services (MIDS), an Internet measurement organization, estimated there were 57 million users on the consumer Internet worldwide in April of 1998, and that the number would increase to 377 million by 2000; Morgan Stanley estimates 150 million users in 2000; and Killen and Associates estimate 250 million in 2000. Nua's surveys
Most data on the amount of information on the Internet (i.e., volume, number of publicly accessible Web pages and hosts) show tremendous growth, and the sizes and numbers appear to be growing at an exponential rate. Lynch has documented the explosive growth of Internet hosts; the number of hosts has been roughly doubling every year. For example, he estimates that it was 1.3 million in January of 1993, 2.2 million in January of 1994, 4.9 million in January of 1995, and 9.5 million in January of 1996. His last data point is 12.9 million in July of 1996 [Lynch 1997]. Strategy Alley cites similar figures: "Since 1982, the number of hosts has doubled every year." And an article by the editors of IEEE Internet Computing magazine states that exponential growth of Internet hosts was observed in separate studies by several experts [IEEE 1998a], such as Mark Lottor of Network Wizards
The number of publicly accessible pages is also growing at an aggressive pace. Smith estimates that in January of 1997 there were 80 million public Web pages, and that the number would subsequently double annually. Bharat and Broder estimated that in November of 1997 the total number of Web pages was over 200 million. If both of these estimates are correct, then the rate of increase is even higher than Smith's prediction, i.e., the number of pages more than doubled per year (a back-of-the-envelope annualization is given after the quotation below). In a separate estimate [Monier 1998], the chief technical officer of AltaVista estimated that the volume of publicly accessible information on the Web grew from 50 million pages on 100,000 sites in 1995 to between 100 and 150 million pages on 600,000 sites in June of 1997. Lawrence and Giles summarize Web statistics published by others: 80 million pages in January of 1997 by the Internet Archive [Cunningham 1997], 75 million pages in September of 1997 by Forrester Research Inc. [Guglielmo 1997], Monier's estimate (mentioned above), and 175 million pages in December 1997 by Wired Digital. They then conducted their own experiments to estimate the size of the Web and concluded that:
"it appears that existing estimates significantly underestimate the size of the Web." [Lawrence and Giles 1998b]
Follow-up studies by Lawrence and Giles [1999a] estimate that the number of publicly indexable pages on the Web at that time was about 800 million (comprising a total of about 6 terabytes of text data) on about 3 million servers (Lawrence's home page:
Given the enormous volume of Web pages in existence, it comes as no surprise that Internet users are increasingly using search engines and search services to find specific information. According to Brin and Page, the World Wide Web Worm (home pages:
"it is likely that top search engines will handle hundreds of millions (of queries) per day by the year 2000." [Brin and Page 1998]
The results of GVU's April 1998 WWW user survey indicate that about 86% of people now find useful Web sites through search engines, and 85% find them through hyperlinks in other Web pages; that is, people now use search engines as much as they surf the Web to find information.
1.3 Evaluation of Search Engines
Several different measures have been proposed to quantify the performance of classical information retrieval systems (see, e.g., Losee; Manning and Schutze), most of which can be straightforwardly extended to evaluate Web search engines. However, Web users may tend to weight some performance issues more strongly than traditional users of information retrieval systems. For example, interactive response times appear to be at the top of the list of important issues for Web users (see Section 1.1), along with the number of valuable sites listed on the first page of retrieved results (i.e., ranked in the top 8, 10, or 12), so that the scroll-down or next-page button does not have to be invoked to view the most valuable results.
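For reference, the classical effectiveness measures discussed in the next paragraph, together with the "first page of results" concern, can be written in the standard textbook form (this is the generic formulation found in the texts cited above, not a definition tied to any particular search engine). Let $R$ denote the set of documents relevant to a query and $A$ the set of documents the system actually retrieves; then

\[
\text{precision} = \frac{|R \cap A|}{|A|}, \qquad
\text{recall} = \frac{|R \cap A|}{|R|}, \qquad
\text{precision@}k = \frac{|R \cap A_k|}{k},
\]

where $A_k$ is the set of the top $k$ ranked results (e.g., $k = 10$ for a typical first page of retrieved results).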
Some traditional measures of information retrieval system performance are recognized, in modified form, by Web users. For example, a basic model from traditional retrieval systems recognizes a three-way trade-off between the speed of information retrieval, precision, and recall (illustrated in Figure 1). This trade-off becomes increasingly …