Whenever librarians use Semantic Web services and standards for representing data, they also generate graphs, whether they intend to or not. Graphs are a new data model for libraries and librarians, and they present new opportunities for library services. In this paper we introduce graph theory and explore its real and potential applications in the context of digital libraries. Part 1 describes basic concepts in graph theory and how graph theory has been applied by information retrieval systems such as Google. Part 2 discusses practical applications of graph theory in digital library environments. Some of the applications have been prototyped at the Los Alamos National Laboratory Research Library, others have been described in peer-reviewed journals, and still others are speculative in nature. The paper is intended to serve as a high-level tutorial to graphs in libraries.
Part 1. Introduction to Graph Theory
Complexity surrounds us, and in the twenty-first century, our attempts at organization and structure sometimes lead to more complexity. In layman's terms, complexity refers to problems and objects that have many distinct but interrelated issues or components. There is also an interdisciplinary field referred to as "complex systems," which investigates emergent properties, such as collective intelligence. (1) Emergent properties are an embodiment of the old adage "the whole is greater than the sum of its parts": behaviors or characteristics of a system "where the parts don't give a real sense of the whole." (2) Libraries reside at the nexus of these two definitions: they are creators and caretakers of complex data sets (metadata), and they are the source of explicit records of the complex, evolving intellectual and social relationships that underlie the growth of knowledge.
Digital libraries are complex systems. Patrons visit libraries hoping to find some order in complexity or to discover a path to new knowledge. Instead, patrons themselves become the integration point for a complex set of systems: they juggle resource discovery by interacting with multiple systems, either overtly or via federated search, and they contend with multiple vendor sites to retrieve articles of interest.
Contrast this with Google's simple approach to content discovery: a user enters a few terms in a single box, and Google returns a large list of results spanning the Internet, placing the most relevant results at the top. No one would suggest using Google for all research needs, but its simplicity and its recognized ability to answer routine searches are compelling. How, we wonder, can we bring a bit of Google to the library world?
Google harvests vast quantities of data from the web. This data aggregation is obviously complex. How does Google make sense of it all so that it can offer searchers the most relevant results? Answering this question requires understanding what Google is doing, which requires a working knowledge of graph theory. We can then apply these lessons to library systems, make sense of voluminous bibliometric data, and give researchers tools that are as effective for them as Google is for web surfers. Just as web surfers want to know which sites are most relevant, researchers want to know which of the relevant results are the most reliable, the most influential, and of the highest quality. Can quantitative metrics help answer these qualitative questions?
The more deeply libraries and librarians can mine relationships between articles and authors and between subjects and institutions, the more reliable are their metrics. Suppose some librarians want to compare the relative influence of two authors. They might first look at the authors' respective number of publications. But are those papers of equally high quality? They might next count all citations to those papers. But are the citing articles of high quality? Deeper still, they might assign different weights to each citing article using its own number of citations. At each step, whether realizing it or not, they are applying graph theory. With deeper knowledge of this subject, librarians can embrace complexity and harness it for research tools of powerful simplicity.
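To make this progression concrete, consider a minimal sketch in Python (all authors, paper identifiers, and citation links below are invented for illustration) that computes the three successively deeper metrics just described: publication counts, raw citation counts, and citation counts weighted by how often the citing articles are themselves cited.

```python
# Hypothetical bibliographic data: papers by two authors, plus a map
# from each paper to the papers that cite it. All identifiers invented.
papers_by_author = {
    "Smith": ["smith-2001", "smith-2005"],
    "Doe":   ["doe-2002"],
}

cited_by = {
    "smith-2001": ["jones-2003", "lee-2004"],
    "smith-2005": ["lee-2004"],
    "doe-2002":   ["jones-2003", "lee-2004", "roe-2006"],
    "jones-2003": ["lee-2004", "roe-2006"],
    "lee-2004":   [],
    "roe-2006":   [],
}

def publication_count(author):
    return len(papers_by_author[author])

def citation_count(author):
    return sum(len(cited_by[p]) for p in papers_by_author[author])

def weighted_citation_count(author):
    # Weight each citation by 1 + the citing paper's own citation
    # count, so citations from well-cited papers count for more.
    return sum(1 + len(cited_by.get(citing, []))
               for paper in papers_by_author[author]
               for citing in cited_by[paper])

for author in papers_by_author:
    print(author, publication_count(author),
          citation_count(author), weighted_citation_count(author))
```

Each function looks one step deeper into the citation graph; eigenvector centrality, discussed below, carries this recursion to its limit.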
PageRank and the Global Giant Graph
Indexing the web is a massive challenge. The Internet is a network of computer hardware resources so complex that no one really knows exactly how it is structured. In fact, researchers have resorted to conducting experiments to discern the structure and size of the Internet and its potential vulnerability to attacks. Representations of the data collected by these experiments are based on network science, which is grounded in graph theory. This is not the same network that ties all the computers on the Internet together, though at first glance it is a similar idea. Network science is a technique for representing the relationships between components of a complex system. (3) It uses graphs, which consist of nodes and edges, to represent these sets of relationships.
[FIGURE 1 OMITTED]
Generally speaking, a node is an actor or object of some sort, and an edge is a relationship or property. In the case of the web, uniform resource locators (URLs) can be thought of as nodes, and connections between pages can be thought of as links or edges. This may sound familiar because the Semantic Web is largely built around the idea of graphs, where each pair of nodes with a connecting edge is referred to as a triple. In fact, Tim Berners-Lee refers to the Semantic Web as the Global Giant Graph--a place where statements of facts about things are published online and distinctly addressable, just as webpages are today. (4)
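Because each node-edge-node statement is a triple, a graph can be sketched with nothing more than a list of tuples. The subjects, predicates, and URIs below are hypothetical examples (loosely modeled on Dublin Core and CiTO vocabulary), not data from any real system:

```python
# A tiny graph expressed as (subject, predicate, object) triples.
# The URIs and predicate names are invented for illustration.
triples = [
    ("http://example.org/article/42", "dc:creator", "http://example.org/person/smith"),
    ("http://example.org/article/42", "dc:title",   "Graphs in Libraries"),   # literal object
    ("http://example.org/article/42", "cito:cites", "http://example.org/article/7"),
]

# Each triple is one labeled edge: subject and object are nodes,
# and the predicate names the relationship between them.
for s, p, o in triples:
    print(f"{s} --[{p}]--> {o}")
```

Note that an object may be another node (a URI) or a literal value such as a title string.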
The Semantic Web differs from the traditional web in its use of ontologies that place meaning on the links and in the expectation that nodes are represented by uniform resource identifiers (URIs) or by literal (string, integer, etc.) values. Figure 1 illustrates the difference: links that are unlabeled in a web graph carry explicit meaning in the Semantic Web.
Because Semantic Web data form a graph, graph analysis techniques can be applied to semantic graphs just as they are applied to representations of other complex systems, such as social networks, cellular metabolic networks, and ecological food webs. Herein lies the secret behind Google's success: Google builds a graph representation of the data it collects. These graphs play a large role in determining what users see in response to any given query.
Google uses a graph analysis technique called eigenvector centrality. (5) In essence, Google calculates the relative importance of a given webpage as a function of the importance of the pages that point to it. A simpler measure, degree centrality, is simply a count of the number of edges a given node has. In a social network, degree centrality might tell you how many friends a given person has. If a person has edges representing friendship that connect him to seventeen other nodes, representing other people in the network, then his degree value is seventeen (see figure 2). If a person with seventeen friends has more friendship edges than any other person in the network, then he has the highest degree centrality for that network.
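Degree centrality is straightforward to compute from an adjacency list. The following sketch, using an invented friendship network, simply counts each node's edges:

```python
# A small, invented friendship network as an adjacency list:
# each person maps to the set of people they are connected to.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice"},
    "carol": {"alice", "dave"},
    "dave":  {"alice", "carol"},
}

# Degree centrality is just the number of edges per node.
degree = {person: len(neighbors) for person, neighbors in friends.items()}

# alice has the most friendship edges, so she has the highest
# degree centrality in this network.
print(max(degree, key=degree.get), degree)
```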
Eigenvector centrality expands on degree centrality. Consider a social network that represents the amount of influence a person has in a business context. If we want to analyze this aspect of the network, then it makes sense to account for the fact that some relationships are more influential than others. For example, a relationship with the president of the company is more significant than a relationship with a coworker, since it is a safe assumption that a direct relationship with the company leader will increase influence. So we assign weights to the edges based on the person each edge connects to.
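A standard way to compute eigenvector centrality is power iteration: every node starts with the same score, and each node's score is repeatedly recomputed from its neighbors' scores until the values stabilize. The sketch below runs this on an invented influence network; it is a bare-bones illustration, and production systems such as PageRank add further refinements.

```python
# Invented, undirected influence network: an edge means two people
# have a working relationship.
links = {
    "president": {"manager_a", "manager_b"},
    "manager_a": {"president", "manager_b", "alice"},
    "manager_b": {"president", "manager_a", "bob"},
    "alice":     {"manager_a", "bob"},
    "bob":       {"manager_b", "alice"},
}

score = {person: 1.0 for person in links}

# Power iteration: each score is recomputed as the sum of the
# neighbors' scores, then rescaled. Ties to high-scoring people
# end up worth more than ties to peripheral ones.
for _ in range(100):
    new = {person: sum(score[n] for n in neighbors)
           for person, neighbors in links.items()}
    norm = max(new.values())          # rescale to prevent overflow
    score = {person: s / norm for person, s in new.items()}

# The president and alice have the same degree (two edges each),
# but the president scores higher because his ties are to the
# well-connected managers.
for person, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {s:.3f}")
```

This is exactly the intuition behind weighting edges by whom they connect to: the same number of relationships yields more influence when those relationships reach well-connected people.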
Google does something similar. All the webpages it tracks have centrality values, but Google's weighting algorithm …