As information technology (IT) advances, the number of products released, and of their associated documents, increases at a rapid pace. Technical authors, and in particular authors of documents for product support, often use terms and words that are not found in general-purpose dictionaries. Moreover, the meaning given to a technical term may vary within the technical community.
Terminology frequently changes with the introduction of new products to the market. The terminology database in a technical support organization is perhaps one of the most frequently updated databases of this kind. The IBM Technical Support knowledge base, for example, contains specifications, problem descriptions, proposed solutions, and updates on thousands of hardware and software products.
Glossaries can alleviate this problem. Glossaries help build a language common to people who search for information and people who author documents, thus increasing the effectiveness of search and retrieval systems. Because glossaries are large and change rapidly, generating them manually is costly. We describe in this paper the results of our investigation into automated glossary extraction. We also describe how we used an improved glossary extraction process to build and deploy a number of glossaries within the IBM Technical Support system used by customers. The business justification for building glossaries is to increase customer satisfaction with the IBM Technical Support Web site.
Technical documents pertaining to IBM products and services are processed, indexed, and stored in a master repository, known as the electronic support knowledge base (eSKB) or the knowledge repository, which contains about a million documents in several languages. We used the eSKB as the corpus for our glossary extraction process, which integrates a number of tools and components into a complete solution.
The effectiveness of glossary extraction depends strongly on domain-specific resources, such as dictionaries, and on the rules that recognize labels or error codes, such as APAR (authorized program analysis report) numbers or SQL (Structured Query Language) errors. Glossary extraction processes that are trained on general corpora such as TREC (Text Retrieval Conference) (1) and do not take a specific domain, such as technical support, into account produce less useful glossaries for technical support applications. We found that domain-focused glossary extraction, in which term weights depend on document context, improves the effectiveness of the glossary.
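As an illustration, such label-recognition rules can be expressed as regular expressions. The patterns below are our own approximations of APAR and SQL error-code formats, not the actual rules used in the IBM pipeline:

```python
import re

# Approximate patterns for support-label codes; the exact rules used by
# the glossary extraction process are not given in this paper.
LABEL_PATTERNS = {
    "APAR": re.compile(r"\b[A-Z]{2}\d{5}\b"),          # e.g., IY12345
    "SQL_ERROR": re.compile(r"\bSQL\d{4,5}[A-Z]?\b"),  # e.g., SQL0204N
}

def extract_labels(text):
    """Return (label_type, code) pairs for every label found in the text."""
    hits = []
    for label, pattern in LABEL_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

Terms matched by rules of this kind can then be weighted or filtered separately from ordinary vocabulary during extraction.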
In this paper we show several ways to improve the usefulness of the glossary and to make it more effective and robust for technical-support applications. To demonstrate the value of our approach we implemented Keyword Analyzer (KWA), an application that identifies salient terms in a document by using weighted terms from the glossary.
The rest of the paper is structured as follows. In the next section we present an overview of the architecture of the information search and retrieval system used by IBM Technical Support. Then, we summarize the approach to glossary extraction from Reference 2, which we will use as our starting point. In the following section we describe our implementation of the automated glossary extraction process for the technical support corpus and discuss the results we obtained. We observe that implementing the glossary extraction process without considering the specifics of the domain may lead to some erroneous results, and consequently, we present suggestions for improvement. Next we introduce KWA, the application we use to evaluate the effectiveness of our approach. We then propose the concept of a domain-focused glossary, in which glossary items are selected and ranked based on context, and we show some quantitative results from our tests. In this section we also discuss a possible application of the domain-focused glossary: the improvement of document-relevancy ranking in corporate search systems. We summarize our work in the concluding section.
Overview of the IBM Technical Support Enablement Architecture
The IBM Technical Support Enablement Architecture, whose implementation is nicknamed dBlue, is an advanced information search and delivery architecture for the Web-based system used by IBM Technical Support. (3) One of the goals of this system is to help customers find the desired information among the 2.5 million Web pages stored on the system. The dBlue system, which integrates effective technologies for storing, searching, and retrieving information, provides a set of user-oriented support services used by all IBM support sites.
The architecture connects three important types of elements from the information search world: information sources, search engines, and end users (see Figure 1). This is done through a set of components called the Knowledge Builder, which includes a content creation layer (blue blocks), a search management layer (green blocks), and a presentation management layer (red blocks). Information sources are any structured and unstructured data sources, such as document repositories, DB2 and Lotus Notes databases, Web sites, and so forth.

The first challenge of the architecture was to institute a consistent structure for content creation, because the huge amount of support content that already existed was not well suited for searching. Then, of course, both existing and new content had to be migrated to this structure. The second challenge was determining how to store this information in a way that was scalable and flexible. The third challenge was how to retrieve it dynamically and efficiently.

The main blocks of this architecture are shown in Figure 1. Content is extracted from the information sources using the Content Extractor and mapped to a unified XML (eXtensible Markup Language) schema. Then it is processed by the Content Processor and stored in the eSKB. The search management layer enables the connection between the Knowledge Builder and search engines. The Query Manager and Query Builder are responsible for processing search queries, collecting query-related parameters from the configuration management layer, and building the final search query. The presentation management layer provides several levels of customization, based on country, organizational unit, and individual user profiles. The View Builder constructs a customized view of search hitlists and documents. When a user requests a view of a specific document, the View Builder processes the request, accesses the eSKB to retrieve the document content, and builds a coherent document view.
[FIGURE 1 OMITTED]
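The mapping of heterogeneous source records onto the unified schema can be sketched as follows; the element names are hypothetical, since the actual dBlue schema is not reproduced in this paper:

```python
import xml.etree.ElementTree as ET

def to_unified_xml(record):
    """Map a raw source record (a dict) onto one unified XML layout.
    The element names used here are illustrative only; the real dBlue
    schema is not published in this paper."""
    doc = ET.Element("document")
    for field in ("id", "title", "body", "source"):
        ET.SubElement(doc, field).text = str(record.get(field, ""))
    return ET.tostring(doc, encoding="unicode")
```

In this way, records from any source (DB2 tables, Lotus Notes documents, Web pages) are reduced to the same shape before the Content Processor stores them in the eSKB.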
At the content creation layer the architecture enables the use of text analysis tools, technologies and resources, including the Technical Support glossary, for enhancing document content and improving the search experience. The Technical Support glossary, for instance, may be directly used as the vocabulary of domain-specific terms for spell-checking by various content management and search applications. The text analysis tools include keyword analysis and content summarization, briefly described in this section. The text analysis tools are also responsible for keeping the Technical Support glossary synchronized with the Technical Support corpus. Additional information on the dBlue system can be found in Reference 4.
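As an example of glossary-backed spell-checking, out-of-vocabulary query words can be matched against glossary terms with standard string similarity; the glossary sample below is invented for illustration:

```python
import difflib

# A tiny, invented sample of glossary terms; the real Technical Support
# glossary is far larger and is generated automatically.
GLOSSARY_TERMS = ["websphere", "datastage", "apar", "fix pack", "db2"]

def suggest(word, cutoff=0.75):
    """Suggest up to three glossary terms close to a misspelled query word."""
    return difflib.get_close_matches(word.lower(), GLOSSARY_TERMS,
                                     n=3, cutoff=cutoff)
```

A search application could offer such suggestions ("did you mean ...?") when a query word is absent from the domain vocabulary.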
The purpose of the keyword analysis utility is to identify domain-specific salient single-word or multiword terms in the document. The identified keywords are stored in the repository together with the corresponding documents and used as document meta-data for improving the search experience. A content summarization utility is used to create meaningful summaries of the stored documents. These summaries, which are stored in the repository along with the documents, are also displayed on search-results pages, thus further improving the search experience. We describe next the keyword analysis and content summarization techniques and their use in the dBlue system. For more details about the content summarization technique see Reference 5. Evaluation of different types of summarization and discussion of a related usability study with regard to the IBM Technical Support corpus can be found in Reference 6.
To identify domain-specific salient terms (keywords), each incoming document is processed by KWA. This utility is based on the TALENT (7) text analysis engine (TAE) configured to work with the Technical Support glossary. TALENT (Text Analysis and Language Engineering Technology) is a general suite of text analysis tools developed at the IBM Thomas J. Watson Research Center that recognizes significant objects in text (such as names, terms, relations, parts of speech, and abbreviations) and annotates documents with this information (for more details see Reference 7). The TALENT TAE analyzes the content of the input document, identifies single- and multiword terms along with their variants (using the Technical Support glossary as the reference vocabulary), and returns the keywords and their variants, together with the domain-specific confidence level that the Technical Support glossary assigns to each.
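A minimal sketch of such glossary-based term identification, assuming a glossary that maps each surface form to a canonical term and a confidence weight (the terms and weights below are invented; the actual KWA relies on the TALENT engine):

```python
# Invented glossary: surface form -> (canonical term, confidence weight).
GLOSSARY = {
    "apar": ("APAR", 0.95),
    "authorized program analysis report": ("APAR", 0.95),
    "fix pack": ("fix pack", 0.80),
    "db2": ("DB2", 0.90),
}

def identify_keywords(text, max_ngram=4):
    """Scan word n-grams of the text against the glossary and return
    (canonical term, confidence) pairs, highest confidence first."""
    words = [w.strip(".,;:()") for w in text.lower().split()]
    hits = {}
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in GLOSSARY:
                term, conf = GLOSSARY[candidate]
                hits[term] = max(conf, hits.get(term, 0.0))
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
```

Note that variants ("apar" and its expansion) collapse to one canonical keyword, mirroring how the TAE groups a term with its variants.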
In a similar fashion, KWA returns a sorted list of identified domain-specific keywords, along with their variants, to the content processing layer. After the keywords are stored in the knowledge repository as document meta-data, they are indexed by the search engine indexer, along with the document content itself, to enable keyword-based search as well as some other …