AccessMyLibrary provides FREE access to over 30 million articles from top publications available through your library.

A Guided Tour to Approximate String Matching.(text processing)

ACM Computing Surveys

| March 01, 2001 | NAVARRO, GONZALO | COPYRIGHT 2001 Association for Computing Machinery, Inc. This material is published under license from the publisher through the Gale Group, Farmington Hills, Michigan.  All inquiries regarding rights should be directed to the Gale Group. (Hide copyright information)Copyright

1. INTRODUCTION

This work focuses on the problem of string matching that allows errors, also called approximate string matching. The general goal is to perform string matching of a pattern in a text where one or both of them have suffered some kind of (undesirable) corruption. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.

The problem, in its most general form, is to find a text where a text given pattern occurs, allowing a limited number of "errors" in the matches. Each application uses a different error model, which defines how different two strings are. The idea for this "distance" between strings is to make it small when one of the strings is likely to be an erroneous variant of the other under the error model in use.

The goal of this survey is to present an overview of the state of the art in approximate string matching. We focus on online searching (that is, when the text cannot be preprocessed to build an index on it), explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We also consider some variants of the problem. We present a number of experiments to compare the performance of the different algorithms and show the best choices. We conclude with some directions for future work and open problems.

Unfortunately, the algorithmic nature of the problem strongly depends on the type of "errors" considered, and the solutions range from linear time to NP-complete. The scope of our subject is so broad that we are forced to narrow our focus on a subset of the possible error models. We consider only those defined in terms of replacing some substrings by others at varying costs. In this light, the problem becomes minimizing the total cost to transform the pattern and its occurrence in text to make them equal, and reporting the text positions where this cost is low enough.

One of the best studied cases of this error model is the so-called edit distance, which allows us to delete, insert and substitute simple characters (with a different one) in both strings. If the different operations have different costs or the costs depend on the characters involved, we speak of general edit distance. Otherwise, if all the operations cost 1, we speak of simple edit distance or just edit distance (ed). In this last case we simply seek for the minimum number of insertions, deletions and substitutions to make both strings equal. For instance ed (" survey, "" surgery") = 2. The edit distance has received a lot of attention because its generalized version is powerful enough for a wide range of applications. Despite the fact that most existing algorithms concentrate on the simple edit distance, many of them can be easily adapted to the generalized edit distance, and we pay attention to this issue throughout this work. Moreover, the few algorithms that exist for the general error model that we consider are generalizations of edit distance algorithms.

On the other hand, most of the algorithms designed for the edit distance are easily specialized to other cases of interest. For instance, by allowing only insertions and deletions at cost 1, we can compute the longest common subsequence (LCS) between two strings. Another simplification that has received a lot of attention is the variant that allows only substitutions (Hamming distance).

Related articles from newspapers, magazines, journals, and more
Publication No. WO/2009/094649 Published on July 30, Assigned to SRA...
News wire article from: US Fed News Service, Including US State News August 4, 2009 700+ words
...developed a system and method for variant string matching. The patent has been assigned to...product, and system for variant string matching. A computer implemented method for variant string matching may comprise comparing with a computing...
Publication No. WO/2009/087757 Published on July 16, Assigned to Mitsubishi...
News wire article from: US Fed News Service, Including US State News July 20, 2009 700+ words
...of Japan, have developed a string matching section that identifies categories...present invention explains that "string matching section identifies categories...input documents by means of the string matching of the input documents and...
Patent No. 7,574,742 Issued on Aug. 11, Assigned to Industrial Technology...
News wire article from: US Fed News Service, Including US State News August 12, 2009 700+ words
...Felix Wu of Davis, Calif.,have developed a method of string matching for uniform data classification. The inventors were issued...groups match those of the input strings. In one aspect, the string matching further comprises matching the input strings with the signature...
Gene-IT Announces Gene Sequence Patent Strength Service Featuring GenePAST...
Press release article from: Business Wire December 2, 2002 700+ words
...Unlike traditional techniques, the GenePAST algorithm is based on an approximate string-matching algorithm that uses an objective measure - the string "edit distance" - to indicate the difference between two sequences. Thus, it produces exactly...
Patent No. 7,610,281 Issued on Oct. 27, Assigned to Oracle International for...
News wire article from: US Fed News Service, Including US State News October 29, 2009 700+ words
...from entries in the inverted index and logic to compute an edit distance between a string associated with a query and a string associated...logic configured to provide a signal corresponding to the edit distance." The original application was filed on Nov. 29, 2006...
Pervasive Software Unveils Data Matching Solution Based on Pervasive DataRush.
Press release article from: Business Wire March 10, 2009 700+ words
...Soundex, Metaphone, Double Metaphone and sub-string * Fields comparisons using matching algorithms including Levenshtein Edit Distance, Jaro, Jaro-Winkler, Jaro-Hef, Damerau-Levenshtein, Q-gram, Positional Q-gram, prefix, suffix and exact...
DWJ Television introduces first-ever virtual edit session on the Net.
Press release article from: Business Wire October 25, 1996 700+ words
...Internet is changing the world as much as the printing press did, and the Internet virtual edit session is changing the way we edit. Distance is no longer important. This is just one small step as video makes its way to the web," says Johnson. Professionals in...
True Query: a phoenix rises from the ashes.
Magazine article from: Online Fingerman, Susan January 1, 2003 700+ words
...ardently desire this capability in Web search engines. True Query offers fuzzy searching based on the Levenshtein Distance or Edit Distance algorithm. This measures the similarity between two strings. The distance is the number of deletions, insertions, or...
For more facts and information, see all results
©2009 Gale, a part of Cengage Learning. All rights reserved.
About us | FAQs | Contact us | Privacy policy | Terms and conditions
Other Gale sites: Encyclopedia.com | HighBeam Research | Acquire Content | Books & Authors | Goliath | MovieRetriever | Smart QandA