This is the homepage of

The inital material of Forty Years of Text Indexing, shown below, is from the original of an invited contribution to CPM 2013, the 24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany June 17-19. See CPM 2013, for additional information.

All of the papers presented at CPM 2013 can be found at SpringerLink.

For information about the special session at CMP 2013, organized by Martin Farach-Colton and S. Muthukrishnan, celebrating the invention of the Suffix Tree by Peter G Weiner in 1973, see 1973+40=2013.

Six videos of the presentations at 1973+40=2013 by Martin Colton-Farach, Peter G Weiner, Vaughan R Pratt, Edward M McCreight, and Jacob Ziv, together with a concluding Panel Session, can be viewed on Vimeo at 1973+40=2013 Videos.

Forty Years of Text Indexing

Alberto Apostolico*, Maxime Crochemore**, Martin Farach-Colton***, Zvi Galil† , and S. Muthukrishnan ‡

Abstract. This paper reviews the first 40 years in the life of textual inverted indexes, their many incarnations, and their applications. The paper is non-technical and assumes some familiarity with the structures and constructions discussed. It is not meant to be exhaustive. It is meant to be a tribute to a ubiquitous tool of string matching — the suffix tree and its variants — and one of the most persistent subjects of study in the theory of algorithms.

Keywords: pattern matching, string searching, bi-tree, suffix tree, dawg, suffix automaton, factor automaton, suffix array, FM-index, wavelet tree.

1 Prolog

When William Legrand finally decrypted the string:

‡1;48 85;4)485†528806*81(ddag9;48;(88;4(‡?34;48)4‡;161;:188;‡?;

it did not seem to make much more sense than it did before. The decoded message read: “A good glass in the bishop’s hostel in the devil’s seat forty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side shoot from the left eye of the death’s-head a bee line from the tree through the shot fifty feet out.” But at least it did sound more like natural language, and eventually guided the main character of Edgar Allan Poe’s “The Gold Bug” [56] to discover the treasure he had been after. Legrand solved a substitution cipher using symbol frequencies. He first looked for the most frequent symbol and changed it into the most frequent letter of English, then similarly treated the second most frequent symbol, and so on.

Both before and after 1843, the natural impulse when faced with some mysterious message has been to count frequencies of individual tokens or subassemblies in search of a clue. Perhaps one of the most intense and fascinating subjects for this kind of scrutiny has been bio-sequences. As soon as some such sequences became available, statistical analyses tried to link characters or blocks of characters to relevant biological function. With the early examples of whole genomes emerging in the mid 90’s, it seemed natural to count the occurrences of all blocks of size 1, 2, etc. up to any desired length, looking for statistical characterizations of coding regions, promoter regions, etc. [67]. This review is not about cryptography. It is about a data structure and its variants, and the many surprising and useful features it carries. Among these is the fact that, to set up a statistical table of occurrences for all substrings, of any length, of a text string of n characters, it only takes time and space linear in the length of the text string. While nobody would be so foolish to solve the problem by generating all exponentially many possible substrings and then counting their occurrences one by one, a text string may still contain O(n2) distinct substrings, so that tabulating all of them in linear space, never mind linear time, seems already puzzling.

Over the years, such structures have held center stage in text searching, indexing, statistics, and compression as well as in the assembly, alignment and comparison of biosequences. Their range of scope extends to areas as diverse as detecting plagiarism, finding surprising substrings in a text, testing the unique decipherability of a code, and more. Their impact on Computer Science and IT at large cannot be overstated. Text and Web searching and Bioinformatics would not be the same without them. In 2013, the Combinatorial Pattern Matching symposium celebrates the 40th anniversary of the appearance of Weiner's paper with a special session entirely dedicated to that event.

2 History Bits and Pieces

At the dawn of “stringology”, Don Knuth conjectured that the problem of finding the longest substring common to two long text sequences of total length n required (n log n) time. An O(n log n)-time had been provided by Karp, Miller and Rosenberg [40]. That construction was destined to play a role in parallel pattern matching [6, 24, 31, 32], but Knuth’s conjecture was short lived: in 1973, Peter Weiner showed that the problem admitted an elegant linear-time solution [68], as long as the alphabet of the string was fixed. Such a solution was actually a byproduct of a construction he had originally set up for a different purpose, i.e., identifying any substring of a textfile without specifying all of them. In doing so, Weiner introduced a notion of a textual inverted index that would elicit refinements, analyses and applications for forty years and counting, a feature hardly shared by any other data structure.

* * * (We stop near the bottom of page 2 out of 10 in the full published paper.)

* College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30318, USA
** King’s College London, London WC2R 2LS, UK and Universite´ Paris-Est, France
*** Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
† College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30318, USA
‡ Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA

To get access to the full published paper go to SpringerLink.