Computer Algorithms: Morris-Pratt String Searching

Introduction

We saw that neither brute force string searching nor Rabin-Karp string searching are effective. However in order to improve some algorithm, first we need to understand its principles in detail. We know already that brute force string matching is slow and we tried to improve it somehow by using a hash function in the Rabin-Karp algorithm. The problem is that Rabin-Karp has the same complexity as brute force string matching, which is O(mn).

Obviously we need a different approach, but to come with a different approach let’s see what’s wrong with brute force string searching. Indeed by taking a closer look at its principles we can answer the question.

In brute force matching we checked each character of the text with the first character of the pattern. In case of a match we shifted the comparison between the second character of the pattern and the next character of the text. The problem is that in case of a mismatch we must go several positions back in the text. Well in fact this technique can’t be optimized.

Morris-Pratt brute force string matching
In brute force string matching in case of a mismatch we go back and we compare characters that has been compared already!

As you can see on the picture above the problem is that once there is a mismatch we must rollback and start comparing from a position in the text that has been explored already. In our case we have checked the first, second, third and fourth letters, where there is a mismatch between the pattern and the text and then … we go back and start comparing from the second letter of the text.

This is completely useless, because we already know that the pattern begins with the letter “a” and no such letter happens to be between positions 1 and 3. So how can we improve this redundancy?

Overview

The answer of the question came to James H. Morris and Vaughan Pratt in 1977 when they described their algorithm, which by skipping lots of useless comparisons is more effective than brute force string matching. Let’s see it in detail. The only thing is to use the information gathered during the comparisons of the pattern and a possible match, as on the picture below.

Morris-Pratt basic principles
Morris-Pratt skips some comparisons by moving ahead to the next possible position of a match!

To do that first we have to preprocess the pattern in order to get possible positions for next matches. Thus after we start to find a possible match in case of a mismatch we’ll know exactly where we should jump in order to skip unusual comparisons.

Generating the Table of Next Positions

This is the tricky part in Morris-Pratt and that is how this algorithm overcomes the disadvantages of brute force string searching. Let’s see some pictures.

Morris-Pratt with no repeating letters in the pattern
It is clear that if the pattern consists only of different letters in case of a mismatch we should start comparing the next character of the text with the first character of the pattern!

However in case of repeating character in the pattern if we have a mismatch after that character a possible match must begin from this repeating character, as on the picture bellow.

Morris-Pratt with one repeating letter in the pattern
The next table is slightly different if the pattern has repeating character!

Finally if there are more than one repeating character in the text the “next” table will consist show their position.

Morris-Pratt more than one repeating letter in the pattern
The next table contains the positions of repeating letters!

After we have this table of possible “next” positions we can start exploring the text for our pattern.

Morris-Pratt

Implementation

Implementing Morris-Pratt isn’t difficult. First we have to preprocess the pattern and then perform the search. The following PHP code shows you how to do that.

/**
 * Pattern
 * 
 * @var string
 */
$pattern = 'mollis';
 
/**
 * Text to search
 * 
 * @var string
 */
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque eleifend nisi viverra ipsum elementum porttitor quis at justo. Aliquam ligula felis, dignissim sit amet lobortis eget, lacinia ac augue. Quisque nec est elit, nec ultricies magna. Ut mi libero, dictum sit amet mollis non, aliquam et augue';
 
/**
 * Preprocess the pattern and return the "next" table
 * 
 * @param string $pattern
 */
function preprocessMorrisPratt($pattern, &$nextTable)
{
	$i = 0;
	$j = $nextTable[0] = -1;
	$len = strlen($pattern);
 
	while ($i < $len) {
		while ($j > -1 && $pattern[$i] != $pattern[$j]) {
			$j = $nextTable[$j];
		}
 
		$nextTable[++$i] = ++$j;
	}
}
 
/**
 * Performs a string search with the Morris-Pratt algorithm
 * 
 * @param string $text
 * @param string $pattern
 */
function MorrisPratt($text, $pattern)
{
	// get the text and pattern lengths
	$n = strlen($text);
	$m = strlen($pattern);
	$nextTable = array();
 
	// calculate the next table
	preprocessMorrisPratt($pattern, $nextTable);
 
	$i = $j = 0;
	while ($j < $n) {
		while ($i > -1 && $pattern[$i] != $text[$j]) {
			$i = $nextTable[$i];
		}
		$i++;
		$j++;
		if ($i >= $m) {
			return $j - $i;
		}
	}
	return -1;
}
 
// 275
echo MorrisPratt($text, $pattern);

Complexity

This algorithm needs some time and space for preprocessing. Thus the preprocess of the pattern can be done in O(m), where m is the length of the pattern, while the search itself needs O(m+n). The good news is that you can do the preprocess only once and then perform the search as many times as you wish!

The following chart shows the complexity O(n+m) compared with O(nm) for 5 letter patterns.

Morris-Pratt complexity
After pre-processing with O(m) the complexity of searching is O(n+m). You can see on the chart how effective is Morris-Pratt string searching compared to brute force string searching!

Application

Why it’s cool

  1. Its searching complexity is O(m+n) which is faster than brute force and Rabin-Karp
  2. It’s fairly easy to implement

Why it isn’t cool

  1. It needs additional space and time – O(m) for pre-processing
  2. It can be optimized a bit (Knuth-Morris-Pratt)

Final Words

Obviously this algorithm is quite useful because it improves in some very elegant manner the brute force matching. In the other hand you must know that there are faster string searching algorithms like the Boyer-Moore algorithm. However the Morris-Pratt algorithm can be quite useful in many cases, so understanding its principles can be very handy.

3 thoughts on “Computer Algorithms: Morris-Pratt String Searching

  1. Hello Stoimen,

    First, thanks for your posts about string searching algorithms. You did a great job!

    But I have some queries about some pictures in this post:
    In the picture for pattern-with-no-repeating-char, when the mismatch happened at ‘d’, I think, rather than shift one char to the right in the text, it should compare with ‘a’ first. The mismatch only means ‘d’ != ‘c’, but it doesn’t mean ‘d’ != ‘a’. Also, according to the next table, ‘d’ should compare to position-0 which is ‘a’.
    In the picture for pattern-with-one-repeating-char, shall the text be ‘ababad’?

    Let me know if I misunderstand anything…

    PS: I’m translating your post into Chinese so that it can be read by more people. Thanks again for making such a great post about this algorithm. :D

    Best,
    ~Sophie

  2. Hi Stoimen
    This post is totally awesome. The way you have broken down complex concepts into simple words is great. But I have a small problem understanding this code snippet:
    while ($i -1 && $pattern[$i] != $pattern[$j]) {
    $j = $nextTable[$j];
    }

    $nextTable[++$i] = ++$j;
    }
    Won’t the Inner while lead to an Infinite loop? Can you please explain this?

    Keep up the good work
    Cheers

  3. while ($j > -1 && $pattern[$i] != $pattern[$j]) {
    	$j = $nextTable[$j];
    }
     
    $nextTable[++$i] = ++$j;

    Can you explain me this? this loop goes infinite….and I could not run ……can you clarify to me?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>