stoimen's web log

Introduction

No matter how fast today’s computers and networks are, the users will constantly need faster and faster services. To reduce the volume of the transferred data we usually use some sort of compression. That is why this computer sciences area will be always interesting to research and develop.

There are many data compression algorithms, some of them lossless, others lossy, but their main goal aways will be to spare storage space and traffic. These algorithms are very useful when talking about data transfer between two distant places. Perhaps the best example is the transfer between a web server and a browser.

In the last few years a lot of research has been done on compressing files, executed on the client side. Such files are javascript, css, htmls and images. In fact servers and clients already have some techniques to compress data, like using GZIP for instance, that can dramatically decrease the transfer. In the other hand there are lots of tools and tricks in order to decrease the size of the data.

Actually when a file is executed by the client’s virtual machine, it doesn’t matter how “beautifully” it is formatted from a programmer’s point of view. Thus the spaces, tabs and the new lines don’t bring any significant information for the environment. That is why such compressing tools like YUI Compressor, Google Closure Compiler, etc. remove those symbols. Well, they can achieve even more in order to improve the compression rate. In this post I won’t cover this, but this shows how important data compression algorithms are.

It would be great if we could just compress data with some tool. Unfortunately this is not the case and usually the compression rate depends on the data itself. It is obvious that the choice of data compression algorithm depends mainly on the data and first of all we must explore the data.

Here I’ll cover one very simple lossless data compression algorithm called “run-length encoding” that can be very useful in some cases.

Overview

This algorithm consists of replacing large sequences of repeating data with only one item of this data followed by a counter showing how many times this item is repeated. To become clearer let’s see a string example.

aaaaaaaaaabbbaxxxxyyyzyx

This string’s length is 24 and as we can see there are lots of repetitions. Using the run-length algorithm, we replace any run with shorter string followed by a counter.

a10b3a1x4y3z1y1x1

The length of this string is 17, which is approximately 70% of the initial length. Obviously this is not the optimal way to compress the given string. For instance we don’t need to use the digit “1” when the character is repeated only once. In some cases this approach can increase the length of the initial string which is exactly the opposite of what we need. In this case we’ll get the string bellow.

a10b3ax4y3zyx

Now the length of the resulting string is 13, which is 54% of the initial length! A variation of the example above is not to keep a counter of the repetitions of the character, but their position instead. Thus the initial string will be compressed as follows.

a0b10a13x14y18z21y22x23

Which of these two approaches you’ll use depends on the goal. In the second case we can achieve a good optimization of binary search.

It is clear that this algorithm is not only applicable on strings. We can achieve very good results on arrays. A typical example is the transfer of JSON from a server to a client. Then if there are large sequences of repeating data we can achieve great results.

Implementation

The implementation bellow is assuming that we’re compressing a string and it’s written on PHP. However the nature of this algorithm doesn’t restrict us to use only strings. As I said before with slight modifications we can use it with other data structures. It is important only to understand that the run-length algorithm is very useful on large sequences of repeating elements, no matter characters or array items.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
	$i = $j = 0;
	$prev = '';
	$output = '';
 
	while ($msg[$i]) {
		if ($msg[$i] != $prev) {
 
			if ($i) 
				$output .= $j;
 
			$output .= $msg[$i];
 
			$prev = $msg[$i];
 
			$j = 0;
		}
		$j++;
		$i++;
	}
 
	$output .= $j;
 
	return $output;
}
 
// a10b3a1x4y3z1y1x1
echo run_length_encode($message);

And slightly optimized.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
	$i = $j = 0;
	$prev = '';
	$output = '';
 
	while ($msg[$i]) {
		if ($msg[$i] != $prev) {
 
			if ($i && $j > 1) 
				$output .= $j;
 
			$output .= $msg[$i];
 
			$prev = $msg[$i];
 
			$j = 0;
		}
		$j++;
		$i++;
	}
 
	if ($j > 1)
		$output .= $j;
 
	return $output;
}
 
// a10b3ax4y3zyx
echo run_length_encode($message);

Finally a small change – now we store the position of the character.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
	$i = 0;
	$prev = '';
	$output = '';
 
	while ($msg[$i]) {
		if ($msg[$i] != $prev) {
 
			$output .= $msg[$i] . $i;
 
			$prev = $msg[$i];
 
		}
 
		$i++;
	}
 
	return $output;
}
 
// a0b10a13x14y18z21y22x23
echo run_length_encode($message);

Complexity and Data Compression

We’re used to talk about complexity of an algorithm measuring time and we usually try to find the fastest implementation, like in search algorithms. Here it is not so important to compress data quickly, but to compress as much as possible so the output is as small as possible without lossing data. A great feature of run-length encoding is that this algorithm is easy to implement.

Application

We can use run-length encoding in many cases. It is commonly used to compress images and is very successful when we deal only with black and white images. Here I’ll cover another use case that I only mentioned above. Let’s say we have to transfer a very large array of data to our AJAX-powered application using JSON. Let’s say also that the data are some years, for instance the years of the premiere of a movie. There are lots of movies with a premiere in the same year, thus although the data is sorted, we actually can’t have any benefit. More important is that we have large sequences of data. Here we can use run-length encoding.

$data = array(
	0 	=> 1991,
	1 	=> 1991,
	...
	2223 	=> 1991,
	2224 	=> 1992,
	...
	19298 	=> 1995,
	19299 	=> 1996,
	...
);

As you can see to transfer the whole array can be a nightmare, especially on slow networks. It is better to compress it (i.e. with PHP’s json_encode).

// {"0":1991,"1":1991, ..., "2223":1991,"2224":1992, ..., "19298":1995,"19299":1996, ...}
echo json_encode($data);

After running run-length encoding we can receive something like the following array (note that these are only sample data and it’s up to you to decide which is the best format to store data).

$data = array(
	0 => array(1991, 2224),
	1 => array(1992, 3948),
	2 => array(1995, 2398),
	3 => array(1996, 3489),
);

And the JSON output.

// [[1991,2224],[1992,3948],[1995,2398],[1996,3489]]
echo json_encode($data);

Note that if the data is sorted we can achieve great success compressing it!!! This approach can be used for images, graphics or map coordinates.

This is only one example of how data compression can be useful in our daily work. Although the communication between the server and the client can be optimized and compressed, we can improve it. In other words we’re not always sure that the opposite side supports compression.

Well, it’s true that the client has to decompress the data, which can also be slow. Now in the first case we have only the time to transfer, as on the diagram bellow.

Data Transfer Without Compression — Time to transfer data without compression!

In the second case, we should sum the time for compression, transfer and decompression.

Data Transfer with Compression — Time to send data with compression!

All this is important, but in general data compression can be handy in many cases in our daily work.

16 thoughts on “Computer Algorithms: Data Compression with Run-length Encoding”

Giovanni says:

February 4, 2012 at 1:35 am

Thank you very much, I really like this series of articles on data compression.
Matthew Coleman says:

March 13, 2012 at 5:02 pm

You stated that in these cases we aren’t concerned about the speed of compression, but it’s definitely acceptable how it is — Theta(n). I’m going to read the rest of your articles, but I’m curious about instances where you would need to compress data very quickly — wouldn’t some scenarios of client-server response for a website need quick compression, probably ideally in a O(c) range?

Great series, thanks!
aditya says:

March 27, 2012 at 8:44 pm

hey i am trying to perform rle on 10 64-bit numbers in mips do you have any clue how its done.Any help is appreciated..
aditya says:

March 27, 2012 at 8:46 pm

btw your article was useful….
www.itrentals.com says:

April 30, 2012 at 2:57 pm

Although the topic of algorithms and data compression is fairly complicated, I found your article quite informative and easy to read.
huda says:

July 25, 2012 at 3:06 pm

thank u very much but could u give me the flowchart of execution this algorithem and if u dont mind could u give the program in matlab or give me the website that could help me find program to execute this type .
Stoimen says:

July 25, 2012 at 3:09 pm

Well you can code the program using RLE easily. However it depends on the input and the desired solution.
dawn says:

August 22, 2012 at 4:15 am

Thank u for give us the explaination about the RLC so completely. And would u mind give some references about this? I will appreciate u will do this!!!
hassan says:

February 25, 2013 at 9:48 am

hello we are would like the program in matlab if you want that
thank you for your lesten
tetsuo says:

November 25, 2014 at 4:48 pm

Hi! I was wondering if the algorithm to encode the position instead of length has a name cause this is the first time I read about it and I can’t find mentions to it anywhere.

TIA
Hajar says:

January 6, 2015 at 12:54 am

Amazing Article , Thanks alot .
Alka Sawlikar says:

September 6, 2016 at 1:10 pm

I need pure RLE lossless compression Kindly send on my mail ID
Andy says:

October 9, 2016 at 5:00 pm

I like RLE. I am using it on base 64 encoded strings which usually have large sections of repeated printable characters. I my implementation I use non-printable characters as aliases to counts. Non-printable characters are guaranteed to not exist in base 64 encoded strings. This helps the bloating of text that does not lend itself well to RLE.
powturbo says:

January 27, 2017 at 9:39 pm

you or other readers might be interested to have a look at TURBRLE: Efficient and fast Run Length Encoding incl. automatic escape determination, SIMD and zero byte overhead:
https://github.com/powturbo/TurboRLE
powturbo says:

April 15, 2017 at 9:14 pm

you or other readers might be interested to have a look at
https://github.com/powturbo/TurboRLE
Yoga says:

December 5, 2017 at 8:01 am

can you make the decode function please?