Tag Archives: web hosting

Computer Algorithms: Data Compression with Diagram Encoding and Pattern Substitution

Overview

Two variants of run-length encoding are the diagram encoding and the pattern substitution algorithms. The diagram encoding is actually a very simple algorithm. Unlike run-length encoding, where the input stream must consists of many repeating elements, as “aaaaaaaa” for instance, which are very rare in a natural language, there are many so called “diagrams” in almost any natural language. In plain English there are some diagrams as “the”, “and”, “ing” (in the word “waiting” for example), “ a”, “ t”, “ e” and many doubled letters. Actually we can extend those diagrams by adding surrounding spaces. Thus we can encode not only “the”, but “ the “, which are 5 characters (2 spaces and 3 letters) with something shorter. In the other hand, as I said, in plain English there are two many doubled letters, which unfortunately aren’t something special for run-length encoding and the compression ratio will be small. Even worse the encoded text may happen to be longer than the input message. Let’s see some examples.

Let’s say we’ve to encode the message “successfully accomplished”, which consists of four doubled letters. However to compress it with run-length encoding we’ll need at least 8 characters, which doesn’t help us a lot.

// 8 chars replaced by 8 chars!?
input: 	"successfully accomplished"
output:	"su2ce2sfu2ly a2complished"

The problem is that if the input text contains numbers, “2” in particular, we’ve to chose an escape symbol (“@” for example), which we’ll use to mark where the encoded run begins. Thus if the input message is “2 successfully accomplished tasks”, it will be encoded as “2 su@2ce@2sfu@2ly a@2complished tasks”. Now the output message is longer!!! than the input string.

// the compressed message is longer!!!
input:	"2 successfully accomplished"
output:	"2 su@2ce@2sfu@2ly a@2complished tasks"

Again if the input stream contains the escape symbol, we have to find another one, and the problem is that it is often too difficult to find short escape symbol that doesn’t appear in the input text, without a full scan of the text. Continue reading Computer Algorithms: Data Compression with Diagram Encoding and Pattern Substitution