Lossless Compression of Small Strings

I recently became interested in the problem of how to compress small strings (< 100 chars) after a short conversation with a coworker and reading antirez’s Smaz library.

Lossy Compression:

As the name implies, lossy compression involves compressing data in a way that the original message can never be perfectly recovered: information is lost. This is less of a problem for media such as audio, images, and video, since it’s not the individual bits that matter but the larger pattern. In text, however, every character matters, so lossy compression isn’t much use for small strings and we’ll leave it at that.

[Thought: Would it be possible to write a lossy compression algorithm that exhaustively searches for a cellular automaton, or some similarly simple recursive process, that regenerates an approximation of the data we wish to compress? The result only has to roughly match. It could be a way to trade tons of CPU cycles for a low-quality but super lightweight image.]
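Purely to illustrate that thought, here is a brute-force sketch in Python: it searches all 256 elementary cellular automaton rules and every small seed for the combination whose evolved row comes closest (by Hamming distance) to a target bit pattern. Everything here, from the seed width to the number of generations to the target itself, is an arbitrary choice made up for the example; it's a toy, not a practical codec.

```python
from itertools import product

def step(row, rule):
    """One generation of an elementary cellular automaton (wrapping edges)."""
    n = len(row)
    return tuple(
        (rule >> (row[(i - 1) % n] * 4 + row[i] * 2 + row[(i + 1) % n])) & 1
        for i in range(n)
    )

def best_rule(target, generations=8, seed_bits=8):
    """Brute-force all 256 rules and every seed_bits-wide seed; return the
    (distance, rule, seed) whose evolved row is closest to `target`."""
    best = (len(target) + 1, None, None)
    for rule in range(256):
        for seed in product((0, 1), repeat=seed_bits):
            row = seed + (0,) * (len(target) - seed_bits)
            for _ in range(generations):
                row = step(row, rule)
            dist = sum(a != b for a, b in zip(row, target))
            if dist < best[0]:
                best = (dist, rule, seed)
    return best

# A made-up 32-bit "row of pixels" to approximate.
target = tuple(int(b) for b in "01101100111100001111001101100110")
dist, rule, seed = best_rule(target)
print(f"rule {rule} with seed {seed} gets within {dist} of {len(target)} bits")
```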

Lossless Compression Types:

Lossless compression, sometimes described as “redundancy reduction,” guarantees that the original message can be reconstructed exactly. One common approach is to find repeated patterns in the data, store those patterns in what is called a “codebook” or “dictionary,” and replace each occurrence with a short reference into it. This is known as “dictionary compression.”
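Here’s a minimal sketch of the idea in Python, loosely in the spirit of Smaz but not its actual codebook or wire format: a handful of made-up dictionary entries, one-byte codes pointing at them, and an escape byte for anything the codebook doesn’t cover.

```python
# Entries are made up for illustration and ordered longest-first, so the
# first match found is also the longest one. Codes are single bytes that
# point back into the codebook; 0xFF escapes one literal byte.
CODEBOOK = ["the ", "ing ", "and ", " of ", "http", "tion", "er", "on", "e", " "]
ESCAPE = 0xFF

def compress(text: str) -> bytes:
    data = text.encode("utf-8")
    out, i = bytearray(), 0
    while i < len(data):
        for code, entry in enumerate(CODEBOOK):
            e = entry.encode("utf-8")
            if data.startswith(e, i):
                out.append(code)           # one byte replaces the whole entry
                i += len(e)
                break
        else:
            out.extend([ESCAPE, data[i]])  # byte not covered by the codebook
            i += 1
    return bytes(out)

def decompress(blob: bytes) -> str:
    out, i = bytearray(), 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out.extend(CODEBOOK[blob[i]].encode("utf-8"))
            i += 1
    return out.decode("utf-8")

assert decompress(compress("the king and the queen")) == "the king and the queen"
```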

An Application for Small String Compression

One good application I can think of for compressing small strings is static URL shorteners. Most URL shorteners today require server-side state: a database mapping each short code back to its original URL. If we had stateless shorteners, however, you could plug your URL into them from any site, or even locally (an in-browser plugin?), and still get the same result.

The key to a good static URL shortener would be to shrink URLs to, ideally, fewer than 10 characters. My first thought was to use base64-encoded Smaz, but on further reflection I believe we could get better results with a dictionary optimized specifically for URL substrings. One way Smaz limits itself is that its dictionary can only hold 253 entries. If we instead used a dictionary the size of, say, all the printable Unicode characters, we could likely compress URLs down to a few Unicode characters. I know this might be a bit crazy, since Unicode is not officially supported by the URL RFC (RFC 3986), but most browsers handle it.
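As a sketch of that pipeline, assuming an entirely hypothetical hand-picked codebook of URL substrings, one-byte codes, and base64url as the text encoding: both functions below are pure, so the “shortener” needs no database. Note that literal bytes cost two bytes each here; a real scheme would escape runs of literals more cheaply and would build its codebook from frequency analysis of a large URL corpus.

```python
import base64

# Hypothetical codebook of common URL substrings, for illustration only.
URL_CODEBOOK = [
    "https://www.", "http://www.", "https://", "http://",
    ".com/", ".org/", ".net/", ".html", "/blog/", "index",
    "?id=", "&", "=", "/", ".", "-",
]
ESCAPE = 0xFF

def compress_url(url: str) -> bytes:
    data = url.encode("utf-8")
    out, i = bytearray(), 0
    while i < len(data):
        # Longest codebook entry matching at this position, if any.
        match = max(
            (e for e in URL_CODEBOOK if data.startswith(e.encode("utf-8"), i)),
            key=len, default=None,
        )
        if match is not None:
            out.append(URL_CODEBOOK.index(match))
            i += len(match.encode("utf-8"))
        else:
            out.extend([ESCAPE, data[i]])
            i += 1
    return bytes(out)

def shorten(url: str) -> str:
    # base64url keeps the output safe to embed in another URL; padding dropped.
    return base64.urlsafe_b64encode(compress_url(url)).decode("ascii").rstrip("=")

def expand(short: str) -> str:
    blob = base64.urlsafe_b64decode(short + "=" * (-len(short) % 4))
    out, i = bytearray(), 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out.extend(URL_CODEBOOK[blob[i]].encode("utf-8"))
            i += 1
    return out.decode("utf-8")

url = "https://www.example.com/blog/index.html"
assert expand(shorten(url)) == url
```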

[Thought: Why can’t we have binary-to-text encodings with a radix higher than 64 that use all the printable Unicode characters?]
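A sketch of how that could work: treat the compressed bytes as one big integer and convert it to base N over an alphabet of N printable characters. The alphabet below is an arbitrary run of 4,096 CJK codepoints chosen only for illustration; at base 4,096 each character carries 12 bits instead of base64’s 6, so the text is roughly half as many characters (though more bytes once UTF-8 encoded). A real alphabet would have to be curated to survive URL encoding, normalization, and rendering.

```python
# Base-N text encoding over an arbitrary alphabet of printable characters.
# This alphabet is a run of 4096 CJK codepoints picked only for illustration.
ALPHABET = [chr(cp) for cp in range(0x4E00, 0x4E00 + 4096)]
BASE = len(ALPHABET)

def encode(data: bytes) -> str:
    # A 0x01 prefix keeps leading zero bytes from vanishing in the integer.
    n = int.from_bytes(b"\x01" + data, "big")
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * BASE + ALPHABET.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # strip the 0x01 prefix

blob = b"\x00\x12\x34\x56\x78\x9a\xbc\xde\xff"
assert decode(encode(blob)) == blob
print(encode(blob), f"({len(encode(blob))} chars for {len(blob)} bytes)")
```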