LZ77
LZ77 is a lossless, streaming compression algorithm invented in 1977. The central concept is its use of a sliding window: a fixed-size, sequentially ordered buffer, sometimes referred to as the history buffer. Compression starts by reading in a string of bytes from the stream, up to the size of the sliding window, and checking whether the history buffer contains an identical sequence of bytes. With a match, the algorithm reads the next byte from the stream and uses it as the `token` field, together with the position and length of the matching byte block (in the history buffer) as the `prefix` field. Without a match, the algorithm checks each shorter substring that begins at byte zero of the byte stream to see if any of those match. If none of the substrings match, byte zero (from the byte stream) becomes the `token` field in the next output compression packet, and the `prefix` is empty, which might be expressed as `null`.
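To make that loop concrete, here is a minimal sketch in JavaScript. The packet shape (`{ prefix, token }`), the window size, and the greedy longest-match search are my assumptions for illustration, not the exact code from the repositories linked below.

```js
// Minimal sketch of the compression loop described above. The packet
// shape and WINDOW_SIZE are assumptions, not the format used by the
// linked repositories.
const WINDOW_SIZE = 4096;

function compress(input) {
  const packets = [];
  let history = ''; // the sliding window / history buffer
  let pos = 0;

  while (pos < input.length) {
    // Find the longest substring starting at byte zero of the remaining
    // stream that also occurs somewhere in the history buffer.
    let matchLen = 0;
    let matchOffset = -1;
    for (let len = 1; pos + len < input.length; len++) {
      const offset = history.indexOf(input.slice(pos, pos + len));
      if (offset === -1) break;
      matchLen = len;
      matchOffset = offset;
    }

    // With a match, the byte immediately after the matched block becomes
    // the token; without one, byte zero is the token and the prefix is null.
    const prefix = matchLen > 0 ? { offset: matchOffset, length: matchLen } : null;
    const token = input[pos + matchLen];
    packets.push({ prefix, token });
    pos += matchLen + 1;

    // Only the token is pushed onto the history buffer; the oldest
    // bytes fall off the end once the window is full.
    history = (history + token).slice(-WINDOW_SIZE);
  }
  return packets;
}
```

Running `compress('the the the the the')` reproduces the seven packets in the table further down this post.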
Decompression starts by checking the `prefix` field of the current packet; if it finds an entry, it reads the indicated number of bytes from the history buffer, starting from the indicated position. Any prefix byte string is output first, with the `token` immediately after it.
Because it's only the tokens that get pushed onto the history buffer, you guarantee that the buffer content will be identical whether it is generating (compression) or consuming (decompression) the packets.
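Here is a matching decompression sketch, under the same assumptions as the compression example above; note the last line of the loop, which keeps the two history buffers in lock-step.

```js
// Decompression mirrors compression: output the prefix bytes read out
// of the history buffer, then the token, and push only the token back
// onto the buffer so both sides stay in sync.
const WINDOW_SIZE = 4096; // must match the compressor's window size

function decompress(packets) {
  let history = '';
  let output = '';
  for (const { prefix, token } of packets) {
    if (prefix !== null) {
      // Copy the indicated byte block out of the history buffer.
      output += history.slice(prefix.offset, prefix.offset + prefix.length);
    }
    output += token;
    history = (history + token).slice(-WINDOW_SIZE);
  }
  return output;
}
```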
I've implemented a small demonstration of the compression process here; I haven't bothered with decompression because it's very easy to implement. (I did implement decompression in this prototype here, but just know I consider it a low-quality prototype.)
The first git repository implements the sliding window via a custom library I created, called tiny-toatie-cache, as its storage engine. The second repository is a vanilla implementation of LZ77.
tiny-toatie-cache (TTC) is a fixed-size storage engine that lets you append new bytes to the front and search for specific byte strings. When the internal storage is full, it deletes the oldest items first to make space for new bytes at the front. The find method proxies requests through its internal caching system. The cache is designed to remember the offset and length of matching byte strings it has previously found in its internal storage (which behaves exactly like a sliding window). The offsets of cached records are automatically corrected when new bytes push the buffer contents back, and cache records are transparently invalidated when their offsets "fall off the end" of the buffer to make space for new bytes at the front.
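A hedged sketch of how a compressor might drive TTC. The constructor signature and the exact return shape of `find()` are my assumptions based on the description above, not the library's documented API:

```js
// Hypothetical usage of tiny-toatie-cache; method names follow the
// description above, but the exact signatures are assumptions.
const Cache = require('tiny-toatie-cache');

const window = new Cache(4096); // fixed-size storage engine

// find() proxies through the internal caching system: a repeated
// search for the same byte string can be answered from the cache
// instead of re-scanning the buffer.
const hit = window.find(Buffer.from('the '));
if (hit) {
  console.log(hit.offset, hit.length); // position/length of the match
}

// Appending pushes new bytes onto the front; once the storage is full,
// the oldest bytes are evicted, cached offsets are corrected, and any
// record whose offset falls off the end is invalidated.
window.append(Buffer.from('t'));
```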
The caching idea turned out to be a bit of a bust, because you don't get many duplicate prefixes unless the source file has really low entropy (say, if you used `dd` to generate a zeroed file). Consider the input `the the the the the` and the output packets:
Token | Prefix |
---|---|
't' | null |
'h' | null |
'e' | null |
' ' | null |
't' | 'the ' |
'h' | 'he t' |
'e' | 'e th' |
If you look at the last three packets, you get an idea of how the prefixes go through a sort of 'phase shift': each match searches the window for a different byte string, which leads to very frequent cache misses on prefix-search caching. In retrospect it seems obvious, but doesn't it always?
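A toy illustration of why the cache rarely pays off: a cache keyed on the searched byte string only helps when the exact same prefix is looked up twice, and the 'phase shift' means consecutive lookups are all distinct. (The `Set` here is just a stand-in for TTC's internal cache.)

```js
// The three prefix lookups from the table above: every one is distinct,
// so a cache keyed on the search string never gets a hit.
const lookups = ['the ', 'he t', 'e th'];
const seen = new Set(); // stand-in for TTC's internal cache
for (const key of lookups) {
  console.log(`'${key}' -> ${seen.has(key) ? 'cache hit' : 'cache miss'}`);
  seen.add(key);
}
// => all three are misses: the window 'rotates' under the stream,
//    shifting the phase of each subsequent prefix.
```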
The table below is the verbatim output of `lz77-nodejs-streams`: