Hardware compression algorithms

Posted on Jan 14, 2022

Compression is a common technique that allows us to reduce the size of data. We use it when we store or send files as zip, 7z or rar archives. It helps reduce the space occupied by files in the filesystem and also to save bandwidth while transferring over a network.

In general, compression algorithms try to encode redundancy or regularity in data with lesser bits.

There are many different software compression algorithms available. Lempel-Ziv family of algorithms, used in tools like zip and 7z, works by replacing repeated chunks of data with a pointer to an earlier occurrence. This avoids repeating the same chunks of data over and over again.

Compression can also be applied at the hardware level, i.e. in microprocessors, system buses (PCIe), etc. But algorithms like Lempel-Ziv are too complex to be implemented at this level. When implementing at hardware level, we need simple, low latency and parallelizable algorithms.

In this post, we will look at some of the hardware compression algorithms that have been proposed in literature for cache, memory and link compression.

Applications of compression in hardware

Caches are effective in hiding the long latency access to memory. Their effectiveness can get reduced if the cache capacity is small or if the application exhibits irregular memory access patterns. Cache compression allows us to store more cachelines and thus increase the effective cache capacity.
Link compression compresses the data before sending and after receiving it over a communication channel (system bus or over a network). This reduces the bandwidth and power consumed by the data transfer.
Memory compression is used to increase the effective main memory capacity. Memory compression can be helpful in embedded systems that have limited memory.

Redundancy and Regularity in data

Hardware compression algorithms work by exploiting observations in data that show redundancy or regularity. Some of the observations that have been used to design hardware compression algorithms are:

Zeroes

There are a lot of zero values observed. Some reasons for this are:
- During allocation, most operating systems fill pages with zeroes before handing it over to an application to avoid memory leaks.
- Variables are usually initialized with zeros, either by the compiler or by the programmer.
Most Significant Bits are unused

Even though C compilers usually allocate 4 bytes of memory for an integer, most applications rarely use values that cannot be represented with one or two bytes. So the most significant bits are filled with continuous 0s or 1s depending on the sign extension.
Low Dynamic Range (Value locality)

Another observation is that most values that are stored together are close to each other in numeric value. The difference (or Delta) between adjacent or close-by values can be represented with lesser bits than representing the whole numbers.

Algorithms

Frequent Pattern Compression

Frequent pattern compression tries to encode common patterns within 4 Byte words. It exploits the presence of repeated zero words and the fact that most significant bits are usually not used. It uses 3 bits of metadata to indicate which pattern to use for the 4B word and some extra data bits depending on the pattern matched with.

The seven patterns that FPC uses are given in the table below (which is taken from the paper):

Prefix (bits)	Pattern	Data size
000	Zero run	3 bits (up to 8 zeroes)
001	4 bit sign extended	4 bits
010	1 byte sign extended	8 bits
011	Halfword sign extended	16 bits
100	Halfword padded with zero halfword	16 bits (non zero halfword)
101	Two halfwords, each a byte sign extended	16 bits (the two bytes)
110	Word with repeated bytes	8 bits
111	Uncompressed word	32 bits

Some 4B words and their FPC compressed form is shown below:

4B word (hex)	Prefix (bits)	Extra data (hex)
00000007	001	7
00000071	010	71
FFFF8000	011	8000
94949494	110	94

Base Delta Immediate

Base Delta Immediate compression tries to compress data by using low dynamic range. The algorithm stores a cacheline as a 4B base value followed by the deltas, i.e. the difference between the actual data and the base value. If there is high value locality, the deltas can be represented with less number of bits than required for 4B values.

For instance, consider a 32B data containing the following 4B values

0x00203008	0x00203009	0x0020301F	0x00203020	0x00203000	0x002030F0	0x002030EE	0x00203003

After applying BDI compression, we can store it like this:

0x00203008	0x00	0x01	0x17	0x18	0xF8	0xE8	0xE6	0xFB

By storing 1B deltas instead of the 4B words, we save 20 bytes.

CPack

Cache Packer algorithm is a dictionary based algorithm for hardware compression. During compression, It constructs a dictionary consisting of 4B words already seen during compression. If new 4B word matches partially or fully with a 4B word present in the dictionary, then the algorithm stores metadata (to identify which parts match with the dictionary word), index pointing to the matched dictionary entry and the bytes that differ.

The CPack pattern encodings for a dictionary size of 16, are shown below (taken from the paper):

Prefix (bits)	Pattern	Additional Data
00	zzzz (All zeroes)
01	xxxx (No match)	4B data
10	mmmm (Full match)	4b index
1100	mmxx (2B Partial match)	4b index + 2B data
1101	zzzx (1B word)	1B data
1110	mmmx (1B Partial match)	4b index + 1B data

For example, consider the following 4 words:

0x12345678	0x000000F0	0x12340608	0x12345678

When compressed with CPack, we get:

0x12345678	(1101) 0xF0	(1100) (0000) 0x0608	(10) (0000)

The dictionary is not stored as it is not needed. Because when compressing a 4B word in a cacheline, It is only compared with dictionary entries that were added before (never after) this current word is compressed. So during decompression, the dictionary is reconstructed and when a word is about to be decompressed, the dictionary values it needs will be in the reconstructed dictionary.

Bit Plane Compression

Bit Plane Compression is an algorithm that works at the bit level, unlike the algorithms discussed above. It exploits all three sources of redundancy: zeroes, unused most significant bits and value locality.

It performs a Delta-Bitplane-XOR transform operation on data to increase its compressibility. The Delta-Bitplane-XOR transform creates continuous runs of 0s and 1s which can then be compressed with run length encoding. For a 128B (GPU) cacheline, this transform consists of the following three steps:

Delta

The cacheline is split into 4B words. The difference or delta of each 4B word with that of the previous word is calculated. After this step, the cacheline data of 32 4B words, becomes 31 deltas along with the first 4B word. This step exploits value locality in data.
Bitplane

Bitplane operation is a bitwise operation that looks similar to a matrix transpose operation. This is applied on the deltas from the previous stage to create continuous runs of 0s or 1s.
XOR

The final step performs the XOR of each adjacent bitplane value from the previous step.

BPC works best when the data consists of homogeneous structures. Compared to the above algorithms, BPC has higher compression latency (11 DRAM cycles) but gives higher compression ratio, especially on GPU workloads, where homogeneous structures are more common.

Byte Select Compression

Byte Select Compression is a class of algorithms that uses a set of byte patterns to compress data. Each cacheline is compared to see if it matches any of the used patterns. If it matches a pattern, only the bytes required to reconstruct it, is stored along with metadata to identify the pattern matched. For example, if we use the pattern ABAB, where A, B are different bytes that are repeated in that word, we only have to store the two bytes A and B, and also store the pattern number of ABAB.

The efficacy of byte select compression depends on the set of patterns used. And for this, the researchers have come up with a technique to find the most frequent byte patterns from trace data. By collecting cache traces of applications, we can find the simplest describing pattern for each cacheline. Once we get the list of patterns, now the problem is to select a small set of patterns with high frequency of occurrence, which will be used for byte select compression.

This differs from other algorithms, in that other compression algorithms depend on human observation of patterns. Finding patterns from application data reveals the intricate, non obvious patterns, which when used for compression gives higher compression ratio.