Data Compression Definition: Description, Benefits & Considerations

Data Compression Definition

This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, ScaleFlux Co-Founder and Chief Scientist Tong Zhang offers a data compression definition and a look at the benefits, drawbacks, and possible roadblocks.

The importance of data compression cannot be overstated: The time for companies to adopt modern data compression technology as a default was yesterday.

Data growth has shifted from an interesting curiosity to an alarming, unstoppable force of nature. We are producing data faster than we can process it, and while there is enormous value to be extracted from that data, doing so is becoming increasingly difficult and expensive.

The problem is outpacing the technology meant to solve it, and with the 5G wave only a few years away, the pressure is on. Enterprises are scrambling to cope, frustrated by the complexity of the problem, yet an obvious solution sits in plain sight. It's time to take a hard look at data compression and the assumptions we hold about its costs and benefits.

Data compression has been around for a long time and is generally understood as the transformation of data so that it occupies less storage. A common myth holds that compressing data necessarily compromises speed and performance, but this couldn't be further from the truth. Our assumptions about data have evolved in recent years, in how we generate it, store it, and value it, and data compression is no exception. Data compression looks very different now than it did a decade ago, so to remain competitive we must adopt a modern mindset.

It’s time to explore what types of data are compressible, how data compression has changed, plus the benefits of adopting this strategy. Let’s bust some of the most common data compression myths.

What’s compressible and what isn’t?

With the exceptions of multimedia data (e.g., video and images) and encrypted data, essentially all other types of data are compressible. In fact, it can be argued that a company's most high-value data is inherently compressible. This includes transactional data, sensor data, IIoT (industrial internet of things) logs, and messaging and streaming data. These types of text-like data must be processed in real time so that users can extract their true value from a business perspective. And for companies that use machine learning and artificial intelligence, this data is crucial for training neural models.
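Why is text-like data so compressible while encrypted data is not? Structured records repeat keys, field names, and similar values, giving a compressor redundancy to exploit; encrypted or already-compressed bytes look statistically random. A minimal sketch (using Python's standard zlib, with made-up sensor records and random bytes as a stand-in for encrypted data) illustrates the gap:

```python
import json
import os
import zlib

# Text-like, structured data: repeated keys and similar values compress well.
records = [
    {"sensor_id": i % 32, "temp_c": 20 + (i % 7), "status": "OK"}
    for i in range(2000)
]
text_like = json.dumps(records).encode()

# Random bytes stand in for encrypted (or already-compressed) data:
# there is no exploitable redundancy, so compression gains almost nothing.
random_like = os.urandom(len(text_like))

for label, payload in [("text-like", text_like), ("random", random_like)]:
    ratio = len(zlib.compress(payload)) / len(payload)
    print(f"{label}: compressed to {ratio:.0%} of original size")
```

The exact ratios depend on the data, but the repetitive JSON shrinks to a small fraction of its size while the random buffer stays essentially as large as the original.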

The share of data that is compressible has remained roughly constant, but the volume of data being generated is growing exponentially as the digital universe expands. So the question becomes: How can companies store, process, and make sense of all this data in order to achieve better business outcomes?

New ways of doing compression

There's a common misconception that compressing data results in reduced speed and performance. The misconception has a kernel of truth: compression algorithms are a poor fit for the modern CPU (central processing unit) architecture, and running them in the I/O path consumes cycles the application itself needs. No matter how well engineers architect CPUs, software compression running on the CPU costs throughput and latency.

Thankfully, there's a way around this problem. By moving compression off the CPU, companies gain a slew of benefits (more on this later), including significantly better speed and performance. But where should compression happen if not on the CPU? There are several options, and no one-size-fits-all solution, but SSDs (solid state drives) with built-in transparent compression are an excellent way to overcome speed and performance barriers while also reducing storage costs.

A look at the benefits

Speaking of cost, data compression is the most effective and easiest-to-deploy means of reducing the overall cost of data storage. Other cost-reduction techniques, such as deduplication, are far more complicated to deploy and manage, and can cause marked performance degradation. In addition to lowering storage costs, compressing data can also improve performance and reduce latency.

Let’s take a look at a few scenarios where compressing data can provide significant benefits.

1. Relational databases (e.g., MySQL, PostgreSQL, Oracle, SQL Server). It's well known that relational databases contain highly compressible data. However, users of relational databases rarely compress that data on the CPU because of the performance impact. By deploying a solution that transparently compresses relational database data at zero CPU overhead, users can see over 50% storage cost savings and over a 2x performance improvement.

2. Latency-critical key-value stores (e.g., Aerospike, CacheLib). Due to their latency-critical nature and inherent data structure, key-value stores like Aerospike (widely used in latency-critical systems such as finance) cannot realize CPU-based compression on their own, despite their highly compressible data. By deploying a solution that transparently compresses key-value store data at no CPU overhead, users can enjoy greater than 50% storage cost savings and an over 5x reduction in tail latency.

3. Data streaming platforms (e.g., Apache Kafka). Widely deployed across modern IT infrastructure, data streaming platforms like Kafka consume significant storage and networking resources. Most streaming data is highly compressible, but high-throughput streaming makes it difficult to use CPUs for data compression. By transparently compressing streaming data, users can benefit from over 50% storage cost reduction in addition to networking cost savings.
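The common thread in these scenarios is a tradeoff: software compression saves bytes but spends CPU time in the hot path, which is precisely the cost that transparent in-drive compression avoids. A minimal Python sketch makes both sides of the tradeoff visible, using zlib on a simulated batch of Kafka-style log messages (the data is invented for illustration; the 50% figures above come from the author's scenarios, not from this script):

```python
import json
import time
import zlib

# Simulate a batch of streaming log records (Kafka-style messages).
batch = "\n".join(
    json.dumps({"ts": 1700000000 + i, "user": f"u{i % 500}", "event": "click"})
    for i in range(50_000)
).encode()

# Compress on the CPU and measure both the space saved and the time spent.
start = time.perf_counter()
compressed = zlib.compress(batch, level=6)
cpu_seconds = time.perf_counter() - start

savings = 1 - len(compressed) / len(batch)
print(f"storage/network savings: {savings:.0%}")
print(f"CPU time spent compressing: {cpu_seconds * 1000:.1f} ms")
```

Repetitive, text-like records compress far below half their original size, but every batch costs measurable CPU time; at high message rates that time competes directly with the application, which is why offloading compression to the storage device is attractive.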

Is sustainable data growth possible?

Technologies like 5G are right on the horizon, and companies are about to be hit with a tidal wave of data that will push their storage beyond its physical limits. Companies can no longer rely on their storage solutions to scale out indefinitely. It's time to consider solutions that shrink the data footprint and bring total storage costs under control before the next data explosion takes place. Companies that turn to data compression to tackle this problem proactively will come out on top while also reaping the benefits of increased speed and performance.

Tong Zhang