Solutions Review spoke with Douglas Chando, solutions architect at Cobalt Iron, about the pros and cons of using deduplication appliances for backup.
For people who aren’t aware, what is a deduplication appliance?
A deduplication appliance is a device that was intended to enable organizations to eliminate their dependency on tape by more efficiently writing data, or backup copies, to disk. One of the more well-known deduplication appliances in the market is Data Domain, and their tagline for one of their original campaigns was “Tape sucks”.
Data Domain, like any other deduplication appliance, is a pre-configured device that combines compute, storage, and an operating system. It is purpose-built for ingesting data, finding commonality in the blocks of data, and reducing common data to a single instance, thereby shrinking the storage footprint. One of the most common use cases for these appliances is as a backup storage repository, and because backups are so duplicative by nature, deduplication had a huge impact on how efficiently backup data is stored.
How does deduplication work for backup data?
Great question. The purpose of deduplication devices is to take blocks of data, compare them to other blocks already on the device, and find common blocks so that redundant data can be eliminated. This is typically done by fingerprinting each block with a hashing algorithm such as SHA-1 or MD5. While this was a great approach when deduplication first appeared over 15 years ago, the downside is that the process can be very resource-intensive. Dedicating hardware resources to deduplication ensured repeatable and predictable performance – but technology has moved on quite significantly over the years.
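The core idea can be shown in a few lines of Python. This is a minimal sketch, not how any particular appliance implements it: fixed-size 4KB blocks and SHA-256 are illustrative assumptions (real appliances often use variable-size chunking and their own hash choices).

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each unique block only once.

    Returns a block store (hash -> block) plus a 'recipe' (ordered list of
    hashes) from which the original data can be reassembled.
    """
    store = {}
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()  # fingerprint the block
        if digest not in store:                     # only new blocks consume storage
            store[digest] = block
        recipe.append(digest)
    return store, recipe

def reassemble(store, recipe):
    """Rebuild the original byte stream from the store and recipe."""
    return b"".join(store[d] for d in recipe)
```

With highly repetitive input – the norm for backup data – the store holds far fewer blocks than the recipe references, which is exactly the storage saving the appliance delivers.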
At their advent, these deduplication appliances did what is commonly referred to as post-process deduplication, meaning that the data was sent to the device and was deduplicated afterward. Although deduplication did a great job in reducing the data that was stored on the device, it meant that redundant data flooded the network on a routine basis.
Can you give an example?
Sure thing. To illustrate, let’s say that every night when a backup runs, it sends its full 500GB data set. However, when the data reaches the dedupe device, the device writes only the 10MB of changes that might have occurred. This isn’t exactly ideal, given how long the process takes and how taxing it can be on a network when the same thing is occurring across many systems.
As time went on, this network inefficiency became a real problem, and with data increasing exponentially, it wasn’t something that was going away. To address it, Data Domain introduced a technology called Boost, client software that examined data before it was sent to the appliance. If a block was a duplicate of one the appliance already held, it would not be sent – this is what is commonly called client-side or source-side deduplication.
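The source-side approach described above can be sketched as follows. This is an illustration of the general technique, not Boost's actual protocol; the function name, fixed-size blocks, and the shared hash set standing in for the appliance's index are all assumptions for the example.

```python
import hashlib

def client_side_send(data: bytes, appliance_hashes: set, block_size: int = 4096):
    """Sketch of source-side deduplication: fingerprint each block on the
    client, check it against the appliance's known hashes, and transmit
    only blocks the appliance has never seen.
    """
    to_send = {}
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in appliance_hashes:   # appliance has never seen this block
            to_send[digest] = block          # only unique new data crosses the network
            appliance_hashes.add(digest)
        recipe.append(digest)                # recipe still references every block
    return to_send, recipe
```

On the first backup everything unique is transmitted; on the next run of the same data, nothing crosses the network except the block recipe – the 500GB-in, 10MB-written scenario above collapses to roughly 10MB both written *and* sent.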
While it was a needed enhancement to deduplication technology, the question soon arose from a budgeting standpoint: “If I am using the compute resources on my server to perform deduplication, why do I need a costly purpose-built appliance to do this?” Because of the high cost, combined with many backup solutions having a form of software-based deduplication, data professionals have started to question if a dedicated deduplication appliance is still necessary.
With what you have pointed out, why do you believe dedicated deduplication appliances are still in such heavy use?
Initially, the advantages of using a deduplication device were very obvious. Organizations were looking to get away from tape backup technologies, but writing all of their data to disk was cost-prohibitive. Deduplication enabled them to consume a fraction of the disk, thereby affording them the opportunity to consider a disk-based backup strategy. However, with advancements in technology, the benefits of a dedicated appliance specifically for the purpose of deduplication have diminished significantly.
I attribute the continued demand for dedicated deduplication appliances to legacy solutions that are heavily embedded in an organization’s infrastructure, combined with diehard advocates of these appliances. Many organizations continue to use deduplication devices because they follow the old adage, “if it ain’t broke, don’t fix it”. However, as discussed earlier, data size increases will push this aging technology to its limits, and the corrective action of refreshing the technology will make it cost-prohibitive to stay the course.
Other than what you have already mentioned, what are the disadvantages of using dedicated deduplication?
Here are what I consider the main disadvantages:
- Cost – In a world where backup storage costs are measured in $/TB, these appliances are by far the most expensive option, often priced per TB close to primary storage.
- Security – These devices are managed separately from both the storage and backup landscapes, and as a result they are frequent targets for malicious attacks looking for open or less secure CIFS and NFS shares.
- Performance – These devices are optimized for writing data, but when data needs to be recovered, they often perform at sub-optimal speeds.
- Network impact – Because these devices primarily do post-process deduplication, they burden the network when large data sets are routinely sent to them.
- Redundant technology – Deduplication is still a much-needed technology today, but the reality is that many solutions now have deduplication natively built in.
I like to compare deduplication devices versus modern software-based deduplication to typewriters versus laptops. Back in the day, you needed a typewriter – everyone had at least one in their house, and you couldn’t get any professional work done without one. Ask anyone today whether they use a typewriter or a computer for work. Unless it’s an ironic art piece sitting on a shelf, they are going to tell you they get their work done on a computer. The same can be said of deduplication devices: at one point they were a good technology, but they have been replaced by other solutions that do the same thing and more.
For anyone considering switching off of their dedicated deduplication solution, what would you want them to consider?
For me, I would want anyone looking at their current offering to ask themselves these questions:
- How has digital transformation and data growth impacted the use of my deduplication devices?
- With governance and compliance regulations changing, how will my organization handle long-term retention with dedupe devices?
- With the cost of storage being as inexpensive as it is, are dedupe appliances even necessary?
- Can cloud storage effectively augment/replace my deduplication devices?
- If I use my dedupe appliance as a VTL (virtual tape library), are there other solutions out there that do this as well?
- With so many cyber threats and attacks in the news, are dedupe appliances adding any additional protection or security to my data?
Learn more about Cobalt Iron.