"A large number of the problems with tape are fixed by disk storage. However, disk storage is too expensive due to backup retention unless you only store unique data to reduce the total amount of storage required." - Bill Andrews, president and CEO, ExaGrid Systems
In the last two sections, we looked at why backup is such a big issue and how organisations have now started to move from using tape for backup to using disk. Although disk eliminates the challenges of tape, backup retention (history and versions) means that straight disk quickly becomes too expensive. Disk-based backup appliances with deduplication solve the backup problems and do so cost-effectively. We will compare scale-up and scale-out architectures, as well as the different approaches to deduplication and their impact on backup performance, backup window length, restore performance, offsite tape copy performance and VM instant recoveries.
In this section we will consider how organisations can use disk exclusively for backup, explain the various approaches to deduplication, and compare the different architectures and their impact on system scalability.
To comprehensively address the problems of tape, the optimal solution is to back up all data to disk onsite, replicate the backup data to disk offsite, and eliminate tape entirely along with its associated drawbacks, provided, as discussed previously, that the cost is equivalent to that of tape.
Since disk costs more per gigabyte than tape, the only answer is to find a way to use far less disk to store the same amount of data. If straight disk costs 20 times the price of an equivalent-sized tape library plus tapes, then if you can store the total amount of data on 1/20th of the disk, the costs will be equivalent.
Example: Assume you have 20TB of primary data onsite and you keep two weeks of nightly backups at 5TB/night (25 per cent of the total backup) with 10 weekly full backups. Total disk required onsite to back up 20TB of primary data is:
Two weeks of nightly backups (8 x 5TB) = 40TB
10 weekly full backups (10 x 20TB) = 200TB
Total: 240TB
It is clearly not reasonable to place 240TB of disk behind the backup server to back up 20TB of primary storage. The disk cost alone is prohibitive, not to mention rack space, power and cooling.
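For readers who want to check the arithmetic, here is a minimal Python sketch of the straight-disk retention calculation, using only the figures from the example above.

```python
# Straight-disk retention requirement from the example above:
# 20TB primary data, 5TB nightly incrementals, two weeks of nightlies
# (8 nights) and 10 retained weekly full backups.

def straight_disk_retention_tb(full_tb, nightly_tb, nightly_count, weekly_fulls):
    """Total disk needed when every backup is stored in full, with no deduplication."""
    nightly_total = nightly_tb * nightly_count   # 5TB x 8  = 40TB
    full_total = full_tb * weekly_fulls          # 20TB x 10 = 200TB
    return nightly_total + full_total

print(straight_disk_retention_tb(full_tb=20, nightly_tb=5, nightly_count=8, weekly_fulls=10))
# -> 240 (TB), roughly 12 times the 20TB of primary storage being protected
```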
This is where data deduplication comes into play. In the preceding example, each week you are backing up a 20TB full, and after 10 weeks you have 200TB of data just for the full backups. However, as you back up from week to week, industry statistics show that only about two per cent of the data actually changes. So, with straight disk, you are in effect storing the entire 20TB each week when only 400GB has changed.
Disk-based backup with deduplication compares incoming backup data to the backup data previously stored, and only stores the changes or unique data from backup to backup. This dramatically changes the economics of disk backup, as the following example shows:
Example: To keep it simple, we will only illustrate the weekly full backups. If in week one you send in a 20TB full backup, and then each subsequent week you send in the same 20TB with only two per cent of the data (400GB) having changed, then you only need to store the 400GB that changes each week. Total disk required with deduplication onsite to back up 20TB of primary data with 10 weekly fulls is:
First full backup deduplicated by 2:1 (10TB) = 10TB
Nine subsequent weekly fulls deduplicated by 50:1 (400GB each) = 3.6TB
Total: 13.6TB
In this simple example, the 13.6TB of storage required with deduplication is about seven per cent of the 200TB required with straight disk if all 10 weekly fulls were stored at 20TB each.
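The same retention calculation with deduplication applied can be sketched as follows, using the 2:1 first-full and 50:1 subsequent-full ratios from the example above.

```python
# Disk required for 10 weekly fulls when only unique data is stored.
# The first full deduplicates at roughly 2:1; each later full adds only the
# ~2 per cent of data that changed (a 50:1 reduction), per the example above.

def dedup_retention_tb(full_tb, weekly_fulls, first_full_ratio=2.0, repeat_ratio=50.0):
    first_full = full_tb / first_full_ratio                      # 20TB / 2  = 10TB
    later_fulls = (weekly_fulls - 1) * (full_tb / repeat_ratio)  # 9 x 0.4TB = 3.6TB
    return first_full + later_fulls

with_dedup = dedup_retention_tb(full_tb=20, weekly_fulls=10)
straight = 20 * 10                      # 200TB for 10 fulls stored on straight disk
print(round(with_dedup, 1), round(100 * with_dedup / straight, 1))
# -> 13.6 (TB), about 6.8 per cent of the straight-disk figure
```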
Of course, there is much more to this, which is explained in the sections that follow. In fact, the deduplication algorithms in the leading products on the market offer anywhere from a 10:1 data reduction ratio to as much as 50:1, with an average reduction of 20:1.
The range of deduplication from 10:1 to as much as 50:1 is due to two factors:
The deduplication product itself - some products offer a more effective and refined approach to data deduplication than others
The mix of data types, which changes the overall deduplication ratio:
Compressed data and encrypted data do not deduplicate at all, so the deduplication ratio is 1:1
Database data has a very high deduplication ratio and can achieve ratios of over 100:1 to over 1,000:1
Unstructured data may get 7:1 to 10:1
Data from virtualised servers often has a significant amount of redundancy and can get very high deduplication ratios
When you have a normal mix of databases, unstructured data, VMs and compressed or encrypted data, the leading products will deliver an average deduplication ratio of about 20:1.
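To see why the mix matters so much, the blended ratio can be worked out as a weighted calculation across data types; the mix percentages and per-type ratios below are illustrative assumptions, not figures from this guide.

```python
# Blended deduplication ratio for a mix of data types.
# If a fraction f of the backup stream deduplicates at ratio r, the bytes
# actually stored are f / r, and the blended ratio is 1 / sum(f / r).
# The mix and per-type ratios below are illustrative assumptions only.

data_mix = {
    # data type:            (fraction of backup, deduplication ratio)
    "databases":            (0.35, 100.0),
    "virtual machines":     (0.30, 50.0),
    "unstructured files":   (0.30, 8.0),
    "compressed/encrypted": (0.05, 1.0),   # does not deduplicate at all
}

stored_fraction = sum(f / r for f, r in data_mix.values())
print(round(1 / stored_fraction, 1))   # -> roughly 10:1 for this particular mix
# Note how the 5 per cent of compressed/encrypted data alone accounts for over
# half of the stored bytes and drags the blended ratio down sharply.
```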
Deduplication is accomplished in three major ways in the industry, each delivering a different average data reduction for an average mix of data types.
The reason why the average data reduction matters is that a greater average data reduction uses less disk for longer term retention as data grows. It also impacts how much bandwidth is required to replicate data offsite for disaster recovery. Some may look at the up-front cost of the system and make their buying decision based on initial price. However, if the lowest priced system has a poor deduplication ratio, then it will prove to be far more expensive over time due to the cost of disk (since the amount of data and retention will continue to grow), as well as the additional WAN bandwidth that will be required.
In evaluating disk backup with deduplication solutions, you need to first understand the approach the product is using. In order to save both disk capacity and WAN bandwidth over time, choose a product that gives you the greatest deduplication ratio. Ask about the specific deduplication approach, as some marketing literature will quote high deduplication ratios based on the best-case data type rather than the mix of data encountered in a real production environment. For example, industry marketing literature may claim deduplication ratios of "up to" 50:1.
The key phrase is "up to", as these ratios are based on data types (e.g. databases) that achieve a high deduplication ratio. Your organisation needs to know the deduplication ratio for a standard mix of production data, including databases and VMs as well as unstructured, compressed and encrypted data.
To summarise, not all products achieve the same deduplication ratios, since they use different algorithms for performing deduplication. The lower the deduplication ratio, the more disk capacity and WAN bandwidth is required over time, resulting in higher overall costs. The true cost is the price up front and the cost of disk and bandwidth over time.
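To see how the deduplication ratio translates into disk consumed over time, here is a minimal sketch comparing a 10:1 system with a 20:1 system; the retained-data figure and growth rate are illustrative assumptions rather than figures from this guide.

```python
# Disk consumed over time by two systems with different deduplication ratios.
# The retained-data figure and growth rate below are illustrative assumptions.

def disk_needed_tb(logical_retained_tb, dedup_ratio):
    """Physical disk required to hold the retained backup data."""
    return logical_retained_tb / dedup_ratio

retained_tb = 400          # assumed logical backup data under retention today
growth_per_year = 0.30     # typical primary-data growth cited in this guide

for year in range(4):
    logical = retained_tb * (1 + growth_per_year) ** year
    print(year,
          round(disk_needed_tb(logical, 10), 1),   # system with a 10:1 ratio
          round(disk_needed_tb(logical, 20), 1))   # system with a 20:1 ratio
# The 10:1 system needs twice the disk at every point, and the gap in absolute
# terabytes widens every year as the retained data grows.
```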
Once you understand the various deduplication approaches and their impact on disk and bandwidth, you need to understand the different architectures for implementing disk-based backup with deduplication. The choice of overall architecture is important because it makes a significant difference to overall backup performance, to keeping a fixed-length backup window that doesn't expand as data grows, and to the ability of the system to scale without expensive forklift upgrades along the way.
In the average IT environment, primary storage data is growing by about 30 per cent per year and in some cases by up to 50 per cent per year. At 30 per cent per year, the data will double about every 2.5 years. So, if you have 30TB of data today, in 2.5 years you will have 60TB. The data growth needs to be taken into account to ensure the system can handle your data now, as well as your data growth in the future. Avoid choosing a system that hits the wall and requires you to replace it before it's been amortised.
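As a quick sanity check on that growth arithmetic, the doubling time at a given annual growth rate can be worked out with the standard compound-growth formula.

```python
import math

# How quickly data doubles at a given annual growth rate.
growth = 0.30
doubling_years = math.log(2) / math.log(1 + growth)
print(round(doubling_years, 1))   # -> about 2.6 years, i.e. roughly the 2.5 years cited above

# Projected primary data from the 30TB example in the text
today_tb = 30
print(round(today_tb * (1 + growth) ** doubling_years))   # -> about 60TB
```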
Before discussing the pros and cons of the two different architectural approaches — "scale-up" and "scale-out" — let's first define the terms:
"Scale-up" typically refers to architectures that use a single, fixed resource controller for all processing. All data flows through a single front-end controller. To add capacity, you attach disk shelves behind the controller up to the maximum for which the controller is rated. As data grows, only disk capacity is added; however, no additional compute resources (processor, memory, bandwidth) are added to deduplicate the increasing data load.
"Scale-out" typically refers to architectures that scale compute performance and capacity in lockstep by providing full servers with disk capacity, processor, memory and network bandwidth. As data grows, all four resources are added: disk capacity, processor, memory and network bandwidth
See the following diagram for depictions of scale-up versus scale-out approaches:
Unlike primary storage — where data is simply stored, and as your data grows you just add more storage — in the case of disk backup with deduplication, all of the incoming data must be compared against the data already stored before it is deduplicated and stored. This is the first time in storage where processing of the data takes place before storing it. As a result, compute resources (processor, memory and bandwidth) are required in addition to storage. Furthermore, as the data grows, the amount of existing data to be compared against grows; therefore, an increasing amount of compute resources (processor, memory and bandwidth) is required.
Because the deduplication process of comparing incoming backup data to previously stored data is computationally intensive, it is logical that the more data that comes into the disk-based backup system, the more compute power is required to compare and deduplicate the incoming backup data against the increasing amount of data already stored. Ingesting, comparing and deduplicating twice the data with a fixed amount of compute resources takes roughly twice as long. Therefore, systems that do not scale with additional processor, memory and bandwidth will cause the backup window to expand as data grows. The only way to avoid backup window expansion with data growth is to add compute, memory and bandwidth resources along with disk capacity.
In the scale-up model (front-end controller with disk shelves), as data grows, the backup window gets increasingly long until the backup window is so long, it runs into end-user production time. To bring the backup window back into the allotted backup timeframe, the front-end controller needs to be upgraded to a much more powerful and faster controller. This is called a "forklift upgrade".
The cost of a forklift upgrade can be as much as 70 per cent of the original cost of the system, since the controller is the most expensive component and the new, more powerful controller costs more than the original. The total cost of this approach is therefore not just the initial up-front price, but also the cost of the subsequent forklift upgrade. Scale-up systems usually quote raw storage. As a general rule, for systems that quote raw storage, you can actually only store a full backup of about 60 per cent of the raw capacity. For example, if the raw storage quoted is 32TB, then you can store about a 20TB full backup.
Example: Let's say that you purchase a scale-up system that can expand to 76TB of raw disk storage capacity. The system will be able to take in a full backup that is about 60 per cent of the raw storage, or about 47TB. If you have a 28TB full backup and your data grows at 30 per cent, then you will be facing a forklift upgrade in about 23 months.
The net is that in 23 months you will need to replace the front-end controller with a faster, more powerful (and far more expensive) new front-end controller.
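Using the figures from the example above (76TB raw, roughly 60 per cent usable, a 28TB full backup and 30 per cent annual growth), the runway to a forklift upgrade can be estimated with a short compound-growth calculation.

```python
import math

# Months until a scale-up system runs out of usable capacity, using the
# example figures above: 76TB raw, ~60 per cent usable for a full backup,
# a 28TB full backup today and 30 per cent annual data growth.

def months_until_full(raw_tb, usable_fraction, current_full_tb, annual_growth):
    usable_tb = raw_tb * usable_fraction   # roughly 46TB of the 76TB raw
    years = math.log(usable_tb / current_full_tb) / math.log(1 + annual_growth)
    return years * 12

print(round(months_until_full(76, 0.60, 28, 0.30)))
# -> about 22 months, in line with the roughly 23 months cited above
```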
In the scale-out model, since full servers are added as data grows, if the data doubles, then the disk, processor, memory and bandwidth all double. If data triples, then all four resources triple and this continues as you keep growing data and adding scale-out appliances. Servers are added into a grid architecture with a single management interface and automatic load balancing to distribute the data and processing across all the servers. This approach has two major advantages. The first is that the backup window stays fixed in length — as the workload increases, so do all the resources, including disk and compute capacity. The second advantage is that there are no forklift upgrades. The system comes with disk and associated compute capacity, thereby scaling easily with data growth.
The architectural approach impacts planning up-front and over time. With a scale-up system, you risk over-buying or under-buying. You need to forecast data growth to ensure that you are not buying a costly system with excessive capacity relative to your needs today, or buying an under-sized system that will need a forklift upgrade in just three to six months as your data grows. With a scale-out approach, you don't need to plan, because as your data grows, the appropriate compute resources are supplied along with disk capacity. This approach, therefore, naturally scales with data growth.
The net is that the architectural approach is critical to the backup window and whether it stays fixed in length or keeps expanding. The architectural approach also affects long-term costs, since hidden down-the-road controller forklift upgrades are extremely expensive.
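The effect of the two architectures on the backup window can be modelled very simply as data divided by ingest rate; the ingest rates and the per-appliance sizing below are illustrative assumptions only.

```python
import math

# A simple model of backup-window growth: window length = data / ingest rate.
# The ingest rates and appliance sizing are illustrative assumptions.

def backup_window_hours(data_tb, ingest_tb_per_hour):
    return data_tb / ingest_tb_per_hour

data_tb = 30
fixed_rate = 5.0          # scale-up: one controller, ingest rate never changes
rate_per_appliance = 5.0  # scale-out: each added appliance brings its own ingest rate

for year in range(4):
    data = data_tb * 1.30 ** year                 # 30 per cent annual growth
    appliances = max(1, math.ceil(data / 30))     # add an appliance per ~30TB (assumption)
    print(year,
          round(backup_window_hours(data, fixed_rate), 1),                       # keeps growing
          round(backup_window_hours(data, rate_per_appliance * appliances), 1))  # stays bounded
# With a fixed controller the window grows from 6 to over 13 hours in three
# years; with appliances added alongside the data it stays in the 4-6 hour range.
```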
The key reason for performing backup is so that critical data can be recovered quickly in the event of human error or a system failure. Full system restores are the most critical restores, as hundreds to thousands of employees at a time can be down when a full system is down. The longer it takes to recover, the more lost hours of productivity. Two disk backup architectures are commonly used, with different resulting implications for restore performance. Inline deduplication is performed between the backup/media server and the disk, and data is stored in a deduplicated form. Post-process deduplication is performed after data lands to disk, and a full copy of the most recent backup is stored in a landing zone.
Inline deduplication is used by nearly all primary storage vendors and backup software-based deduplication implementations. With this approach, incoming backup data is deduplicated on the fly, before it lands on disk. This approach is faster than deduplication on the media server because it offloads processing from the media server and lets it stay dedicated to its task. However, inline deduplication can potentially cause a bottleneck at the point where data is streaming into the deduplication appliance, resulting in slower performance and a longer backup window. The greater impact is seen by users during recovery, because this method stores all data in a deduplicated form and keeps no full copy of the backup data. This slows down restores because a given backup must be "rehydrated" - or put back together - from its deduplicated state before it can be restored. This approach also does not support the instant recovery techniques available from an increasing number of backup applications today. Instant recovery allows for a granular-level restore of a single file, a single VM or an object down to an individual email. Because the data must first be rehydrated, it can take an hour just to restore a single file.
Post-process deduplication with a landing zone allows backups to land on disk prior to deduplication. Deduplication is performed after the backup is complete, so backups are faster. This approach keeps the most recent backup on disk in a landing zone, in its complete and undeduplicated form. Keeping a copy of the most recent backup in its full form in a landing zone – versus only deduplicated data that always needs to be rehydrated – produces significantly faster traditional full backup and image restores, much faster tape copies, and instant recoveries of files, objects and VMs that take minutes rather than hours.
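As a conceptual illustration (not any vendor's implementation), the sketch below models a deduplicated store that must rehydrate a backup from block hashes at restore time, alongside a landing-zone copy that is already in its full form; the block size and data are arbitrary.

```python
import hashlib

# Toy illustration of why restores differ between the two approaches.
# A deduplicated store keeps only unique blocks plus a recipe of block hashes;
# restoring means rehydrating the backup by reassembling every block.
# A landing zone keeps the most recent backup whole, so a restore is a straight copy.

BLOCK = 4096
unique_blocks = {}   # hash -> block bytes (the deduplicated store)

def dedup_store(backup: bytes):
    """Store a backup as a list of block hashes, keeping only unique blocks."""
    recipe = []
    for i in range(0, len(backup), BLOCK):
        block = backup[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        unique_blocks.setdefault(digest, block)
        recipe.append(digest)
    return recipe

def rehydrate(recipe):
    """Rebuild the full backup from its recipe - extra work done at restore time."""
    return b"".join(unique_blocks[digest] for digest in recipe)

backup = b"A" * 8192 + b"B" * 4096      # three blocks, two of them identical
recipe = dedup_store(backup)
landing_zone_copy = backup               # post-process: latest backup kept whole

assert rehydrate(recipe) == backup       # inline-style restore must reassemble blocks
assert landing_zone_copy == backup       # landing-zone restore is already in full form
print(len(unique_blocks), len(recipe))   # -> 2 unique blocks referenced 3 times
```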
Deduplication solutions fall into four categories, based on where deduplication is performed in the backup process as well as the underlying system architecture. Each of these four categories is detailed below, with the associated pros and cons.
Backup client or agent deduplication - deduplicates at the source so that only changed data moves over the network
Media server deduplication - software running on the media server or pre-loaded on a storage server, typically using 64KB or 128KB fixed block sizes
Target-side appliance that deduplicates inline, between the backup/media server and disk - block-level deduplication with a scale-up architecture
Target-side appliance that deduplicates after data lands to disk - zone-level, byte-level deduplication with a landing zone and a scale-out architecture
This guide explains the various backup complexities, enabling you to ask the right questions and make the right decision for your specific environment and requirements. Stay tuned for the next part of this guide, which will be live on ITProPortal shortly.