Five Data Deduplication Best Practices for Optimal Data Storage Management


In most environments, duplicated data is pervasive. That’s because test and development data multiplies across an organization over time. Replication, backup, and data archiving create multiple data copies scattered across the enterprise, and users often copy data to multiple locations for their own convenience. Today, studies estimate that multiple copies of data require companies to buy, use, and administer two to fifty times more storage than they’d need with data deduplication.

There are essentially two ways to reduce the cost of your data storage. First, you can try to leverage a lower-cost storage platform, which results in a different set of problems. Your other option is to leverage data deduplication to reduce your data growth and total required storage.

Here are five data deduplication best practices to keep in mind as you develop your strategy:

1. Consider the Broad Implications of Deduplication

As with disk-to-disk backup or server virtualization, you don’t want to evaluate deduplication as an isolated product or feature. Instead, consider the broader implications of deduplication within the context of your entire data management and storage strategy.

For example, deduplication can be performed at the file, block, and byte levels. Consider the tradeoffs for each method, which include computational time, accuracy, level of duplication detected, index size, and in some cases, the scalability of the solution.
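As a rough illustration of the block-level method, deduplication can be sketched as splitting data into chunks and storing each unique chunk only once, keyed by a content hash. This is a simplified model with hypothetical fixed-size blocks; production systems typically use variable-size chunking and far more sophisticated indexing:

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, keeping one copy per unique block.
    Returns (store, recipe): unique blocks keyed by SHA-256 digest, plus the
    ordered list of digests needed to reconstruct the original stream."""
    store = {}    # digest -> block bytes (each unique block stored once)
    recipe = []   # ordered digests to rebuild the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the block store and recipe."""
    return b"".join(store[h] for h in recipe)

# Repeated content: 4 logical blocks, but only 2 unique blocks stored.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = dedupe_blocks(data)
```

The tradeoffs mentioned above show up even in this toy: smaller blocks detect more duplication but enlarge the index (`recipe` and the hash table), while larger blocks shrink the index but miss fine-grained duplicates.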

Also, consider how you can use deduplication to eliminate tape where it makes sense in your environment. That might be remote offices or any locations where your company doesn’t have trained IT personnel.

2. Learn What Data Does Not Dedupe Well

In the simplest terms, data created by humans—documents, transactions, and email, for example—dedupes well in most dedupe systems. Photos, audio, video, imaging, or data created by computers generally don’t dedupe well, so you should store these sets of data on non-deduped storage. Learn what data does not dedupe well in your particular environment. For some situations, you might consider a deduplication solution that can selectively avoid certain sets of data.
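You can see why in a quick experiment (hypothetical data, fixed-size blocks for simplicity): human-created data is full of repeated boilerplate, while media files are already compressed and look effectively random, so almost every block is unique:

```python
import hashlib
import os

def unique_block_ratio(data: bytes, block_size: int = 4096) -> float:
    """Fraction of blocks that are unique; lower means better dedupe potential."""
    hashes = [hashlib.sha256(data[i:i + block_size]).hexdigest()
              for i in range(0, len(data), block_size)]
    return len(set(hashes)) / len(hashes)

# Document-style data with repeated boilerplate dedupes well...
doc = (b"Quarterly report boilerplate paragraph.\n" * 1000) * 4
# ...while media-like (already compressed, effectively random) data does not.
media = os.urandom(len(doc))

print(unique_block_ratio(doc))    # well below 1.0: many duplicate blocks
print(unique_block_ratio(media))  # effectively 1.0: every block unique
```

Running a check like this against samples of your own data sets is one cheap way to learn what does and does not dedupe well in your particular environment.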

3. Don’t Obsess Over Space Reduction Ratios

The length of time that data is retained directly affects data deduplication ratios: the more data that is examined when deduplicating new data, the more likely you are to find duplicates and increase space savings.

While you should closely examine this ratio when you’re comparing multiple products, try not to overanalyze it once your system is up and running. Rather than performing more frequent full backups just to get a better data deduplication ratio, consider increasing your backup retention period for your on-disk data store. Once you have your first set of backups on disk, adding additional backups to that same deduped system will take up less space than sending them to tape.
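A back-of-the-envelope model (all numbers hypothetical) shows why retention drives the ratio: each additional full backup retained on a deduplicated store adds only its changed fraction of new physical data, so the ratio climbs as retention grows:

```python
def dedupe_ratio(full_backup_gb: float, backups_retained: int,
                 change_rate: float = 0.05) -> float:
    """Hypothetical model: the first backup is stored in full; each later
    backup adds only its changed fraction (here 5%) of new physical data."""
    logical = full_backup_gb * backups_retained
    physical = full_backup_gb * (1 + change_rate * (backups_retained - 1))
    return logical / physical

# Ratio grows with the number of backups retained on disk.
for n in (1, 4, 12, 30):
    print(n, round(dedupe_ratio(1000.0, n), 1))
```

The exact numbers depend entirely on your change rate and data mix, which is another reason not to fixate on a single headline ratio.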

4. Don’t Use Multiplexing if You’re Backing up to a VTL

If you’re backing up to a virtual tape library (VTL), don’t use multiplexing. Even if your deduplication solution can de-multiplex data, consider turning this feature off. Often a carryover practice from writing to physical tapes, multiplexing data merely wastes computing cycles—cycles that could otherwise be used to dedupe your data faster. For example, instead of multiplexing ten backups to two virtual tape drives, create twenty virtual tape drives and turn off multiplexing.

5. Pilot Multiple Systems Before Selecting Your Solution

Before selecting your deduplication solution, try to pilot several deduplication systems in your environment. While current vendors offer many good solutions and various deduplication approaches, you may also find some products with real limitations. Only by comparing multiple products can you best determine the optimum approach for deduping your data, whether it’s inline, post-process, target-side, client-side, via backup software, etc.

Common challenges of deploying a deduplication solution involve problems related to performance, increased complexity of management, and proliferation of deduplicated data silos. To avoid unnecessary complications, first ensure ease of integration into your existing environment and get customer references in your industry. Take time to understand the vendor’s roadmap, but test everything. Once you’ve selected your data deduplication solution, follow the best practices suggested by your deduplication solution vendor.

When evaluating deduplication solutions, look for the following essential features:

  • Ability to scale without expensive hardware upgrades
  • More recovery points with shorter recovery times
  • Point-and-click deduplication management
  • Built-in reporting of deduplication across vendors, data types, sources, and platforms
  • Tight integration with all necessary applications to minimize end-user downtime
  • Single solution simplicity for ease of deployment and administration
  • Ability to rapidly and securely recover business-critical data across all locations, applications, storage media and points-in-time
  • D2D2T-optimized for backup performance and reliable data recovery
  • Fast, comprehensive search to aid in recovery
  • Data integrity and security features
  • Built-in Disaster Recovery capabilities
  • Data classification
  • Cost-effective and timely eDiscovery
  • A common technology platform
  • Single point of management

Armed with these five deduplication best practices, you can markedly improve your data storage management efficiency.

About the Author

Dustin Smith, Chief Technologist

Throughout his twenty-five-year career, Dustin Smith has specialized in designing enterprise architectural solutions. As the Chief Technologist at ASG, Dustin uses his advanced understanding of cloud compute models to help customers develop and align their cloud strategies with their business objectives. His master-level engineering knowledge spans storage, systems, and networking.