A backup and disaster recovery strategy is critical to the IT and business health of most organizations. The right strategy minimizes, and can even eliminate, unplanned downtime, which can derail employee productivity, reduce revenue and cause long-term customer ill will, in addition to potential regulatory compliance and legal issues. Unfortunately, natural disasters, human error and security breaches such as ransomware attacks are today’s reality. Thankfully, with the right backup and disaster recovery strategy in place, their consequences can be largely avoided. Unless, that is, the technology you are depending upon to serve as the backbone of your meticulously designed, deployed and tested strategy is faulty.
Where might that fault lie? The truth is that the hard drives inside a storage system from one trusted vendor may not be as reliable as those inside a system from another. This can be puzzling, since virtually all disk drives come from just a few manufacturers worldwide and are relatively comparable to each other. Yet there is clear variation in disk drive and storage system reliability. How can this be? The answer lies not with the disk vendors, but with the storage vendors’ approach to three critical areas:
- Product design
- Manufacturing processes
- QA testing
Storage systems frequently differ in each of these areas, and those differences affect reliability. Obsolete designs and processes can result in lower quality, along with needless storage-management costs, lost management time and business disruption.
Many storage vendors will point to their use of RAID (redundant array of independent disks) as the reason disk failure is a non-issue for their product. RAID combines multiple physical disk drives into one or more logical units (LUNs) for data redundancy, performance improvement or both. However, while RAID can be effective for data protection, it is not a cure-all: it is not suited to every environment and budget, and it carries downsides and risks of its own.
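To make the redundancy idea concrete, here is a minimal sketch of single-parity protection, the principle behind RAID levels such as RAID 5. It is an illustration only, not any vendor’s implementation; the function names `xor_parity` and `rebuild_missing` and the sample data are assumptions made up for this example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical stripe: three data blocks, one per drive, plus one parity block.
data = [b"disk", b"fail", b"safe"]
parity = xor_parity(data)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild_missing([data[0], data[2]], parity)
assert recovered == data[1]
```

The sketch also shows one of the limits mentioned above: with a single parity block, only one lost drive per stripe can be rebuilt. If a second drive fails before the rebuild completes, the XOR no longer has enough information to recover the data.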
Why Do Failures Happen?
What causes disk drives to underperform or downright fail? The top reasons are:
- Vibration: The heads that read and write data must be positioned with extreme accuracy. They hover only a few nanometers above the disk surface, a gap far smaller than the width of a human hair. In environments with vibration, it can be challenging, if not impossible, to position the heads properly, which can cause the drive to abort the operation. A wide range of environments can produce high-frequency vibration, from subways, airports, submarines and cruise ships to a NASA space shuttle.
- Temperature: Certain geographies, as well as some of the same types of less-than-ideal operational environments that cause vibration, can also feature excessive temperatures. Disk drives can have an increased rate of failure when an operation is attempted in extreme temperatures, as is the case with many electromechanical devices.
- Service interruptions: When service is disrupted—for example, because of routine maintenance or the need to replace system components—it can lead to malfunctioning disk drives.
- Interconnection failures: Various physical interconnect failures—from physical damage or failure of signal path components, to connection-path contamination—can result in the appearance of a missing disk.
- Protocol failures: Protocol failures can lead to data loss because of problems with input/output requests. A wide range of such errors can occur, whether they originate in data center switches or in protocol incompatibility between different manufacturers.
- Performance failures: Sector re-mapping is one example of an activity that can overload multiple disks with recovery work at the disk level, leading to performance issues.
- Defects and Damage: Sometimes a disk drive arrives from the manufacturer with defects from the manufacturing process itself. Or, delicate disk drives can be damaged during shipment if they are not adequately protected from knocks and drops.
While the list of reasons a disk drive can fail appears long, disk drive and storage vendors can take reasonable measures to help ensure reliability, and with it the reliability of your backup and DR. These include:
- Designing to Prevent Vibration: Disk drive reliability begins at the engineering and design stage. A data storage system needs an anti-vibration design for optimal performance. When hard drives are placed near each other, the vibration from each drive can disrupt its neighbors’ ability to read and write. One solution is to position drives back-to-back, which helps control high-frequency vibration. Another is to build a chassis more rigid than the standard steel construction of many storage systems: increasing stiffness and mass through the use of aluminum allows vibration to be absorbed more easily.
- Creating a Cooling System: Since high temperatures increase the chance of drive failure, it’s critical for storage vendors to design and test an advanced cooling system that allows optimum airflow from front to back. A drive’s electronics are its “hot spot,” so when cooling mechanisms are paired with back-to-back positioning of the drives, the result is a cooling channel that delivers lower temperatures where they are needed. Effective cooling systems also monitor the temperature of the drives and other system components on an ongoing basis and adjust fan speed automatically. Beyond these design considerations, the next step should be rigorous testing of both temperature and airflow.
- Enabling Active Updates: Suffering downtime when system components require replacement can result in business grinding to a halt. To circumvent this problem, a storage system should be designed to allow “active updates”—meaning the IT administrator can replace components as the system continues humming along.
- Qualifying Software and Controlling Production Processes: To avoid protocol failures, it’s important to stick with a carefully managed revision process. This can be achieved by validating each drive and refusing to accept new software levels until they have passed a rigorous process that qualifies them with each disk.
- Testing the Drives: The best way to root out problematic drives from the get-go is to require testing as part and parcel of each storage system’s production process. If the test cycle flags too many sectors that need correction, that disk does not get qualified.
- Protecting Disks During Shipping/Handling: Intensive packing and handling procedures should be deployed to avoid damage that can occur during shipping—and can go undetected once received. A best practice is to use special shipping containers to ship hard drives outside the chassis.
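Two of the measures above, automatic fan-speed adjustment and qualification testing, come down to simple threshold logic. The sketch below illustrates that logic only; every name and number in it (`fan_duty_percent`, `chassis_fan_duty`, `drive_qualifies`, the 30 °C and 50 °C ramp points, the limit of 10 corrected sectors) is a hypothetical value chosen for this example, not a vendor specification.

```python
def fan_duty_percent(temp_c, low_c=30.0, high_c=50.0,
                     min_duty=30.0, max_duty=100.0):
    """Map a temperature to a fan duty cycle with a clamped linear ramp.

    All default thresholds are illustrative assumptions.
    """
    if high_c <= low_c:
        raise ValueError("high_c must be greater than low_c")
    if temp_c <= low_c:
        return min_duty
    if temp_c >= high_c:
        return max_duty
    fraction = (temp_c - low_c) / (high_c - low_c)
    return min_duty + fraction * (max_duty - min_duty)


def chassis_fan_duty(sensor_temps_c):
    """Set the fans from the hottest reading across drives and components."""
    return fan_duty_percent(max(sensor_temps_c))


def drive_qualifies(corrected_sectors, max_corrected=10):
    """Reject a drive whose test cycle flagged too many corrected sectors.

    The limit of 10 is a placeholder; a real limit would come from the
    storage vendor's own qualification criteria.
    """
    return corrected_sectors <= max_corrected


# Hypothetical readings from two drives and one controller sensor.
duty = chassis_fan_duty([35.0, 45.0, 40.0])

# Hypothetical test-cycle results: serial number -> corrected sector count.
results = {"drive-a": 3, "drive-b": 27}
qualified = [serial for serial, count in results.items()
             if drive_qualifies(count)]
```

Driving the fans from the hottest sensor rather than an average is a deliberately conservative choice in this sketch, since a single overheated drive is enough to raise the risk of failure.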
An efficient and effective backup and disaster recovery strategy is the direct result of the painstaking efforts made by those responsible for its implementation and management. Likewise, disk drive reliability doesn’t just happen—it’s the result of intentional planning at the design, manufacturing and testing stages. When a storage system embodies the best practices in this trio of critical areas, the result will be efficiency, quality, and reliability that you can depend on.
I started my career in the storage industry in the mid-’80s as a cleanroom production engineer making “Winchester” 5 ¼” disk drives for IBM. The first disk I helped produce had the then jaw-dropping capacity of 10MB! After multiple roles at IBM, I became the engineering manager for the resultant storage startup, Xyratex. As Xyratex’s business grew I became VP of Server and Enclosure Development, growing the engineering team to 150 engineers and project managers in the UK, USA, India and Malaysia. We helped deliver five generations of industry-leading hardware and software products. I joined Nexsan in 2014 as the Director of Product Development for the E-Series and BEAST high-density storage products. As the block storage Product Manager I work closely with sales, marketing and our channel partners to ensure that the roadmap of new features and functions meets end-user needs. I am now also focusing on our next generation of products, which will provide even greater performance and enterprise functionality whilst maintaining our industry-leading cost benefits.