Disaster Recovery Implementation: Four Key Steps to Success
This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, SIOS Technology Solutions Architect Ian Allton outlines the four keys to disaster recovery implementation, as well as seven steps to proper business continuity planning.
Establishing a disaster recovery plan can be a stressful task for IT and database administrators. There are countless options available and many things to consider, yet without a solid plan, the organization risks losing valuable data. This article offers practical guidance for anyone tasked with establishing business continuity (BC) and disaster recovery (DR) plans.
Business Continuity Plan
No business continuity or disaster recovery plan can tackle every possible event or set of circumstances and, for that reason, both business continuity and disaster recovery should evolve continuously. The following steps are helpful for creating and strengthening your business continuity plans. Keep in mind that the business continuity plan will be the foundation for disaster recovery implementation.
- Step 1: Prepare before planning – Gather information about key personnel, customers, facilities, operating procedures, and so on. If the business depends on something for any critical operation, include it.
- Step 2: Define objectives – Base the plan’s objectives on an assessment of possible disruptions, and align them with the company’s core mission.
- Step 3: Identify potential threats – Determine priorities and estimate the potential duration of likely threats based on the organization’s locations and circumstances.
- Step 4: Develop business continuity strategies – A BC plan should always include ways to minimize business impact before, during, and after recovery from a disruption.
- Step 5: Secure teams and tasks – Establish a line of succession with alternate members or teams should the primary ones be unavailable.
- Step 6: Test the plan – Use scheduled power outages or major upgrades as a chance to test the plan. Some tests can also be unannounced.
- Step 7: Enhance the plan – Adjust, update, or otherwise maintain the plan based on what you learned during the tests and actual disruptions.
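Steps 3 and 5 lend themselves to a small amount of structure. The sketch below is a minimal illustration, assuming a common risk-matrix convention (priority = likelihood × impact on hypothetical 1 to 5 scales) that the article itself does not prescribe; the `Threat` class, `prioritize`, and `next_in_succession` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Threat:
    name: str
    likelihood: int            # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int                # hypothetical scale: 1 (minor) to 5 (severe)
    est_duration_hours: float  # estimated outage duration if the threat occurs

    @property
    def priority(self) -> int:
        # Simple risk-matrix score: likelihood x impact
        return self.likelihood * self.impact


def prioritize(threats: Iterable[Threat]) -> List[Threat]:
    """Order threats by score, highest first; longer estimated outages break ties."""
    return sorted(threats, key=lambda t: (t.priority, t.est_duration_hours), reverse=True)


def next_in_succession(succession: List[str], available: Set[str]) -> str:
    """Return the first member of the line of succession who is available (Step 5)."""
    for member in succession:
        if member in available:
            return member
    raise LookupError("no one in the line of succession is available")
```

For example, a regional power outage scored 3 × 4 = 12 would rank ahead of ransomware scored 2 × 5 = 10, and if the primary DBA is unreachable, `next_in_succession` hands the task to the first listed alternate who is available. The scores are only as good as the estimates fed into them, which is one reason Step 7 revisits the plan after every test.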
Disaster Recovery Plan
The disaster recovery planner should recognize the distinction between failures and disasters when evaluating the different solutions needed for high availability (HA) and disaster recovery. A key distinction is where the redundant resources are located and whether you want to fail over operation to them or simply keep a copy of the data on them (replication). You can recover from a failure by using clustering software to fail over application operation from a primary server node to a secondary server node over a local area network (LAN). Recovering from a disaster, on the other hand, requires greater geographic separation, typically over a wide area network (WAN).
Advanced failover clustering environments combine failover and replication: application operation moves to a remote node, where it continues running against up-to-date data that has been replicated to that geographically distant node.
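The failure-versus-disaster distinction can be expressed as a target-selection rule. The following is a minimal sketch under stated assumptions, not any clustering product's actual logic: `Node`, its `site` label, and `choose_failover_target` are hypothetical, and a real cluster would also weigh replication state and quorum before choosing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    site: str      # hypothetical site label, e.g. a data center or cloud region
    healthy: bool


def choose_failover_target(primary: Node, standbys: List[Node]) -> Optional[Node]:
    """Pick where to move the application after the primary node goes down.

    A healthy standby at the primary's own site is preferred (a local failure,
    recovered over the LAN). If none exists, a healthy standby at another site
    is used (a disaster, recovered over the WAN from replicated data).
    Returns None when no healthy standby remains.
    """
    healthy = [n for n in standbys if n.healthy]
    local = [n for n in healthy if n.site == primary.site]
    if local:
        return local[0]
    remote = [n for n in healthy if n.site != primary.site]
    return remote[0] if remote else None
```

Preferring the local standby reflects the point above: a single-node failure can be handled over the LAN, while the remote node is reserved for events that take out the whole primary site.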
These differences lead to different Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for high availability and disaster recovery. RTO is the maximum tolerable duration of an outage. RPO is the maximum amount of data, usually expressed as a window of time before the outage, that you can afford to lose. Critical applications typically have RTOs of a few seconds and an RPO of zero.
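Checking a design against these objectives comes down to two comparisons: recovery time against RTO, and the replication lag at the moment of failure against RPO. The sketch below assumes both are measured in seconds; `meets_objectives` and `RecoveryCheck` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCheck:
    rto_met: bool
    rpo_met: bool


def meets_objectives(
    failover_seconds: float,
    replication_lag_seconds: float,
    rto_seconds: float,
    rpo_seconds: float,
) -> RecoveryCheck:
    """Compare a measured (or estimated) recovery against RTO/RPO targets.

    failover_seconds: time from the start of the outage until service is restored.
    replication_lag_seconds: how far the standby's data trailed the primary when
        the outage began; writes in that window are lost. Synchronous
        replication corresponds to a lag of 0.
    """
    return RecoveryCheck(
        rto_met=failover_seconds <= rto_seconds,
        rpo_met=replication_lag_seconds <= rpo_seconds,
    )
```

With illustrative targets of a 10-second RTO and a zero RPO, an 8-second failover meets the RTO, but any nonzero replication lag misses the RPO. This is why an RPO of zero generally implies synchronous replication.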
Disaster Recovery Options
With these factors in mind, IT and database administrators can choose from a wide range of disaster recovery solutions for their applications.
While HA and DR are different, it is possible and preferable to add disaster recovery to an existing high availability configuration. There are two popular options for combining high availability and disaster recovery solutions for SQL Server: SQL Server’s own Always On Availability Groups feature and purpose-built failover clustering software.
Always On Availability Groups (AOAG) in SQL Server delivers rapid, automatic failover for HA and protects against widespread disasters with minimal or no data loss. However, full-featured AOAG requires the more expensive SQL Server Enterprise Edition; Standard Edition supports only Basic Availability Groups, which are limited to two replicas and a single database per group. AOAG also does not protect applications other than SQL Server.
Purpose-built failover clustering solutions support virtually all applications running on Windows Server and Linux in public, private, and hybrid clouds. Implemented entirely in software, they include real-time data replication, continuous monitoring for detecting failures, and customizable policies for failover and failback. Advanced clustering solutions include application-specific recovery kits designed to simplify configuration and ensure reliable failovers that follow application best practices.
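Continuous monitoring and failover/failback policies are often built on heartbeats. The following is a minimal sketch of that general idea, not how any particular clustering product works: the heartbeat interval, the `max_missed` threshold, `HeartbeatMonitor`, and `plan_action` are all hypothetical.

```python
from typing import Collection, Dict, List


class HeartbeatMonitor:
    """Declare a node failed once it has been silent for more than
    `max_missed` heartbeat intervals."""

    def __init__(self, interval_s: float, max_missed: int) -> None:
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this node
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        # Silence longer than interval * max_missed counts as a failure
        timeout = self.interval_s * self.max_missed
        return sorted(n for n, t in self.last_seen.items() if now - t > timeout)


def plan_action(active: str, preferred: str, failed: Collection[str], auto_failback: bool) -> str:
    """Hypothetical policy: fail over when the active node fails; optionally
    fail back to the preferred node once it is healthy again."""
    if active in failed:
        return "failover"
    if auto_failback and active != preferred and preferred not in failed:
        return "failback"
    return "none"
```

Making failback a separate, switchable policy mirrors the "customizable policies" described above: some administrators want the application returned to the preferred node automatically, while others prefer to fail back manually during a maintenance window.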
Adding Disaster Recovery to a High Availability Failover Cluster
Any application that requires high availability usually also needs the ability to recover from a widespread disaster. One way to provide it is to add disaster recovery protection to an existing HA failover cluster. A combined solution is easier to manage and test, and it simplifies routine hardware and software upgrades.
The cluster can include two SQL Server nodes in two different availability zones of a public cloud, or one node on premises and a second in the cloud. Local storage data is replicated from the primary node to the secondary, geographically distant node.
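Whether replication to the distant node is synchronous or asynchronous determines what survives a site loss. The toy model below is a sketch assuming simple ordered block writes. `ReplicatedVolume`, `ship`, and `fail_over` are hypothetical names, and real replication engines handle ordering, batching, and resynchronization far more carefully.

```python
from typing import List, Tuple


class ReplicatedVolume:
    """Toy model of block replication from a primary to a remote secondary."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: List[str] = []
        self.secondary: List[str] = []
        self._pending: List[str] = []  # written locally but not yet shipped

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            # The write is acknowledged only after the remote copy exists
            self.secondary.append(block)
        else:
            # The write is acknowledged immediately; shipping happens later
            self._pending.append(block)

    def ship(self, n: int) -> None:
        """Send up to n queued writes, oldest first, to the secondary."""
        batch = self._pending[:n]
        self.secondary.extend(batch)
        del self._pending[:len(batch)]

    def fail_over(self) -> Tuple[List[str], List[str]]:
        """Simulate losing the primary site: return (survived, lost) writes."""
        lost = list(self._pending)
        self._pending.clear()
        return list(self.secondary), lost
```

In asynchronous mode, three writes with only one shipped leave two writes lost at failover; in synchronous mode, nothing is lost, at the cost of waiting on the WAN round trip for every write.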
A purpose-built failover clustering solution adds its own cost, but that cost is often outweighed by what it avoids: the cost of downtime, the premium for SQL Server Enterprise Edition, and the labor of maintaining a complex configuration.
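The cost argument reduces to simple arithmetic: add up the costs avoided and subtract the cost of the clustering solution. The sketch below is only a back-of-the-envelope model. `net_annual_benefit` is a hypothetical helper, and every input is a figure you would supply for your own environment. None of them come from the article or from any vendor price list.

```python
def net_annual_benefit(
    clustering_cost: float,
    enterprise_premium_avoided: float,
    downtime_hours_avoided: float,
    downtime_cost_per_hour: float,
    maintenance_hours_saved: float,
    labor_cost_per_hour: float,
) -> float:
    """Return avoided costs minus the clustering solution's cost, per year.

    A positive result means the avoided costs outweigh the added cost.
    """
    avoided = (
        enterprise_premium_avoided
        + downtime_hours_avoided * downtime_cost_per_hour
        + maintenance_hours_saved * labor_cost_per_hour
    )
    return avoided - clustering_cost
```

With purely illustrative inputs of a 10,000 clustering cost, a 20,000 Enterprise premium avoided, 2 hours of downtime avoided at 5,000 per hour, and 40 maintenance hours saved at 100 per hour, the function returns 24,000. The result is very sensitive to the downtime estimate, so that figure deserves the most scrutiny.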