The Ultimate DRaaS (Disaster Recovery) Guide
Disaster recovery has been a critical element of most enterprise IT plans for decades, but differences in resources and planning depth have left some organizations better prepared for disasters than others. Historically, larger enterprises have gone as far as completely replicating their core data center infrastructure and running active/active or active/passive redundancy to a second, equally capable data center. This strategy, while ultimately effective, is extremely expensive and in many cases not necessary.
Outsourcing disaster recovery and subscribing to it “as a service” (hence, DRaaS) is still a relatively new concept, but one that is gaining traction quickly due to its flexible, adaptable and turnkey nature. DRaaS does not require an organization to stand up a second data center or even buy physical servers, as all of this infrastructure is managed by the DR provider.
Today’s post is going to examine the critical elements of successful DR outsourcing, identify a few of the major players, and call out common pitfalls in deployment.
What is disaster recovery anyway?
In the same way the physical world is prone to natural disaster, so is the digital world. An IT disaster can result from something going wrong in the digital domain, and physical and natural disasters often impact digital assets as well. Beyond human capital, an organization’s IT infrastructure and applications often house its most valuable assets. Network downtime costs are measured in thousands, tens of thousands or even hundreds of thousands of dollars per MINUTE; a hardware failure, fiber cut, power outage, flood, hurricane, or even cyber-attack has the potential to cause massive damage to an enterprise’s bottom line and its ability to operate and effectively serve customers.
IT leaders today are expected to provide 100% availability of all IT resources and anything less is often perceived as a failure - unfortunately regardless of circumstance. At the same time, IT leaders are constantly under a “do more with less” mandate that often requires hard choices and ultimately tough compromises to be made.
Disaster recovery involves a combination of software and virtual or physical hardware that can provide access to IT resources and applications in the event that an organization's primary data processing center becomes unavailable.
DR is much more than just a backup or replication service as it provides all of the necessary infrastructure, configurations, and ultimately functionality in order to completely replace the primary processing facility in the event of a disaster. When outsourcing this task to a third-party provider, you get DRaaS.
Diagram illustrating DRaaS ability to automate both failover and failback.
What makes DRaaS providers so much more efficient?
...why can they do this cheaper than I can?
There are two primary reasons. The first is that DRaaS providers do not run active/active services for their end users, so unless you are actually in the midst of a disaster event, you are not utilizing any of their computing resources.
The second is great software, which dynamically makes dormant computing resources available to customers who actually are experiencing some sort of disaster event. This shared infrastructure and elastic computing power, paired with the ability to virtualize multiple servers that share resources (even between customers), allows the DRaaS provider to serve many customers on demand with less physical infrastructure.
What is critical to understand and evaluate is what a worst case scenario might look like to the provider itself: can they handle a major disaster that would dramatically scale their need for underlying resources? Are their customers all in the same geography? How is their infrastructure distributed geographically? These are all examples of good questions to talk through with providers being considered.
How best to start a DRaaS project?
The first step is to identify threats. Some threats cannot be anticipated, but if your company operates in South Florida, for example, hurricanes are something you need to think about. When planning your DR strategy, choosing a DRaaS provider whose infrastructure sits a few hundred miles away in Orlando, or even on the other side of the state in Tampa, is probably not a smart decision. Surprisingly, many companies were previously forced to keep their DR sites relatively close by, as an IT person would likely need to be on site at the backup facility to facilitate data restoration or resource access. Thankfully, an outsourced DR provider removes this need.
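As a rough way to sanity-check site separation, the sketch below computes the great-circle distance between a primary site and a candidate DR site using the haversine formula. The coordinates are approximate city centers and the 500-mile minimum is purely an illustrative assumption, not an industry rule; distance alone also does not prove that two sites face independent hazards.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Approximate city-center coordinates (latitude, longitude), for illustration only.
MIAMI = (25.76, -80.19)
ORLANDO = (28.54, -81.38)
TAMPA = (27.95, -82.46)


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def far_enough(primary, dr_site, min_miles):
    """True when the DR site is at least min_miles from the primary site."""
    return great_circle_miles(*primary, *dr_site) >= min_miles


if __name__ == "__main__":
    MIN_SEPARATION_MILES = 500  # hypothetical policy threshold
    for name, site in (("Orlando", ORLANDO), ("Tampa", TAMPA)):
        miles = great_circle_miles(*MIAMI, *site)
        ok = far_enough(MIAMI, site, MIN_SEPARATION_MILES)
        print(f"Miami -> {name}: {miles:.0f} miles, acceptable: {ok}")
```

A real assessment would also look at shared risk factors such as the same power grid, flood plain, or storm track rather than mileage alone.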
Second, understand business needs in the event of a failure. This can be tricky and often goes beyond a typical IT exercise. Business leaders need to be challenged to assess the impact of losing access to systems and to decide what reasonable goals for post-disaster restoration should be. Many companies already adopt public cloud SaaS solutions like Salesforce and G-Suite, where data and application access is already resilient without any extra work. Outsourcing to SaaS providers is a disaster recovery solution in itself, but beware that even these tools allow admin-level mass overwrites and can lose data through user error. A proper DRaaS can mitigate these sorts of losses. Some organizations may just need data backup, or (you guessed it) BaaS, which keeps critical data replicated, or at least routinely copied and available offsite, without providing application access to end users. Maybe your company can operate with just BaaS, or maybe not.
At the end of the day, comparing feature functionality between DRaaS solutions often comes down to a few core factors:
Backup processes and capabilities: For data, applications and virtual machines; how often data will be backed up, how and where it will be stored, and what else, if anything, is needed to restore it.
Recovery options: Identifying a failure, identifying a restore point, identifying the related applications and infrastructure and then initializing the actual failover process.
Protection for various disaster scenarios: An ability to recover from both foreseeable and unforeseeable disasters, both natural and ones native to the digital domain (power failure, flood, hardware failure, cyber attack, etc.).
Ability to meet your specific continuity objectives: This should be defined by the business with help from IT and is measured by a recovery point objective (RPO) and recovery time objective (RTO).
Cost and complexity of deployment and management: This ultimately boils down to how much of the DR solution the IT organization wants to be responsible for and control. This can range from comprehensive responsibility for setup, management and actual usage during a disaster event to completely outsourcing it to a third party.
Once you have a scope, choose a flavor
Like most SaaS solutions, DRaaS comes in a few different flavors depending on user need. The first is public cloud DRaaS, and these tools are available to opt into via AWS, Azure, and GCP. These are software utilities that allow IT administrators to set up and manage their own DR environments within their existing public cloud.
The second type is private cloud-based DRaaS, which is conceptually similar to the public cloud tools but is designed to be implemented in private cloud environments.
The third, and we believe most appealing, flavor is a fully outsourced infrastructure AND management DRaaS solution from a company that specializes in offering these services. The options here are endless and vary from household names like IBM, to regional cloud hosting providers.
Zerto’s cloud based platform works across public, private and hybrid clouds.
Defining RTO and RPO - the underlying business goals behind the SLA
Setting recovery time objectives (RTO) and recovery point objectives (RPO) is part of the business decision that IT leaders need to define. In both cases, lower numbers are better, but are also more costly. This is the area of DRaaS planning where you identify compromises and align them to commercial expectations and budget.
RTO is defined as the maximum amount of time that an application may be unavailable to an organization during an outage. In practice, it is how quickly your DRaaS provider can spin up your infrastructure and have it take over the needs of the production environment. The target should be based on company needs. (Best in class solutions currently measure this in minutes.)
RPO is the maximum period of data loss that a company is willing to accept during a disaster event. This often correlates to RTO, but often does not match it exactly. This too has commercial implications, as a DRaaS provider running backups every 30 minutes uses fewer resources than one providing constant real-time replication of data. (Best in class solutions currently measure this in seconds.)
Companies will need to carefully assess their RTO and RPO goals and their cost implications, eventually choosing a plan that matches company needs without undermining the benefits of DRaaS.
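To make the trade-off concrete, here is a minimal sketch that compares candidate plans against business objectives and estimates worst-case downtime exposure. Every figure in it (the plan names, RTO/RPO values, monthly fees, and the $5,000 per-minute downtime cost) is a hypothetical placeholder; substitute numbers from your own business impact analysis and provider quotes.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    """A candidate DRaaS plan (all values are illustrative assumptions)."""
    name: str
    rto_minutes: float   # maximum time to restore service
    rpo_minutes: float   # maximum window of data loss
    monthly_cost: float  # provider fee in dollars


def worst_case_downtime_cost(plan: DrPlan, cost_per_minute: float) -> float:
    """Downtime cost if a recovery takes the plan's full RTO."""
    return plan.rto_minutes * cost_per_minute


def meets_objectives(plan: DrPlan, max_rto: float, max_rpo: float) -> bool:
    """True when the plan satisfies both business objectives (in minutes)."""
    return plan.rto_minutes <= max_rto and plan.rpo_minutes <= max_rpo


if __name__ == "__main__":
    plans = [
        DrPlan("periodic-backup", rto_minutes=240, rpo_minutes=30, monthly_cost=2_000),
        DrPlan("continuous-replication", rto_minutes=15, rpo_minutes=1, monthly_cost=9_000),
    ]
    for plan in plans:
        exposure = worst_case_downtime_cost(plan, cost_per_minute=5_000)
        ok = meets_objectives(plan, max_rto=60, max_rpo=15)
        print(f"{plan.name}: ${plan.monthly_cost:,.0f}/month, "
              f"exposure per event ${exposure:,.0f}, meets objectives: {ok}")
```

Laying the monthly fee next to the per-event exposure is one simple way to frame the conversation with business leaders about how low the objectives really need to be.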
Actually implementing it!
After business discussions have taken place and an organization has identified its RTO and RPO, there often tends to be an urgency to sprint and get the new DR solution in place. This doesn't necessarily have to be the case: organizations can move slowly and migrate critical applications one at a time, while testing and vetting their DRaaS provider in a more phased manner. This is why planning and executing with plenty of time is often critical to the project's success. Starting a DR project from scratch in June to try and beat the season's first hurricane is less than ideal!
As with most IT projects, the heaviest lift usually comes at the beginning of the project, with the initial replication of data from the primary production environment to the DRaaS facility. This process can take a long time and overwhelm your WAN connections. There are several ways to mitigate this, including replicating only during off-peak hours and waiting days or weeks to get everything transferred, or in many cases copying the data onto physical drives that reside on the LAN and mailing those to your DRaaS provider, who then loads the data into its facilities.
Ultimately, the amount of data that needs to be transferred and how robust the WAN link(s) are in your datacenter will help drive the decision on which strategy is best.
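A quick back-of-the-envelope calculation usually settles this. The sketch below estimates how many days an initial seed would take over the WAN and compares it with shipping drives. The 80% usable-throughput factor, the off-peak window, and the shipping turnaround are assumptions for illustration; real throughput depends on protocol overhead, compression, and competing traffic.

```python
def transfer_days(data_tb: float, link_mbps: float,
                  utilization: float = 0.8, hours_per_day: float = 24.0) -> float:
    """Days needed to move data_tb (decimal terabytes) over a link_mbps link.

    utilization is the assumed fraction of the link usable for replication;
    hours_per_day limits transfers to a daily window (e.g. off-peak only).
    """
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * utilization
    seconds = bits / usable_bits_per_second
    return seconds / (hours_per_day * 3600)


def seeding_method(data_tb: float, link_mbps: float, shipping_days: float,
                   utilization: float = 0.8, hours_per_day: float = 24.0) -> str:
    """Pick whichever approach finishes sooner under the stated assumptions."""
    wan_days = transfer_days(data_tb, link_mbps, utilization, hours_per_day)
    return "wan" if wan_days <= shipping_days else "ship drives"


if __name__ == "__main__":
    # Hypothetical environment: 50 TB to seed, 1 Gbps link, 7-day shipping turnaround.
    print(f"Around the clock: {transfer_days(50, 1000):.1f} days")
    print(f"Off-peak only (8 h/day): {transfer_days(50, 1000, hours_per_day=8):.1f} days")
    print("Suggested method:", seeding_method(50, 1000, shipping_days=7, hours_per_day=8))
```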
Managing, testing and proving solution readiness and efficacy
Common sense would tell you that you do not want to have the first test of your DRaaS solution come during a time where you are actually in a disaster scenario.
Regardless of how well prepared you are for an outage or unforeseen event, these times come with extremely high levels of stress and anxiety and true preparedness comes only with having gone through the motions with your provider in advance so you know exactly what to expect. As with everything in IT, a single test is never adequate, and actual real world testing of all backup systems needs to be written into the disaster recovery plan and performed periodically.
While DRaaS solutions, especially fully managed ones, are cost effective, relatively easy to set up, test, and manage, there are some gotchas that IT professionals need to be aware of.
The first gotcha is network load. We already touched on it when discussing the initial heavy lift of moving data from the production environment to the DR environment, and while the initial replication of the data can be the biggest challenge, there is also going to be an ongoing increased load on the supporting network infrastructure in order to move data back and forth between the primary and DR site. You need to ensure that you have robust bandwidth for all of the new egress traffic and that latency and packet loss between the two sites are both acceptably low. While these requirements are not quite as sensitive as they’d be for a VoIP call or other real-time communications, there can be problems if your WAN link is not up to snuff.
It is also critical to understand what data is actually leaving your private facility and moving to an external site. Compliance regimes like PCI and HIPAA, or others that surround PII data, have to be taken into account, and your provider of choice needs to be capable of storing sensitive information in a way that meets whatever compliance requirements are relevant. There can also be complications when moving data across state lines or into foreign countries. California, for instance, is the first state to introduce regulations around PII data that are similar to what the EU has done with GDPR. Failing to meet these requirements can have wide-ranging consequences, from negative PR to fines or even criminal sanctions.
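For the ongoing replication load, a simple estimate is to convert your daily data change rate into an average bandwidth requirement and check it against the headroom on your link. The sketch below does that; the 20% protocol overhead, the change rate, and the link figures are assumptions for illustration, and averages hide bursts, so treat the result as a floor rather than a guarantee.

```python
def required_mbps(daily_change_gb: float, replication_hours: float = 24.0,
                  overhead: float = 1.2) -> float:
    """Average Mbps needed to replicate one day's changed data.

    daily_change_gb is in decimal gigabytes; overhead is an assumed
    multiplier for protocol and retransmission overhead.
    """
    bits = daily_change_gb * 1e9 * 8 * overhead
    return bits / (replication_hours * 3600) / 1e6


def fits_link(link_mbps: float, current_peak_mbps: float, needed_mbps: float) -> bool:
    """True when replication fits in the link on top of existing peak traffic."""
    return current_peak_mbps + needed_mbps <= link_mbps


if __name__ == "__main__":
    # Hypothetical site: 500 GB of changed data per day on a 200 Mbps link
    # that already peaks at 120 Mbps.
    needed = required_mbps(500)
    print(f"Replication needs about {needed:.1f} Mbps on average")
    print("Fits existing link:", fits_link(200, 120, needed))
```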
We have covered testing already, but it deserves to be raised again, as lack of it is one of the most common gotchas that can arise when activating a DR plan. It is easy to test once, get the intended result and not test again until something unanticipated actually arises. Testing needs to be carefully orchestrated and scheduled often. It includes both failing over to the DR site and failing back. RTOs and RPOs should be measured, and tests should ideally be performed both as planned failovers and as surprise or unplanned failovers. Your DRaaS provider should provide a clear roadmap of all the testing options that are available and what it costs to run them.
Veeam has been a leader in the DRaaS space since the very beginning, and is an underlying software platform and orchestration tool that can be managed by an end user or a service provider to enable DRaaS functionality. Veeam can be deployed to work across public and private clouds and provides robust functionality, configurability and reporting. Additionally, Veeam is designed to interoperate with major backup and storage systems such as Dell/EMC, ExaGrid, NetApp and Cisco. One of Veeam’s greatest attributes is that it is widely considered to be the easiest and most straightforward DRaaS product on the market.
Zerto is another major force in the world of DRaaS management platforms and can support both VMware and Microsoft Hyper-V environments (and even automates conversion between the two). Zerto specifically calls out its agnostic nature and allows users to seamlessly switch between Azure, AWS, GCP, and other public cloud providers. It goes to market with a pay-as-you-go model for its Virtual Replication service, which automates implementation, testing, and ongoing management.
Datto differs from both Zerto and Veeam in that it is an MSP-centric solution. Datto does not work with end users directly, and most end users do not even have access to the management console. Datto also requires its own proprietary appliances, while Veeam and Zerto can run on off-the-shelf hardware. Datto is designed to work with less configuration out of the box but offers functionality like simultaneous replication of data to both its onsite appliance and its cloud.
Top Managed DRaaS Providers
A great way to roll out a DRaaS solution is to use a managed services provider to help design and scope the project. There are many great providers out there that do an amazing job designing and implementing DRaaS solutions. Additionally, many MSPs offer multiple DRaaS technologies as part of their offering and can provide unbiased advice on which platforms align best to a specific organization's needs and use cases. A few examples of companies like this are Rackspace, EvolveIP, Flexential, Tierpoint and RapidScale.
What’s more is that these companies also provide comprehensive suites of parallel solutions which can complement and augment an effective DRaaS offering.
A successful DR solution that can be executed with confidence when the unforeseen happens is one of the most important elements of modern IT strategy. Being able to opt into this as a service that is economical, reliable, and easy to implement and manage makes it possible for organizations of all sizes to ensure business continuity, and for IT professionals to sleep a little more soundly at night.
If you still have questions about DRaaS, how to define your project and ensure a successful implementation strategy, schedule an appointment with us or send us an email and we will be more than happy to set up a time to talk.