How to Set Up a Failure-Tolerant Online Business in Six Steps - Step 4: Have Backup and Disaster Recovery Plans

 Originally published on December 05, 2011 by Dirk Paessler
Last updated on March 03, 2022

Dirk Paessler, CEO

At the Business of Software conference in Boston in October, I hosted a workshop titled "What do people do to keep their business online?" In this workshop I introduced the basic foundations of how we at Paessler run our business IT and what we do to make it failure tolerant. Those precautions are effectively what keeps our business online even as failures occur. And they will!

In this series of blog articles I will share with you six steps that will help make your business failure tolerant. See also: The Complete Series

 

Step 4: Have Backup and Disaster Recovery Plans

Your IT infrastructure is secured? Good. But what about your data backup? And are there any plans in case of a business emergency?

Five-Level Backup

Since we have almost all our mission-critical IT on virtualized systems, we can make use of the amazing backup technology available for VMs. The NetApp SAN can autonomously create and store snapshots of virtual machines. For the most important systems we do this every night. We then move these snapshots onto our LTO tape library every week, and finally store a set of tapes off-site every few weeks.

For important systems we additionally make file-system-based backups, which give us a second way to restore data in case the virtual machine image cannot be recovered.

As our third backup mechanism we use CrashPlan, which keeps a copy of all really important data, including source code, on an off-site computer. We do not use CrashPlan's cloud-based backup (where they store your backup data on their servers), though that may be an option for others. Apart from CrashPlan, you can also look at Backblaze as an attractive option.

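Counting the nightly snapshots, the weekly tapes, the off-site tape sets, the file-system backups, and the CrashPlan copy, this gives us five levels of backup.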
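To make the idea of several backup levels with different rhythms more concrete, here is a minimal Python sketch of how one might check whether any level has fallen behind. It is an illustration only, not the tooling we use: the names `BackupLevel`, `LEVELS`, and `stale_levels` are hypothetical, the four-week limit for the off-site tapes is an assumed stand-in for "every few weeks", and the intervals for the file-system backups and CrashPlan are left open because they are not stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BackupLevel:
    name: str
    max_age: Optional[timedelta]  # None = interval not stated in the article


# The five levels described above. Values marked "assumed" are
# illustrative placeholders, not figures from the article.
LEVELS = [
    BackupLevel("VM snapshot on SAN", timedelta(days=1)),      # every night
    BackupLevel("Snapshots on LTO tape", timedelta(weeks=1)),  # every week
    BackupLevel("Tape set off-site", timedelta(weeks=4)),      # assumed: "every few weeks"
    BackupLevel("File-system backup", None),                   # interval not stated
    BackupLevel("CrashPlan copy", None),                       # interval not stated
]


def stale_levels(last_run: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of levels whose last run is missing or too old."""
    stale = []
    for level in LEVELS:
        if level.max_age is None:
            continue  # no stated interval, so there is nothing to check
        ran = last_run.get(level.name)
        if ran is None or now - ran > level.max_age:
            stale.append(level.name)
    return stale


if __name__ == "__main__":
    now = datetime(2022, 3, 3, 12, 0)
    runs = {
        "VM snapshot on SAN": now - timedelta(hours=6),
        "Snapshots on LTO tape": now - timedelta(days=10),
    }
    for name in stale_levels(runs, now):
        print("Backup level overdue:", name)
```

A check like this only tells you that a backup ran recently. It says nothing about whether the backup can actually be restored, which has to be tested separately.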

Have a Disaster Recovery Plan

Think about bad things before they happen, so you're prepared.

If we lost our office (plane crash, fire, volcano, etc.), all our (surviving) employees would simply go home and work from there using our VPN. Downtime: Zero.

If we lost the colocation data center in Nuremberg, we would buy an LTO drive in a computer shop together with 10 cheap PCs and some networking equipment. We would go to our office (or find a basement with an internet connection), set up the hardware, restore the most important virtual machines, and use VMware on these cheap computers to run our infrastructure. Downtime of the office: 1-2 days. Downtime of the shop: Zero.

If we lost the Rackspace data center in Dallas, we would run the website and shop from Nuremberg, as mentioned before. Downtime: a few hours.
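A plan like this is easier to keep current when it is written down in a structured form. The following Python sketch shows one possible way to do that, using the three scenarios above as data. The names `Scenario`, `PLAN`, and `runbook` are hypothetical, and the "overall" downtime label is my shorthand for cases where the text gives a single figure.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Scenario:
    steps: Tuple[str, ...]    # what to do, in order
    downtime: Dict[str, str]  # affected service -> expected downtime


# The three scenarios from the text above, written down as data.
PLAN: Dict[str, Scenario] = {
    "office": Scenario(
        steps=(
            "All employees go home",
            "Work from home using the VPN",
        ),
        downtime={"overall": "zero"},
    ),
    "Nuremberg data center": Scenario(
        steps=(
            "Buy an LTO drive, 10 cheap PCs and networking equipment",
            "Set up the hardware in the office or another connected location",
            "Restore the most important virtual machines from tape",
            "Run the infrastructure with VMware on the cheap PCs",
        ),
        downtime={"office": "1-2 days", "shop": "zero"},
    ),
    "Dallas data center": Scenario(
        steps=("Run website and shop from Nuremberg",),
        downtime={"overall": "a few hours"},
    ),
}


def runbook(lost_site: str) -> str:
    """Render the plan for one scenario as a numbered checklist."""
    scenario = PLAN[lost_site]  # a KeyError means no plan exists for this site yet
    lines = [f"If we lose: {lost_site}"]
    lines += [f"  {i}. {step}" for i, step in enumerate(scenario.steps, start=1)]
    lines += [
        f"  Expected downtime ({what}): {how_long}"
        for what, how_long in scenario.downtime.items()
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    for site in PLAN:
        print(runbook(site))
        print()
```

Looking up a site that has no entry raises a `KeyError`, which is a useful reminder that a scenario nobody has thought about yet has no plan at all.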

Are We Paranoid?

 


"Are you paranoid?" That was one of the questions in the conference workshop in Boston. Well, maybe I am. But in 25 years of working with IT I have seen so many systems die, causing massive work spikes to recover or recreate data.

We always have much more work than we could ever get done. Adding extra work caused by computer failures is just so senseless.

 

In the next blog post I will talk about how you can keep track of all your systems to make sure they're up and running at all times, including your redundancies.

 

The Complete Series

At Paessler we have been selling software online for 15 years, and we have had hardware, software, and network failures just like everybody else. We tried to learn from each one of them and to change our setup so that the same failure would never happen again.

Read the other posts of this series: