Why you should be paranoid about web server backups
Imagine the worst case scenario. Your biggest client's office is on fire and their servers are literally going up in flames. But you have multiple backups, right?
One thing that working with public-facing live websites has taught me is that it’s important to have a clearly defined strategy for recovering when things go wrong. That includes multiple levels of backups as fallback options in case the worst-case scenario happens.
I’m going to touch on some simple best practices for backups, and any precautions to take before making changes that could have unintended consequences. In a perfect world we would all have redundant fallback servers in separate environments for every web server in production, but not everyone’s budget can accommodate this.
Before making changes
OK, so a client wants a change made that could have far-reaching and potentially negative consequences. Before you touch anything, go through all your backups (hopefully you have a variety of snapshots, and onsite and offsite backups) and make sure everything has gone through successfully on the last cycle.
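That verification step can be partly automated. Here’s a minimal Python sketch that checks each backup location for a file newer than the last daily cycle — the paths and the 26-hour threshold are assumptions you’d swap for your own, and a script like this complements (never replaces) getting eyes on everything:

```python
import tempfile
import time
from pathlib import Path

def newest_backup_age_hours(backup_dir: Path) -> float:
    """Age in hours of the most recently modified file under backup_dir,
    or infinity if the directory is missing or empty."""
    if not backup_dir.is_dir():
        return float("inf")
    mtimes = [p.stat().st_mtime for p in backup_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return float("inf")
    return (time.time() - max(mtimes)) / 3600

def check_backups(dirs, max_age_hours=26.0):
    """Print one status line per backup location; return True only if
    every location has something newer than max_age_hours."""
    all_fresh = True
    for d in dirs:
        age = newest_backup_age_hours(d)
        fresh = age <= max_age_hours
        print(f"{d}: newest file {age:.1f}h old [{'OK' if fresh else 'STALE'}]")
        all_fresh = all_fresh and fresh
    return all_fresh

# Demo against a throwaway directory (swap in your real NAS/offsite mounts):
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "site-backup.tar.gz").write_bytes(b"fake backup")
    check_backups([Path(tmp)])
```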
DON’T TRUST ANY MONITORING SOFTWARE YOU MAY HAVE — GET EYES ON EVERYTHING!
Are there any other sites hosted on the server you’re changing? If so, make the appropriate people aware of a content freeze in case a restore is needed.
Assuming everything above went well and looks shipshape, the next step is to take a snapshot in whatever virtualization solution you’re using. (If your servers aren’t virtualized, well, that’s a topic for a whole other blog post, but you should get on that.) Also consider chain reactions: once that snapshot is complete, check whether the server reaches out to any other servers for databases (or similar) that could also be impacted, and snapshot those as well if required. Now you’re ready to actually make that change for your client!
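That snapshot-plus-dependencies routine can be captured in a short pre-change script. This is only a sketch: `snapshot_vm` is a hypothetical placeholder you’d wire up to your own platform’s CLI (for example `virsh snapshot-create-as` on libvirt), and the server names and dependency map are made up for illustration:

```python
import subprocess

def snapshot_vm(vm_name: str, label: str) -> None:
    """Placeholder: call out to your virtualization platform's CLI here.
    On libvirt this might be `virsh snapshot-create-as`; adapt as needed."""
    print(f"[snapshot] {vm_name}: {label}")
    # subprocess.run(["virsh", "snapshot-create-as", vm_name, label], check=True)

# Map each web server to the backend servers it reaches out to.
# Hypothetical names, for illustration only.
DEPENDENCIES = {
    "web01": ["db01", "cache01"],
}

def pre_change_snapshots(target: str, label: str) -> list:
    """Snapshot the target plus everything it depends on, so a restore
    never leaves a dependent database out of sync. Returns what was done."""
    to_snapshot = [target] + DEPENDENCIES.get(target, [])
    for vm in to_snapshot:
        snapshot_vm(vm, label)
    return to_snapshot

pre_change_snapshots("web01", "pre-client-change")
```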
Everything has gone horribly wrong!
For entertainment’s sake, we’re going to assume everything has gone horribly wrong at this point. Hopefully nothing’s literally on fire. As mentioned above, it’s nice to have multiple backup solutions going with different restore times.
Snapshots are great to fall back on if everything is completely destroyed after a change, and they’re quick to restore. However, it’s often the case that after a change is made everything looks great at initial inspection, but weeks or months down the line someone notices comments or graphics are missing, and enough changes have been made by content authors in the meantime that it’s not feasible to do a snapshot restore.
This is where onsite image and/or file backups come into play. My personal preference is for solutions like Acronis VMProtect that grab full images of your virtual machines, which you can restore wholesale or use to restore individual files. These run on a grandfather-father-son backup scheme, which means there are usually three or more separate backup cycles: daily, weekly, and monthly.
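To make the grandfather-father-son idea concrete, here’s a small Python sketch of the retention logic — not Acronis’s actual algorithm, just an illustration of how the daily, weekly, and monthly cycles overlap (picking Sundays and the first of the month is an arbitrary choice here):

```python
from datetime import date, timedelta

def gfs_keep(backup_dates, dailies=7, weeklies=4, monthlies=12):
    """Given the dates of existing backups, return the set to retain under
    a simple grandfather-father-son policy: the newest `dailies` backups
    (sons), the newest `weeklies` Sunday backups (fathers), and the newest
    `monthlies` first-of-month backups (grandfathers)."""
    ordered = sorted(backup_dates, reverse=True)
    keep = set(ordered[:dailies])                                  # sons
    keep.update([d for d in ordered if d.weekday() == 6][:weeklies])   # fathers
    keep.update([d for d in ordered if d.day == 1][:monthlies])        # grandfathers
    return keep

# Example: a year of daily backups ending 2024-06-01
today = date(2024, 6, 1)
backups = [today - timedelta(days=i) for i in range(365)]
kept = gfs_keep(backups)
print(f"{len(backups)} backups -> retain {len(kept)}")
```

Note how the three cycles overlap: a Sunday inside the last seven days, or a first-of-month that is also a Sunday, only needs to be kept once.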
It’s also a best practice to have these backups going to a local NAS (or similar) that you can restore from pretty much instantly, and then mirrored offsite in case of disaster, such as a fire. Never have all your eggs in one basket — or in this case, all your backups in one data centre.
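A common way to get that offsite mirror is a nightly `rsync` of the NAS backup share to a remote host. The sketch below just assembles the command (and can optionally run it) — the NAS path and the offsite host are hypothetical placeholders:

```python
import subprocess

def mirror_offsite(local_dir: str, remote: str, dry_run: bool = True) -> list:
    """Build (and optionally run) an rsync command that mirrors the local
    backup directory to an offsite destination."""
    cmd = [
        "rsync",
        "-az",         # archive mode, compressed transfer
        "--delete",    # keep the mirror an exact copy of the source
        "--partial",   # resume interrupted transfers of large images
        local_dir.rstrip("/") + "/",  # trailing slash: copy contents, not the dir
        remote,
    ]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

# Hypothetical NAS path and offsite host, for illustration only:
print(" ".join(mirror_offsite("/mnt/nas/backups", "offsite:/srv/mirror/backups")))
```

Run from cron (or a systemd timer) after the nightly backup cycle completes, this keeps the offsite copy at most one cycle behind the NAS.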
You’ve recovered from disaster. Now what?
One thing a lot of companies seem to do after things have gone wrong is point the finger, which is not the right approach. If any downtime was caused by a request from a client, inform them in a diplomatic way. Hopefully you’ve advised them this was a possibility when they made the request. If you didn’t, that’s on you — admit you should have informed them.
Now the big kicker is when downtime was caused by a mistake made by you or a member of your team. This is when complete transparency is most important. If you’re less than completely honest it will come out at some point, and then you’ve lost trust with your client, which is difficult to regain. On the other hand, if you’re upfront about it and tell your client what happened, how you recovered from it, and how you plan to ensure it never happens again, they’ll generally be impressed you were able to recover so quickly and understand that no infrastructure is completely bulletproof. Honesty and integrity are every bit as important as solid backup and recovery methods.
Are there any methods or apps you’ve found useful in disaster recovery? Leave your tips below.