Watch those backups

You can’t tell that I had to restore this blog a few days ago, and that’s a good thing.
Unless you were using this site during a roughly 30-minute window late Sunday night, you wouldn’t have noticed a thing. That’s when I did my monthly patching, and something went wrong. I don’t know exactly what happened. Since this is a small site running on a single server, I haven’t bothered with much automation. Once a month, more or less, I log in, update the operating system to the latest version of Amazon Linux along with all the various other packages I use, then reboot the server. It takes about 5 minutes and gives me an opportunity to log in and review things, which is something I want to do anyway, so as I said, automation doesn’t make a whole lot of sense.
This time, something didn’t work. After running my usual upgrade and rebooting, I was unable to connect to the EC2 instance, and the health checks failed as well. Rebooting again, and even stopping and restarting the instance, didn’t solve the problem.
Fortunately, one of the things I always do as part of my monthly update process is take a full snapshot of the root volume. I keep three versions of history.
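I take that snapshot by hand in the console, but if you’d rather script it, a rough boto3 sketch might look something like this (the volume ID and the tag are placeholders, not my actual setup):

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2")

# Placeholder ID for the root EBS volume; substitute your own.
ROOT_VOLUME_ID = "vol-0123456789abcdef0"

# Take a point-in-time snapshot of the root volume before patching.
snapshot = ec2.create_snapshot(
    VolumeId=ROOT_VOLUME_ID,
    Description=f"pre-patch root snapshot {datetime.now(timezone.utc):%Y-%m-%d}",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "purpose", "Value": "monthly-pre-patch"}],
    }],
)
print("Started snapshot", snapshot["SnapshotId"])
```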
First, a reminder: snapshots are not backups! While they may back up your data, they are generally not a sufficient backup strategy if you need any kind of granular recovery options. The volume (/html) that contains my WordPress installation and data, and the RDS database that contains more WordPress data, both have more robust backup strategies. But for the root volume of a system that isn’t updated often, and that I could recreate completely in just a couple of hours, I judged regular snapshots that let me restore everything to a point in time (typically, just after the last successful update) to be adequate.
I re-created the root volume from the snapshot I took a bit over a month ago. With that in place, I started the instance and everything worked.
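I did that swap in the console, but the same recovery can be scripted: create a new volume from the known-good snapshot in the instance’s Availability Zone, detach the bad root volume from the stopped instance, attach the new one under the same device name, and start it back up. Here’s a hedged boto3 sketch, with every ID and the device name as placeholders you’d need to confirm for your own instance:

```python
import boto3

ec2 = boto3.client("ec2")

# All of these are placeholders for illustration.
INSTANCE_ID = "i-0123456789abcdef0"
SNAPSHOT_ID = "snap-0123456789abcdef0"   # last known-good root snapshot
OLD_ROOT_VOLUME_ID = "vol-0badbadbadbad00000"
AZ = "us-east-1a"                        # must match the instance's AZ
ROOT_DEVICE = "/dev/xvda"                # root device name; check your instance

# Create a replacement root volume from the known-good snapshot.
new_vol = ec2.create_volume(SnapshotId=SNAPSHOT_ID, AvailabilityZone=AZ, VolumeType="gp3")
ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol["VolumeId"]])

# With the instance stopped, swap the root volumes.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
ec2.detach_volume(VolumeId=OLD_ROOT_VOLUME_ID, InstanceId=INSTANCE_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[OLD_ROOT_VOLUME_ID])
ec2.attach_volume(VolumeId=new_vol["VolumeId"], InstanceId=INSTANCE_ID, Device=ROOT_DEVICE)

# Boot from the restored volume.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
```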
I followed up with granular system updates: first the operating system only, then each other package separately, rebooting after each update so that, if something broke, I’d know exactly what had failed. Nothing did. I’m still not sure what happened the first time around.
When it was all done and I verified that everything was fine, I took a snapshot of the latest good environment and deleted the oldest snap, leaving me with the most recent three.
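The pruning step is simple enough to do in the console too, but “keep the three newest, delete the rest” is easy to express in boto3 if you ever want to. A sketch, assuming the snapshots carry the illustrative tag from the example above:

```python
import boto3

ec2 = boto3.client("ec2")

KEEP = 3  # how many generations of root-volume snapshots to retain

# Find my snapshots tagged as monthly pre-patch copies (tag name is illustrative).
resp = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:purpose", "Values": ["monthly-pre-patch"]}],
)
snapshots = sorted(resp["Snapshots"], key=lambda s: s["StartTime"], reverse=True)

# Delete everything beyond the newest KEEP snapshots.
for snap in snapshots[KEEP:]:
    print("Deleting", snap["SnapshotId"], snap["StartTime"])
    ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```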
This prompted me to also review the snapshot/backup strategies for the /html volume and the database. I made a few changes to retention policies, but they’re largely unchanged. This site isn’t updated often, and even if I lost a few days of work, it would likely only mean the loss of one or two posts that are easily recreated. As such, my backups are far less robust than I’d insist on for an enterprise application. As I noted when I documented the setup, I find it helpful to do things without automation from time to time, and a small environment like this is a good one to have for that kind of thing.
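The database side of that review mostly amounted to confirming the automated backup retention window on the RDS instance. If you ever want to check or adjust yours from a script, boto3 exposes it directly; the instance identifier and the seven-day window below are just examples, not my actual settings:

```python
import boto3

rds = boto3.client("rds")

DB_INSTANCE = "my-wordpress-db"  # placeholder identifier

# Check the current automated backup retention period (in days).
db = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE)["DBInstances"][0]
print("Current retention:", db["BackupRetentionPeriod"], "days")

# Adjust it if needed; ApplyImmediately avoids waiting for the maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier=DB_INSTANCE,
    BackupRetentionPeriod=7,
    ApplyImmediately=True,
)
```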
I also revisited the backups for my computers, NAS, and home network, and found a few holes that I patched. More on that in the next post.