Out of Cards Incident Report - 2021-03-10

Submitted 3 years, 1 month ago by Fluxflashor

Hey all,

Bringing in another incident report, yeah! We had a couple of hours of downtime this morning due to issues with our hosting provider.

What made this scary was that I went to sleep the night before knowing OVH's SBG-2 data center had burned to the ground. Not kidding, it caught fire and is a total heap of trash now. It also took out part of SBG-1. SBG-3 and SBG-4, the other two parts of that deployment, were taken offline, and OVH is currently working on getting what is left of SBG-1, SBG-3, and SBG-4 back online over the next nine days.

Now, this fire didn't occur where we host Out of Cards so that wasn't an issue. But when your host is OVH and you wake up to a dead server, what do you think my first thought was?


Event Log

All times in Eastern.

7:52 AM - Alarms sound that the site is offline.

7:52 AM - Automated notification of a hardware fault on our server from our data center.

10:31 AM - Site is back online.


What Happened

It turns out your server goes offline when the data center decides to re-cable all the gear!

OVH went through and replaced "a large number of power supply cables" due to a potential problem with their insulation. The maintenance was critical, and it also took out networking for some time on the rack our hardware is in, which extended our downtime.

In an ideal situation, we'd have better failover protocols and hardware in place in case something happened at our primary data center. Unfortunately, the costs involved make that hard to justify at this point in time. We keep regular multi-location backups so that, if disaster strikes, we can be up and running somewhere else within a fairly short time frame; having containerized infrastructure helps quite a bit with agility here, but there will always be some downtime in that scenario. We lost some ad revenue and I know it was an inconvenience for people trying to access the site. I think we'll all live though!

Today's downtime gave us an opportunity to confirm our disaster recovery setup and that our backups work properly. If you take one thing from this situation, it's that you shouldn't wait until a potential disaster strikes to make sure everything is in working order. Your backups are only good if they actually work!
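
To make "your backups are only good if they actually work" concrete, here is a rough sketch of the kind of automated sanity check I mean. This isn't our actual tooling; the paths, checksum file, and expected contents are placeholders for whatever your own nightly archive looks like:

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

# Hypothetical locations; point these at wherever your backups actually land.
BACKUP = Path("/backups/site-latest.tar.gz")
CHECKSUM = Path("/backups/site-latest.sha256")
EXPECTED = ["db.sql", "uploads"]  # things the archive must contain to be restorable

def verify_backup() -> bool:
    # 1. The archive has to exist and not be suspiciously small.
    if not BACKUP.exists() or BACKUP.stat().st_size < 1024 * 1024:
        print("backup missing or suspiciously small")
        return False

    # 2. Its checksum has to match the one recorded when it was written.
    h = hashlib.sha256()
    with BACKUP.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != CHECKSUM.read_text().split()[0]:
        print("checksum mismatch, backup is corrupt")
        return False

    # 3. It has to actually extract and contain the things you'd restore from.
    with tempfile.TemporaryDirectory() as tmp, tarfile.open(BACKUP) as tar:
        tar.extractall(tmp)
        names = tar.getnames()
        missing = [e for e in EXPECTED
                   if not any(n == e or n.startswith(e + "/") for n in names)]
        if missing:
            print(f"archive is missing: {missing}")
            return False

    print("backup looks restorable")
    return True

if __name__ == "__main__":
    raise SystemExit(0 if verify_backup() else 1)
```

Even a check like this only proves the archive is intact; the real test is still restoring it into a scratch environment every so often, which is essentially what today forced us to do.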

I feel bad for everyone who lost critical information in the fires today. Sometimes it isn't even the fault of the developers or the IT department having poor backup practices. Sometimes your CEO just doesn't think it is necessary to spend money on a backup solution that keeps your data synced to at least one outside location. It's always fun when that same person who said it wasn't needed starts screaming and asking why there aren't backups in place when something really goes wrong, and tries to pin the blame on the people they turned down. We call those people assholes.

I don't recommend torching your data center as a wake-up call, though.

  • h0lysatan
    Zombie 1065 790 Posts Joined 12/03/2019
    Posted 3 years, 1 month ago

    "The numbers Mason, what do they mean?" CoD.

    Jokes aside, in my 3 years of experience in IT (not on the server side directly, but doing client maintenance), every server room should have at least basic disaster prevention tools. In this case, fire extinguishers. Lots of 'em. (You can't really use water to put out an electrical fire, as my supervisor used to say.)

    Knowledge is Power

    1
  • Cheese
    270 163 Posts Joined 05/30/2019
    Posted 3 years, 1 month ago

    That fire happened just a few miles from where I live. Kinda weird to think it had such far-reaching consequences.

    2
  • og0
    Red Riding Hood 1570 1062 Posts Joined 03/31/2019
    Posted 3 years, 1 month ago

    If you take one thing from this situation, it's that you shouldn't wait until a potential disaster strikes to make sure everything is in working order.

    Totally true. After 20 years in sysadmin I can testify to that. We had two disaster recovery sites, with the main system synced to DR1 and DR2 every 5 minutes. I designed and implemented it, wrote the disaster recovery documents, and carried out a full test every year where I basically shut down the central server room and we ran off the backup systems for a day.

    Separately, test the primary UPS power backup and the auxiliary generator.

    It's the only way to be sure, and it lets the techs and devs know what to do. Ah, the good old days (I'm retired now).
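
    Not og0's actual setup, obviously, but the "sync the main system to DR1 and DR2 every 5 minutes" idea looks roughly like this sketch, assuming plain file data, SSH access, and made-up hostnames (a real deployment would more likely use database-level replication plus a cron job or systemd timer rather than a sleep loop):

    ```python
    import subprocess
    import time

    # Hypothetical source directory and DR targets; adjust to your environment.
    SOURCE = "/srv/app/data/"
    DR_TARGETS = ["dr1.example.net:/srv/app/data/", "dr2.example.net:/srv/app/data/"]
    INTERVAL = 5 * 60  # seconds, matching the "every 5 minutes" cadence

    def sync_once() -> None:
        for target in DR_TARGETS:
            # --archive keeps permissions and timestamps, --delete mirrors removals,
            # --partial lets an interrupted transfer resume on the next pass.
            result = subprocess.run(
                ["rsync", "--archive", "--delete", "--partial", SOURCE, target],
                capture_output=True, text=True,
            )
            if result.returncode != 0:
                # This is the point where a real setup would page someone.
                print(f"sync to {target} failed: {result.stderr.strip()}")

    if __name__ == "__main__":
        while True:
            sync_once()
            time.sleep(INTERVAL)
    ```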

    All generalizations are false.

    3
  • frenzy
    945 474 Posts Joined 05/30/2019
    Posted 3 years, 1 month ago

    I just pop a copy on my flash drive - it'll be fine

    1
  • GameTheory345
    Island 475 386 Posts Joined 05/29/2019
    Posted 3 years, 1 month ago

    I've been reading up on some forums, and I felt the need to say this because nobody seems to understand: data is expensive. You can't just "save all the data in a bunch of data centres" because that costs too much. A 1 terabyte hard drive already runs upwards of $140, and you also have to pay for security and bandwidth and a whole bunch of stuff. If people could save their data everywhere, then they would. 

    EDIT: Should probably point out that I don't work in IT. My knowledge comes from my friends that do.

    ???

    0
  • Fluxflashor
    CEO 2005 3069 Posts Joined 10/19/2018
    Posted 3 years, 1 month ago
    Quote From GameTheory345

    I've been reading up on some forums, and I felt the need to say this because nobody seems to understand: data is expensive. You can't just "save all the data in a bunch of data centres" because that costs too much. A 1 terabyte hard drive already runs upwards of $140, and you also have to pay for security and bandwidth and a whole bunch of stuff. If people could save their data everywhere, then they would. 

    EDIT: Should probably point out that I don't work in IT. My knowledge comes from my friends that do.

    Relatively speaking, data storage isn't that expensive, and neither is bandwidth; compute sure as hell is, though.

    In the long run it is cheaper to run your own drives with hardware RAID in a couple of different places if we're just talking about files, but you're better off using an object store like Amazon's S3. They worry about all the stuff that can go wrong, and your data is stored with the loss of two data centers already accounted for. (The odds of three data centers losing the same files at the same time? I'll take that risk.) Even with that in mind, anything critical you store on S3 you should also mirror somewhere else. It could be an on-premises NAS or the Google / Azure clouds - never trust a single service.
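
    As a rough illustration of the "mirror it somewhere else" point (not our setup; the bucket name and NAS path below are made up), a small boto3 script can pull everything in a bucket down to an on-premises share:

    ```python
    from pathlib import Path

    import boto3  # assumed dependency: pip install boto3

    # Hypothetical bucket and mirror location, purely for illustration.
    BUCKET = "example-critical-data"
    MIRROR_ROOT = Path("/mnt/nas/s3-mirror")

    def mirror_bucket() -> None:
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                dest = MIRROR_ROOT / obj["Key"]
                # Skip objects we already have at the same size; crude but cheap.
                if dest.exists() and dest.stat().st_size == obj["Size"]:
                    continue
                dest.parent.mkdir(parents=True, exist_ok=True)
                s3.download_file(BUCKET, obj["Key"], str(dest))

    if __name__ == "__main__":
        mirror_bucket()
    ```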

    If you need a database though, that's when things can get pricey. Depending on the workloads you're dealing with, you're going to need comparable hardware in your second location if you want a real-time replica (which ensures little to no data loss when the primary fails). You'll also want backups (at least daily) dumped into storage, but storage itself is cheap, so that's not a big deal.
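
    And the daily dump-into-storage part could look something like this sketch, assuming PostgreSQL, the pg_dump and gzip CLIs, and hypothetical database and bucket names:

    ```python
    import datetime
    import subprocess

    import boto3  # assumed dependency: pip install boto3

    # Hypothetical names, purely for illustration.
    DB_NAME = "example_site"
    BUCKET = "example-db-backups"

    def nightly_dump() -> None:
        stamp = datetime.date.today().isoformat()
        dump_path = f"/tmp/{DB_NAME}-{stamp}.sql.gz"

        # Stream pg_dump through gzip so the full dump never sits uncompressed on disk.
        with open(dump_path, "wb") as out:
            dump = subprocess.Popen(["pg_dump", DB_NAME], stdout=subprocess.PIPE)
            gzip = subprocess.Popen(["gzip", "-c"], stdin=dump.stdout, stdout=out)
            dump.stdout.close()  # let gzip see EOF when pg_dump finishes
            gzip.wait()
            if dump.wait() != 0 or gzip.returncode != 0:
                raise RuntimeError("database dump failed")

        # Cheap object storage holds the dailies; real-time loss protection is the replica's job.
        boto3.client("s3").upload_file(dump_path, BUCKET, f"{DB_NAME}/{stamp}.sql.gz")

    if __name__ == "__main__":
        nightly_dump()
    ```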

    If your business relies on its data, it is in your best interest to make sure you can recover when a failure happens. Failures are ALWAYS a when, not an if. You can be as proactive as you want, but even that is not enough to give you a 100% guarantee that you are safe. If a business can't afford the cost of making regular backups of its essential data, I don't think it's doing a very good job as a business.

    The only thing that Out of Cards doesn't have is the ability to instantly switch the site onto a fresh set of hardware. There would be some downtime involved since our compute isn't distributed. I don't consider it essential at this point, though it would be nice, because we're talking about a relatively small amount of downtime. I'd prefer us to have zero, but that's a spot I'm willing to penny-pinch on for now.

    In an ideal setup, Out of Cards would be distributed in a few locations and we'd run Chaos Monkey.
    https://github.com/Netflix/chaosmonkey

    I've always wanted to run infrastructure with that in the mix.

    Founder, Out of Games

    Follow me on Twitch and Twitter.
    If you are planning on playing WoW on US realms, consider using my recruit link =)

    3