Hey all,
Bringing in another incident report, yeah! We had a couple of hours of downtime this morning due to issues with our hosting provider.
What made this scary was that I went to sleep the night before knowing OVH's SBG-2 data center had burned to the ground. Not kidding, it caught fire and is a total heap of trash now. It also took out part of SBG-1. SBG-3 and SBG-4, the other two parts of that deployment, were taken offline, and OVH is currently working on getting what is left of SBG-1, SBG-3, and SBG-4 back online over the next 9 days.
Now, this fire didn't occur where we host Out of Cards so that wasn't an issue. But when your host is OVH and you wake up to a dead server, what do you think my first thought was?
Event Log
All times in Eastern.
7:52 AM - Alarms sound that the site is offline.
7:52 AM - Automated notification of a hardware fault on our server from our data center.
10:31 AM - Site is back online.
What Happened
It turns out your server goes offline when your data center decides to re-cable all the gear!
OVH went through and replaced "a large number of power supply cables" due to a potential problem with their insulation. The maintenance was deemed critical, and it also took out networking for some time on the rack our hardware is in, which extended our downtime.
In an ideal situation, we'd have better failover protocols and hardware in place for when something goes wrong at our primary data center. Unfortunately, the costs involved just aren't justifiable at this point in time. We keep regular multi-location backups so that if disaster strikes, we can be up and running somewhere else within a fairly short time frame; having containerized infrastructure helps quite a bit with the agility here, but there will always be some downtime if that happens. We lost some ad revenue and I know it was an inconvenience for people trying to access the site. I think we'll all live though!
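For the curious, the multi-location part of that looks roughly like the sketch below. This isn't our actual tooling, just a minimal illustration, and it assumes a Postgres database and pre-configured rclone remotes as stand-ins for whatever you actually run: dump nightly, then push the dump to more than one off-site location.

```python
#!/usr/bin/env python3
"""Minimal nightly backup sketch: dump a database and copy it to multiple
off-site locations. Hypothetical setup -- assumes a Postgres database and
pre-configured rclone remotes, neither of which is confirmed above."""
import datetime
import subprocess
import sys

DB_NAME = "sitedb"                                 # hypothetical database name
REMOTES = ["backup-eu:dumps", "backup-na:dumps"]   # hypothetical rclone remotes

def main() -> int:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d")
    dump_path = f"/var/backups/{DB_NAME}-{stamp}.dump"

    # Dump in Postgres custom format (compressed, restorable with pg_restore).
    subprocess.run(["pg_dump", "-Fc", DB_NAME, "-f", dump_path], check=True)

    # Copy the dump to every off-site location; one failed remote shouldn't skip the rest.
    failures = 0
    for remote in REMOTES:
        result = subprocess.run(["rclone", "copy", dump_path, remote])
        if result.returncode != 0:
            print(f"upload to {remote} failed", file=sys.stderr)
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    raise SystemExit(main())
```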
Today's downtime gave us an opportunity to confirm our disaster recovery situation and that our backups work properly. If there's something to learn from this situation, it's that you shouldn't wait until a potential disaster strikes to make sure you have everything in working order. Your backups are only good if they actually work!
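And that last line is the part people skip: actually restoring a dump somewhere and checking it. Here's another sketch with hypothetical names (a scratch Postgres database, a fixed dump path, following on from the backup sketch above). The idea is to run something like this on a schedule so a broken backup shows up before the disaster does.

```python
#!/usr/bin/env python3
"""Restore-test sketch: prove the latest dump actually restores instead of
assuming it does. Hypothetical -- assumes the Postgres setup from the backup
sketch above and a local scratch server to restore into."""
import subprocess

DUMP_PATH = "/var/backups/latest.dump"   # hypothetical path to the newest dump
SCRATCH_DB = "restore_test"

def restore_works() -> bool:
    # Recreate a throwaway database and restore the dump into it.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    restored = subprocess.run(["pg_restore", "-d", SCRATCH_DB, DUMP_PATH])
    if restored.returncode != 0:
        return False

    # Basic sanity check: the restored database should actually contain tables.
    count = subprocess.run(
        ["psql", "-d", SCRATCH_DB, "-tAc",
         "SELECT count(*) FROM information_schema.tables"
         " WHERE table_schema = 'public'"],
        capture_output=True, text=True, check=True,
    )
    return int(count.stdout.strip()) > 0

if __name__ == "__main__":
    print("restore OK" if restore_works() else "restore FAILED")
```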
I feel bad for everyone who lost critical information in the fires today. Sometimes it isn't even the fault of developers or the IT department for having poor backup practices. Sometimes your CEO just doesn't think it's necessary to spend money on a backup solution elsewhere to keep your data synced to at least one outside source. It's always fun when that same person who said it wasn't needed starts screaming when something really goes wrong, asking why there aren't backups in place, and pinning the blame on the people they overruled. We call those people assholes.
I don't recommend torching your data center as a method to wake up though.