Out of Cards Incident Report - 2020-10-17
Hey all,
We had some unintended downtime this way-too-early morning. Because this kind of stuff always interests me, and transparency can be fun, let's see what went wrong!
Event Log
All times in Eastern.
2:43 AM - Out of Cards goes offline.
2:49 AM - A service unreachable email gets fired off.
This email did not trigger the alerts that it should have. Our alerting setup changed when we migrated the site a couple of months ago, and it was never tested afterward. Pfft, downtime. That'll never happen!
3:01 AM - The messages begin to hit me on Discord and Twitter.
See, you don't need proper downtime checks!
3:03 AM - Investigation begins.
3:19 AM - Problem discovered.
Our container service decided it wanted to do an automatic update. That really should not have been an issue, but because the containers running our site were not set to restart automatically, they stayed down after the update and the site went offline.
3:22 AM - Site comes back online
Automatic updates for the service have been disabled - we'd rather manage these ourselves anyway.
What We're Changing
Obviously, our big issue is that our app containers didn't restart themselves. That's shitty and a huge oversight. A configuration change will go out in the morning that will resolve this.
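For the curious, here's roughly what "restart automatically" means in practice. This is a minimal sketch in Python, not our actual configuration: container services normally handle this with a built-in restart policy setting, and `run_with_restart` and its parameters are made-up names for illustration. All it does is rerun a command every time it exits, up to a limit.

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=3, backoff_seconds=0.0):
    """Run cmd and start it again each time it exits.

    Hypothetical helper for illustration only. Returns the exit codes
    seen, one per run, so the total is max_restarts + 1 entries.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        # Blocks until the child process exits, for whatever reason.
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        if attempt < max_restarts:
            # Short pause so a crash loop doesn't spin the CPU.
            time.sleep(backoff_seconds)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for an app that keeps dying: a child that exits with code 1.
    crashing_app = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_with_restart(crashing_app, max_restarts=2))
```

A real restart policy runs indefinitely rather than stopping after a fixed count; the limit here only keeps the sketch easy to run and test.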
The secondary issue is garbage monitoring of the site. There should be some serious alarms going off when stuff goes wrong. I'll make sure we have better processes in place going forward to notify us of downtime.
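Here's a rough sketch of the kind of downtime check I mean, in Python. It's an illustration under assumptions, not what we run: `is_up`, `watch`, the thresholds, and the URL in the usage note are all hypothetical. The alert is a plain callback so the alerting path itself can be exercised in a test, which is exactly the part that bit us.

```python
import time
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if url answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), timeouts and connection
        # errors are all OSError subclasses.
        return False


def watch(check, alert, failures_before_alert=2, checks=10,
          interval_seconds=60.0, sleep=time.sleep):
    """Call check() repeatedly; alert once per outage.

    alert(message) fires when the number of consecutive failed checks
    first reaches failures_before_alert, then stays quiet until the
    site has recovered and failed again.
    """
    consecutive_failures = 0
    for i in range(checks):
        if check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == failures_before_alert:
                alert(f"site down: {consecutive_failures} failed checks in a row")
        if i < checks - 1:
            sleep(interval_seconds)
```

Something like `watch(lambda: is_up("https://example.com/"), print)` would wire the two together, with `print` swapped for whatever actually wakes a human up. Requiring a couple of failures in a row is a design choice to avoid paging on a single dropped request.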
Hopefully there isn't a next time!
Founder, Out of Games