Out of Cards Incident Report - 2020-10-17

Submitted 4 years, 8 months ago by Fluxflashor

Hey all,

We had some unintended downtime this way-too-early morning. Because this kind of stuff always interests me, and transparency can be fun, let's see what went wrong!

Event Log

All times in Eastern.

2:43 AM - Out of Cards goes offline.

2:49 AM - A service unreachable email gets fired off.

This email did not trigger the alerts that it should have. Our alerting tech changed when we migrated the site a couple of months ago, and it was never tested afterward. Pfft, downtime. That'll never happen!

3:01 AM - The messages begin to hit me on Discord and Twitter.

See, you don't need proper downtime checks!

3:03 AM - Investigation begins.

3:19 AM - Problem discovered.

Our container service decided it wanted to do an automatic update. That really should not have been an issue, but because the containers running our site were not set to restart automatically, the site went down.

3:22 AM - Site comes back online.

Updates for the service have been disabled - we'd rather manage these ourselves anyway.

What We're Changing

Obviously, our big issue is that our app containers didn't restart themselves. That's shitty and a huge oversight. A configuration change will go out in the morning that will resolve this.
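For anyone curious what "restart themselves" means in practice, here is a minimal sketch of the idea in Python. It is not our actual configuration change, and it does not model any particular container service; `supervise`, `max_restarts`, and `delay_seconds` are hypothetical names made up for this illustration. A restart policy does roughly this for you: whenever the process stops, start it again.

```python
import subprocess
import time


def supervise(command, max_restarts=3, delay_seconds=1.0,
              run=subprocess.run, sleep=time.sleep):
    """Run `command`, starting it again every time it exits.

    Hypothetical sketch of an "always restart" policy. A real policy
    usually restarts forever; this one stops after `max_restarts`
    restarts so the example terminates. Returns the exit code of
    every run, in order.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        if attempt > 0:
            # Short pause so a crash-looping process does not spin the CPU.
            sleep(delay_seconds)
        completed = run(command)  # blocks until the process exits
        exit_codes.append(completed.returncode)
    return exit_codes
```

The `run` and `sleep` parameters exist only so the loop can be exercised without launching a real process. Without something like this loop in place, a process that stops for any reason simply stays stopped, which is what happened to us.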

The secondary issue is garbage monitoring of the site. There should be some serious alarms going off when stuff goes wrong. I'll make sure we have better processes in place going forward to notify us of downtime.
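As a rough idea of what a proper downtime check could look like, here is a small Python sketch. It is an illustration under assumptions, not the monitoring we run: `is_up`, `DowntimeTracker`, and the threshold of three consecutive failures are all hypothetical choices made for this example.

```python
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx or 3xx status in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts are all
        # OSError subclasses, so any of them counts as "down".
        return False


class DowntimeTracker:
    """Decide when to raise an alarm based on consecutive failed checks."""

    def __init__(self, failures_before_alert=3):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, up):
        """Record one check result; return True when an alert should fire.

        The alert fires once, at the moment the threshold is reached,
        and the counter resets as soon as a check succeeds.
        """
        if up:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failures_before_alert
```

A scheduled job would call `tracker.record(is_up(url))` every minute or so and page someone when it returns True. Requiring a few failures in a row avoids waking anyone up for a single dropped request, and the check only helps if it runs somewhere other than the machine hosting the site.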

Hopefully there isn't a next time!

Founder, Out of Games
