Site Incident Report - June 8, 2024
Submitted by Fluxflashor
We've now fully recovered from an outage that began this morning Eastern time.
Timeline
9:57 PM (June 7)
- We received an alert that the website was down.
- Connectivity issues were intermittent.
- The problem did not last long enough for any proper debugging to occur.
5:41 AM (June 8)
- We received an alert that the website was down.
- Notifications did not fire as expected, so no one on our ops team was notified.
- We sleep.
- Based on our traffic data, the site came back online a few moments later.
11:41 AM
- We received an alert that the website was down.
- Investigations began.
- Private networking between our hardware was failing on connections to our primary database.
- Servers were rebooted.
12:09 PM
- Our ISP notified us that a hardware failure had occurred on the rack housing our private networking equipment.
- We opted not to fail over the site to our secondary nodes, as we expected the failure to be resolved quickly.
5:19 PM
- Our ISP indicated the hardware had been replaced.
- Private networking between our hardware was working again.
- Out of Games came back online at 50% capacity.
5:27 PM
- We started a reboot process on our hardware.
- One of our servers no longer responded to ping requests.
- Out of Games was online at 30% capacity.
5:49 PM
- Our ISP's datacenter team began a rescue operation on the downed server.
6:39 PM
- The bad CMOS battery in the downed server was replaced.
- Server booted successfully.
- Out of Games remained at 30% capacity.
9:56 PM
- Fixed an operating system issue that was preventing Out of Games services from communicating with each other.
- Docker Swarm was having a fit with our networking cards.
- We had previously run into this issue after updating our Docker software and thought it was resolved.
- Our changes had not persisted correctly across server reboots (a sketch of one way to make such settings stick follows below).
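As a rough illustration of making that kind of change stick, here's a minimal Python sketch that persists kernel modules and sysctl values via /etc/modules-load.d and /etc/sysctl.d (the standard systemd mechanisms), then applies them immediately. The specific modules and sysctl keys shown are common Docker Swarm overlay-networking requirements and are assumptions for illustration, not the exact settings from our incident.

```python
#!/usr/bin/env python3
"""Sketch: persist kernel settings that Docker Swarm overlay networking
commonly relies on, so they survive reboots. The module and sysctl names
below are illustrative assumptions -- adjust for your own hardware and
Docker version. Must run as root."""

import pathlib
import subprocess

# Hypothetical set of settings; verify against your own environment.
MODULES = ["overlay", "br_netfilter"]
SYSCTLS = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.ipv4.ip_forward": "1",
}

def persist_and_apply():
    # Persist module loading across reboots via /etc/modules-load.d.
    pathlib.Path("/etc/modules-load.d/docker-swarm.conf").write_text(
        "\n".join(MODULES) + "\n"
    )
    # Persist sysctl values across reboots via /etc/sysctl.d.
    pathlib.Path("/etc/sysctl.d/99-docker-swarm.conf").write_text(
        "".join(f"{key} = {value}\n" for key, value in SYSCTLS.items())
    )
    # Apply immediately for the current boot as well.
    for module in MODULES:
        subprocess.run(["modprobe", module], check=True)
    subprocess.run(["sysctl", "--system"], check=True)

if __name__ == "__main__":
    persist_and_apply()
```

On systemd-based distros, the files written above are picked up automatically on every subsequent boot, so the live state and the persisted state can't silently drift apart.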
What We've Learned
- Confirm that any changes you make to your operating systems actually persist across reboots (see the verification sketch after this list).
- Just one of those things everyone should be more careful about. Document everything.
- We should have configured our hardware to simply log CMOS errors instead of halting on them.
- This would have let our hardware come back online without any real issue.
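To make that first lesson concrete, here's a minimal sketch of a post-reboot verification check, assuming you run it from a systemd timer or cron job shortly after boot and alert on a nonzero exit. The expected sysctl values are illustrative assumptions matching the sketch above, not settings taken from our incident.

```python
#!/usr/bin/env python3
"""Sketch: a post-reboot check that expected kernel settings actually
took effect. Exits nonzero on any mismatch so a monitoring hook can
alert on it. The expected values below are illustrative only."""

import subprocess
import sys

# Hypothetical expected state; mirror whatever you persisted to disk.
EXPECTED_SYSCTLS = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.ipv4.ip_forward": "1",
}

def main() -> int:
    failures = []
    for key, want in EXPECTED_SYSCTLS.items():
        # If sysctl errors (e.g. module not loaded), stdout is empty
        # and the comparison below fails, which is what we want.
        got = subprocess.run(
            ["sysctl", "-n", key], capture_output=True, text=True
        ).stdout.strip()
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    if failures:
        print("Persistence check FAILED:\n" + "\n".join(failures))
        return 1
    print("All expected settings persisted across reboot.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```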
Also, the site was quite slow during Summer Game Fest, and I never got a chance to dive into it because of how much stuff was going on. I believe the cause of this was failing network equipment at our ISP.