Site Incident Report - June 8, 2024
Submitted by Fluxflashor
We've now fully recovered from an outage that began this morning Eastern time.
Timeline
9:57 PM (June 7)
- We received an alert that the website was down.
- Connectivity issues were intermittent.
- The problem did not last long enough for any proper debugging to occur.
5:41 AM (June 8)
- We received an alert that the website was down.
- Notifications did not fire as expected, so no one on our ops team was notified.
- We sleep.
- Based on our traffic data, the site came back online a few moments later.
11:41 AM
- We received an alert that the website was down.
- Investigations began.
- Private networking between our hardware was failing on connections to our primary database.
- Servers were rebooted.
12:09 PM
- Our ISP notified us that a hardware failure had occurred on the rack housing our private networking equipment.
- We opted not to fail over the site to our secondary nodes, as we expected the failure to be resolved quickly.
5:19 PM
- Our ISP indicated the hardware had been replaced.
- Private networking between our hardware was working again.
- Out of Games came back online at 50% capacity.
5:27 PM
- We started a reboot process on our hardware.
- One of our servers no longer responded to ping requests.
- Out of Games was online at 30% capacity.
5:49 PM
- Our ISP's datacenter team began a rescue operation on the downed server.
6:39 PM
- The bad CMOS battery in the downed server was replaced.
- Server booted successfully.
- Out of Games remained at 30% capacity.
9:56 PM
- Fixed an operating system issue that was preventing Out of Games services from communicating with each other.
- Docker Swarm was having a fit with our networking cards.
- We had previously run into this issue after updating our Docker software and thought it was resolved.
- Our changes had not persisted correctly across server reboots (a sketch of one way to make such settings stick follows below).
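As a rough illustration of making that kind of change stick, here's a minimal Python sketch that persists kernel modules and sysctl values via /etc/modules-load.d and /etc/sysctl.d (the standard systemd mechanisms), then applies them immediately. The specific modules and sysctl keys shown are common Docker Swarm overlay-networking requirements and are assumptions for illustration, not the exact settings from our incident.

```python
#!/usr/bin/env python3
"""Sketch: persist kernel settings that Docker Swarm overlay networking
commonly relies on, so they survive reboots. The module and sysctl names
below are illustrative assumptions -- adjust for your own hardware and
Docker version. Must run as root."""

import pathlib
import subprocess

# Hypothetical set of settings; verify against your own environment.
MODULES = ["overlay", "br_netfilter"]
SYSCTLS = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.ipv4.ip_forward": "1",
}

def persist_and_apply():
    # Persist module loading across reboots via /etc/modules-load.d.
    pathlib.Path("/etc/modules-load.d/docker-swarm.conf").write_text(
        "\n".join(MODULES) + "\n"
    )
    # Persist sysctl values across reboots via /etc/sysctl.d.
    pathlib.Path("/etc/sysctl.d/99-docker-swarm.conf").write_text(
        "".join(f"{key} = {value}\n" for key, value in SYSCTLS.items())
    )
    # Apply immediately for the current boot as well.
    for module in MODULES:
        subprocess.run(["modprobe", module], check=True)
    subprocess.run(["sysctl", "--system"], check=True)

if __name__ == "__main__":
    persist_and_apply()
```

On systemd-based distros, the files written above are picked up automatically on every subsequent boot, so the live state and the persisted state can't silently drift apart.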
What We've Learned
- Confirm that any changes you make to your operating systems actually persist across reboots (see the verification sketch after this list).
- Just one of those things everyone should be more careful about. Document everything.
- We should have configured our hardware to simply log CMOS errors instead of halting on them.
- This would have let our hardware come back online without any real issue.
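To make that first lesson concrete, here's a minimal sketch of a post-reboot verification check, assuming you run it from a systemd timer or cron job shortly after boot and alert on a nonzero exit. The expected sysctl values are illustrative assumptions matching the sketch above, not settings taken from our incident.

```python
#!/usr/bin/env python3
"""Sketch: a post-reboot check that expected kernel settings actually
took effect. Exits nonzero on any mismatch so a monitoring hook can
alert on it. The expected values below are illustrative only."""

import subprocess
import sys

# Hypothetical expected state; mirror whatever you persisted to disk.
EXPECTED_SYSCTLS = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.ipv4.ip_forward": "1",
}

def main() -> int:
    failures = []
    for key, want in EXPECTED_SYSCTLS.items():
        # If sysctl errors (e.g. module not loaded), stdout is empty
        # and the comparison below fails, which is what we want.
        got = subprocess.run(
            ["sysctl", "-n", key], capture_output=True, text=True
        ).stdout.strip()
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    if failures:
        print("Persistence check FAILED:\n" + "\n".join(failures))
        return 1
    print("All expected settings persisted across reboot.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```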
Also, the site was quite slow during Summer Game Fest, and I never got a chance to dive into it because of how much stuff was going on. I believe the cause of this was failing network equipment at our ISP.