Site Incident Report - June 8, 2024

Submitted 4 months ago by

We've now fully recovered from an outage that began this morning, Eastern Time.


Timeline

9:57 PM (June 7)

  • We received an alert that the website was down.
  • Connectivity issues were intermittent.
  • The problem did not last long enough for any proper debugging to occur.

5:41 AM (June 8)

  • We received an alert that the website was down.
  • Notifications did not fire as expected, so no one on our ops team was notified.
  • We sleep.
  • Based on our traffic data, the site came back online a few moments later.

11:41 AM

  • We received an alert that the website was down.
  • Investigations began.
  • Private networking between our servers was failing on connections to our primary database.
  • Servers were rebooted.

12:09 PM

  • Our ISP notified us of a hardware failure in the private networking equipment on one of our racks.
  • We opted not to fail over the site to our secondary nodes because we expected the failure to be resolved quickly.

5:19 PM

  • Our ISP indicated the hardware had been replaced.
  • Private networking between our hardware was working again.
  • Out of Games came online at 50% capacity.

5:27 PM

  • We started a reboot process on our hardware.
  • One of our servers no longer responded to ping requests.
  • Out of Games was online at 30% capacity.

5:49 PM

  • Our ISP's datacenter team started a rescue operation on the downed server.

6:39 PM

  • Bad CMOS battery replaced on downed server.
  • Server booted successfully.
  • Out of Games remained at 30% capacity.

9:56 PM

  • Fixed an operating system issue that was preventing Out of Games services from communicating with each other.
    • Docker Swarm was having a fit with our networking cards.
    • We had run into this issue before, after updating our Docker software, and thought it was resolved.
    • The earlier fix did not persist across server reboots.

What We've Learned

  • Whenever you make changes to your operating systems, confirm they will actually persist across reboots.
    • Just one of those things everyone should be more careful about. Document everything.
  • We should have disabled CMOS checks on our hardware and instead let it simply log the issue.
    • This would have enabled our hardware to come back online without any real issue.
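
As a rough illustration of the first lesson, here is a minimal Python sketch of a persistence check. It is not the tooling we actually use. It assumes your changes can be written as simple `key = value` settings, and the setting names (`net.example.offload`, `net.example.mtu`) are hypothetical placeholders. The idea is to compare each documented change against both the running state and the saved config, so a change that is live now but missing on disk gets flagged before the next reboot.

```python
"""Sketch: flag documented changes that are not applied or would be lost on reboot."""


def parse_settings(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_changes(expected, running, persisted):
    """Return one message per documented change that is missing somewhere.

    expected:  changes we documented (what we think we configured)
    running:   values currently live on the server
    persisted: values saved in the config that is loaded at boot
    """
    problems = []
    for key, want in sorted(expected.items()):
        if running.get(key) != want:
            problems.append(
                f"{key}: not applied (running={running.get(key)!r}, expected={want!r})"
            )
        if persisted.get(key) != want:
            problems.append(
                f"{key}: will not survive reboot "
                f"(persisted={persisted.get(key)!r}, expected={want!r})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical setting names, for illustration only.
    expected = parse_settings("net.example.offload = 0\nnet.example.mtu = 1450\n")
    running = {"net.example.offload": "0", "net.example.mtu": "1450"}
    persisted = parse_settings("# saved config\nnet.example.mtu = 1450\n")
    for problem in check_changes(expected, running, persisted):
        print(problem)
```

In this example both settings are live, but only one was written to the saved config, so the check reports that `net.example.offload` will not survive a reboot. How you collect the running and persisted values depends on your own systems. The sketch only covers the comparison.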

Also, the site was quite slow during Summer Game Fest, and I never got a chance to dive into it because of how much stuff was going on. I believe the cause of this was failing network equipment at our ISP.
