Bbabo NET

Science & Technology News

GitHub reported on the causes of the problems on November 27

On the evening of November 27, problems began with the git client, and the GitHub website was unavailable (error 500), including GitHub Pages.

GitHub has explained the causes of the global service outage in November. The incident affected all major GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks. GitHub's engineers were able to resolve the problem after 2 hours and 50 minutes. The cause of the incident was an unexpected failure while processing a schema migration on a large MySQL table.

Initially, the migration proceeded as planned. In its final phase, a rename was to be performed to move the updated table into its correct location.
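
GitHub has not published the exact commands involved, but the final step of an online schema migration is typically an atomic swap that moves the fully migrated shadow table into the original table's place. The sketch below illustrates that pattern with hypothetical table names and a mysql-connector-python connection; it is not GitHub's actual migration tooling.

```python
# Minimal sketch of the final "cut-over" step of an online schema migration.
# Table names and connection settings are hypothetical; this illustrates the
# general pattern, not GitHub's actual tooling.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",     # hypothetical host
    user="migrator",
    password="secret",
    database="example_production",  # hypothetical schema
)
cur = conn.cursor()

# RENAME TABLE is atomic in MySQL: the old table is moved aside and the
# migrated shadow table takes its place in a single operation.
cur.execute(
    "RENAME TABLE repositories TO _repositories_old, "
    "_repositories_new TO repositories"
)

cur.close()
conn.close()
```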

It was at this stage of the migration that a significant portion of the MySQL read replicas entered a semaphore deadlock (a semaphore is a combination of a lock and a thread counter).
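
The "lock plus thread counter" description refers to a counting semaphore: a fixed number of slots that threads must acquire before proceeding and are expected to release afterwards. The sketch below uses Python's standard threading.Semaphore purely as an illustration of how threads stall once all slots are held and never released; the replicas' actual stall happened inside MySQL itself, not in application code.

```python
# Illustrative only: a counting semaphore is a lock combined with a counter.
# Threads acquire a slot to proceed; if every slot is held and never released,
# later threads block forever -- a simplified picture of a semaphore deadlock.
import threading
import time

slots = threading.Semaphore(2)  # two available slots (the "counter")

def worker(name, release=True):
    print(f"{name}: waiting for a slot")
    slots.acquire()              # blocks while the counter is zero
    print(f"{name}: got a slot")
    time.sleep(0.1)
    if release:
        slots.release()          # frees a slot; a waiting thread can proceed
        print(f"{name}: released the slot")
    # if release is False, the slot is held forever

# Two workers grab both slots and never release them...
for i in range(2):
    threading.Thread(target=worker, args=(f"holder-{i}", False)).start()

time.sleep(0.5)

# ...so this worker blocks indefinitely on acquire().
threading.Thread(target=worker, args=("stuck-worker",), daemon=True).start()
time.sleep(1)
print("main: stuck-worker is still waiting; ending the demo")
```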

GitHub's MySQL clusters consist of a primary node for write traffic, multiple read replicas that serve production traffic, and multiple read replicas that serve internal traffic for backup and analytics purposes.
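
A minimal sketch of how such a topology is often modeled on the application side: writes go to the primary, production reads are spread over one replica pool, and internal backup/analytics reads are kept on a separate pool. All hostnames and the routing function below are hypothetical; GitHub has not published its routing code.

```python
# Hypothetical routing sketch for a cluster with one primary (writes),
# a pool of read replicas for production traffic, and a separate pool of
# read replicas for internal backup/analytics traffic.
import itertools

PRIMARY = "mysql-primary.internal"
PRODUCTION_REPLICAS = [
    "mysql-replica-1.internal",
    "mysql-replica-2.internal",
    "mysql-replica-3.internal",
]
INTERNAL_REPLICAS = ["mysql-analytics-1.internal", "mysql-backup-1.internal"]

_prod_cycle = itertools.cycle(PRODUCTION_REPLICAS)
_internal_cycle = itertools.cycle(INTERNAL_REPLICAS)

def pick_host(query_kind: str) -> str:
    """Route a query to the appropriate node by its kind."""
    if query_kind == "write":
        return PRIMARY                # all writes go to the primary
    if query_kind == "internal_read":
        return next(_internal_cycle)  # backups and analytics
    return next(_prod_cycle)          # default: production reads

print(pick_host("write"))          # mysql-primary.internal
print(pick_host("read"))           # round-robins over the production pool
print(pick_host("internal_read"))  # round-robins over the internal pool
```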

Ultimately, the GitHub MySQL read replicas that hit the deadlock entered a crash-recovery state, which increased the load on the remaining working read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas left in the system to handle current production requests, which significantly affected the availability of core GitHub services. GitHub's engineers moved all available healthy internal replicas over to serve production traffic. As it turned out, this was not enough to solve the problem. In addition, the read replicas serving production traffic began to behave abnormally: they would recover and function only briefly before failing again under the load.
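
The cascade is easy to see with some simple arithmetic: the same total read load spread over fewer healthy replicas raises the per-replica load, which can push further replicas past their capacity. The toy model below uses purely hypothetical numbers, not GitHub's actual traffic figures.

```python
# Toy model of a cascading replica failure: as replicas drop out, the same
# total read load is spread across fewer hosts, overloading the survivors.
# All numbers are hypothetical and only illustrate the dynamic.
TOTAL_READ_QPS = 100_000       # total production read traffic
PER_REPLICA_CAPACITY = 15_000  # what a single replica can sustain

healthy = 6                    # e.g. after several replicas hit the deadlock
while healthy > 0:
    per_replica_load = TOTAL_READ_QPS / healthy
    print(f"{healthy} healthy replicas -> {per_replica_load:,.0f} qps each")
    if per_replica_load <= PER_REPLICA_CAPACITY:
        print("load is sustainable; the cascade stops")
        break
    # Overloaded replicas fail (or flap in and out of recovery),
    # shrinking the pool further on the next iteration.
    healthy -= 1
else:
    print("no healthy replicas left: production reads cannot be served")
```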

GitHub decided to prioritize preserving data integrity while sacrificing the availability of the site and its services for a time. Engineers had to manually remove all production traffic from the affected replicas until the required table rename step could be completed successfully. Once all the replicas were restored, GitHub was able to bring them back into production and restore enough capacity to return the service to normal.

GitHub clarified that no data corruption was recorded during the failure, and write operations continued to work normally throughout the incident. To guard against such failures in the future, GitHub will prioritize functional partitioning of its clusters to improve resiliency during migrations. GitHub's engineers are still investigating the scenario and causes of this particular failure and have paused the migration until they can determine measures to protect against the problem.

In the summer of 2020, the main GitHub services were unavailable for more than 2 hours due to a failure in the main MySQL database cluster.

According to Nimble Industries, the number of problems and incidents at GitHub has increased since its acquisition by Microsoft in 2018 and continues to grow. In the two-year period before that, 89 incidents were recorded, while over the past 24 months there have already been 126, about 41% more. The number of minutes during which parts of the service were down is also growing: from 2016 to 2018 there were 6,110 such minutes, while from 2018 to 2020 GitHub was partially down for 12,074 minutes, nearly double. Incidents at GitHub are thus becoming more frequent and lasting longer, although the service's developers try to communicate problems and the progress of their resolution as transparently and openly as possible.
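
The growth figures are easy to verify from the numbers cited above:

```python
# Quick check of the growth figures cited above.
incidents_before, incidents_after = 89, 126
minutes_before, minutes_after = 6_110, 12_074

incident_growth = (incidents_after - incidents_before) / incidents_before * 100
downtime_growth = (minutes_after - minutes_before) / minutes_before * 100

print(f"Incidents: +{incident_growth:.1f}%")         # about +41.6%
print(f"Downtime minutes: +{downtime_growth:.1f}%")  # about +97.6%, nearly double
```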