Microsoft says recent cloud outage recovery was slower than hoped due to staff shortage

Tech giant Microsoft has confirmed in a new analysis report that recovery from the recent cloud outage at a data centre in Sydney, Australia, was slowed by an inadequate number of staff on site to deal with the outage, as well as failed automation.

Microsoft's cloud computing platform, Azure, as well as its other services, Microsoft 365 and Power Platform, were affected for over 24 hours, leaving users unable to access them.

Microsoft has had a run of outages recently. Earlier this summer, Azure, Outlook and OneDrive were affected by a denial-of-service attack from a Russian-associated group. Azure also ran into trouble a couple of months ago in Western Europe, when a storm in the Netherlands damaged a fibre connection between two of Microsoft's data centres.

Computer technology company Oracle and its cloud software subsidiary, NetSuite, faced the same problem in Sydney, likely because they share the data centre with Microsoft. The Bank of Queensland and Australian airline Jetstar also experienced issues, as customers were unable to access necessary functions.

Just three Microsoft staff were initially on site at the Sydney data centre when the outage occurred on the night of August 30th. The outage was caused by a utility voltage sag, which resulted from an electrical storm in the eastern region of Australia.

Sydney was hit by more than 20,000 lightning strikes over three hours on the night of the outage, which also left approximately 30,000 people in the city without power.

The voltage sag took the cooling units serving two of the data halls offline. With temperatures in the data centre rising rapidly despite Microsoft's efforts to restore cooling, the tech giant chose to power down the two data halls to protect the hardware from damage.

The cooling capacity consisted of five chillers, with a further two on standby. However, all five chillers failed to operate as intended, and only one of the backup chillers worked properly. The other backup went offline again shortly after being restarted automatically.

The five main chillers could not get going because the temperature of the chilled water loop had surpassed the threshold, and they could not then be restarted manually because there were too few staff on site. Microsoft had no choice but to shut down its servers to reduce thermal loads, as five chillers are meant to be in operation in the data halls, not just one.

In its analysis, Microsoft highlighted some of the causes of the delay in restoring operations and recovering from the outage. These included the three staff on duty being unable to restart the chillers quickly enough, as the task was simply too much for just three people.

The tech giant has since added a further four members of staff at the data centre, to ensure a greater understanding of the issues that occurred during the outage. Ways of eliminating the risks associated with the chillers will also need to be found so that the incident is not repeated.

Another reason for the delay in recovery was that the emergency operational procedures for restarting the chillers were not carried out quickly enough, because the incident had a significant blast radius. Microsoft plans to improve its current automation so that it is more resilient to voltage sag events.

In future, Microsoft will also look into how the chillers' load profiles can be prioritised, so that the chillers serving the highest loads are restarted first.
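As a rough illustration of load-based prioritisation, the sketch below sorts a set of chillers so that those serving the highest load come first in the restart queue. It is not Microsoft's procedure: the `Chiller` class, the `restart_order` function, the chiller names and the load figures are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Chiller:
    name: str        # hypothetical identifier
    load_kw: float   # hypothetical thermal load served, in kilowatts


def restart_order(chillers):
    """Return the chillers sorted so the highest load is restarted first."""
    return sorted(chillers, key=lambda c: c.load_kw, reverse=True)


# Made-up example values, not figures from Microsoft's report.
chillers = [
    Chiller("chiller-1", 400.0),
    Chiller("chiller-2", 950.0),
    Chiller("chiller-3", 620.0),
]

order = [c.name for c in restart_order(chillers)]
print(order)  # ['chiller-2', 'chiller-3', 'chiller-1']
```

With a small on-site team, an ordering like this would direct the first manual restarts to where they relieve the most heat.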
