We would like to thank the Myriad Cloud customers who were affected by Saturday’s platform issues for their understanding.
We want to be transparent about the cause of Saturday’s platform outage, and to set out the steps we have already taken, and will be taking, to ensure such an incident does not happen again.
The original remote access issue was caused by a platform certificate being revoked. Our team quickly prepared a replacement certificate and deployed it to the affected customers, which resolved the original issue. By unfortunate coincidence, however, these customers were all located on the same hardware node.
Roughly an hour later, however, Microsoft Azure began to incorrectly detect that components of Myriad Cloud were non-functional and required automated intervention. We confirmed that the affected node was functioning correctly, but was slow to start, stop and restart stations. We identified the root cause as a bottleneck in the network management system that occurs when too many management commands run in parallel.
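To illustrate the kind of bottleneck involved, the sketch below (in Python, using purely hypothetical names and timings rather than our actual management code) shows how a burst of parallel commands that all contend for a single management channel can make the last commands appear slow, even though nothing is actually broken:

```python
import threading
import time

# Hypothetical illustration only: a single shared management channel that
# can process one station command at a time. Names and timings are assumed.
management_lock = threading.Lock()
COMMAND_TIME_S = 0.5  # assumed time to process one start/stop/restart


def restart_station(station_id: str) -> None:
    # Every command contends for the same management channel, so a burst
    # of parallel requests queues up behind the lock.
    with management_lock:
        time.sleep(COMMAND_TIME_S)
        print(f"restarted {station_id}")


start = time.time()
threads = [
    threading.Thread(target=restart_station, args=(f"station-{i}",))
    for i in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Ten "parallel" restarts still take ~10 x COMMAND_TIME_S end to end, so
# the last stations look slow to an external health check.
print(f"total: {time.time() - start:.1f}s")
```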
The Azure platform runs a series of “health checks” designed to pre-empt hardware faults and other issues within the Azure cloud and to intervene automatically. Typically this intervention relocates radio stations to healthy hardware without any on-air impact, but on Saturday the system did not work as it should.
In the scenario we encountered, Azure’s auto-repair feature first restarts an affected system if it appears unhealthy for 5 minutes. If it still does not appear healthy, the second intervention is a re-deployment of that system (i.e. a re-install of the operating system). This re-deployment caused the primary issue, as it meant servers were brought up without Myriad Cloud’s software loaded onto them. Because this software takes more than 5 minutes to download and install automatically, the affected hardware nodes were repeatedly terminated and re-deployed before the installation could finish.
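As a simple illustration of why this became a loop, the sketch below simulates the escalation described above. The 5-minute grace period comes from the description; the install time is an assumed figure for illustration, and none of this is Azure’s actual code:

```python
# Hypothetical timeline simulation of the auto-repair escalation.
GRACE_PERIOD_MIN = 5   # node must look healthy within this window
INSTALL_TIME_MIN = 8   # assumed time to download/install the platform software


def simulate_auto_repair(max_cycles: int = 4) -> None:
    for cycle in range(1, max_cycles + 1):
        # After a re-deployment the node boots a bare OS and starts
        # installing the platform software from scratch.
        healthy_at = INSTALL_TIME_MIN
        if healthy_at <= GRACE_PERIOD_MIN:
            print(f"cycle {cycle}: healthy after {healthy_at} min, repair stops")
            return
        # Still unhealthy when the grace period expires, so auto-repair
        # escalates to another re-deployment and the cycle starts again.
        print(f"cycle {cycle}: still installing at {GRACE_PERIOD_MIN} min, node re-deployed")
    print("node never becomes healthy while auto-repair keeps intervening")


simulate_auto_repair()
```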
To resolve the issue, we had to intervene manually and re-deploy the affected stations in batches. We completed this work as quickly as possible, verifying that each batch had been restored before moving on to the next.
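In outline, the batched recovery worked along the lines of the sketch below: re-deploy a small group of stations, wait until every station in that group reports healthy, then move on. The function names here are hypothetical stand-ins rather than our actual tooling:

```python
import time
from typing import Iterable, List


def chunk(items: List[str], size: int) -> Iterable[List[str]]:
    # Split the full station list into fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def redeploy_station(station: str) -> None:
    print(f"re-deploying {station}")  # placeholder for the real re-deployment


def station_is_healthy(station: str) -> bool:
    return True  # placeholder: poll the real health endpoint here


def redeploy_in_batches(stations: List[str], batch_size: int = 5) -> None:
    for batch in chunk(stations, batch_size):
        for station in batch:
            redeploy_station(station)
        # Verify the whole batch has restored before starting the next one,
        # so any problem stays contained to a small group of stations.
        while not all(station_is_healthy(s) for s in batch):
            time.sleep(30)


redeploy_in_batches([f"station-{i}" for i in range(12)])
```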
We have raised this issue with Microsoft to ensure that their automated interventions do not trigger unnecessarily, as happened on Saturday, and we are in discussions with their platform engineers.
Additionally, we have developed, and are currently testing, a custom software solution that ensures such an incident cannot recur. It works by using a queueing mechanism to prevent all the stations on a server node from being restarted at the same time.
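A simplified sketch of the queueing idea is shown below; the names and the concurrency limit are illustrative assumptions, not the actual Myriad Cloud implementation. Restart requests are placed on a queue and handled by a small, fixed pool of workers, so a node is never asked to restart every station at once:

```python
import queue
import threading
import time

MAX_CONCURRENT_RESTARTS = 2  # assumed per-node limit, for illustration
restart_queue: "queue.Queue[str]" = queue.Queue()


def restart_worker() -> None:
    while True:
        station = restart_queue.get()
        print(f"restarting {station}")
        time.sleep(1)  # stand-in for the real restart work
        restart_queue.task_done()


# The worker pool size caps how many restarts run in parallel on the node.
for _ in range(MAX_CONCURRENT_RESTARTS):
    threading.Thread(target=restart_worker, daemon=True).start()

# Even if every station is asked to restart at once, the requests simply
# queue up and are processed a couple at a time.
for i in range(10):
    restart_queue.put(f"station-{i}")

restart_queue.join()
```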