Some Myriad Cloud Native stations are unable to log in via Myriad Anywhere, and restarted stations may not start fully

Incident Report for Broadcast Radio

Postmortem

We want to thank the Myriad Cloud customers who were affected by Saturday’s platform issues for their understanding.

We want to be transparent about the cause of Saturday’s platform outage, and to make clear the steps we have already taken, and will be taking, to ensure such an incident does not recur.

The original remote access issue was caused by a platform certificate being revoked. Our team quickly prepared a replacement certificate and deployed it to the affected customers, which resolved the original issue. By unfortunate coincidence, however, these customers were all located on the same hardware node.

Roughly an hour later, Microsoft Azure started to falsely detect that components of Myriad Cloud were non-functional and required automated intervention. We confirmed that the affected node was functioning correctly, but was slow to start, stop and restart stations. We identified the root cause as a bottleneck in the network management system, which appears when too many management commands run in parallel.

The Azure platform has a series of “health checks” which are designed to pre-empt, and intervene in, hardware faults and other issues within the Azure Cloud. When a failure occurs, this typically involves relocating radio stations without any on-air impact, but the system did not work as it should on Saturday.

In the scenario we encountered, the auto-repair feature in Azure first restarts an affected system if it appears unhealthy for 5 minutes. If the system still does not appear healthy, the second intervention is a re-deployment of that system (i.e. a re-install of the operating system). This re-deployment is what caused the primary issue, as it meant servers were deployed without Myriad Cloud’s software loaded onto them. Because this software takes more than 5 minutes to download and install automatically, the system repeatedly terminated and re-installed software on the affected hardware nodes.
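The loop described above can be illustrated with a small model. This is only a sketch, not Azure's actual logic: the function name, the 8-minute install time and the cap on attempts are hypothetical, and it assumes that each re-deployment restarts both the software install and the 5-minute health timer from scratch.

```python
from typing import Optional


def simulate_auto_repair(
    install_minutes: float,
    unhealthy_limit_minutes: float = 5,  # limit quoted in the report
    max_redeploys: int = 5,              # hypothetical cap so the model terminates
) -> Optional[int]:
    """Return the re-deployment attempt on which the node turns healthy.

    Returns None if the node never becomes healthy within max_redeploys,
    which models the repeated terminate/re-install cycle.
    """
    for attempt in range(1, max_redeploys + 1):
        # Assumption: a freshly re-deployed node stays "unhealthy" until
        # the software install completes, and the timer starts at zero.
        if install_minutes <= unhealthy_limit_minutes:
            return attempt
        # Install did not finish in time, so auto-repair re-deploys again.
    return None


print(simulate_auto_repair(install_minutes=8))  # hypothetical slow install
print(simulate_auto_repair(install_minutes=3))  # hypothetical fast install
```

In this model any install time above the health-check limit never converges, while an install that fits inside the limit recovers on the first attempt.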

To resolve the issue, we had to intervene manually and re-deploy the affected stations in batches. We completed this work as quickly as possible, verifying that each batch had been restored before moving on to the next.
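A batch-then-verify procedure of this kind might look like the sketch below. It is illustrative only: the report does not describe the tooling used, so `redeploy` and `is_healthy` are hypothetical callables supplied by the caller, and the batch size is arbitrary.

```python
from typing import Callable, List, Sequence


def redeploy_in_batches(
    stations: Sequence[str],
    batch_size: int,
    redeploy: Callable[[str], None],    # hypothetical: re-deploys one station
    is_healthy: Callable[[str], bool],  # hypothetical: checks one station
) -> List[str]:
    """Re-deploy stations batch by batch, verifying each batch before the next."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    restored: List[str] = []
    for start in range(0, len(stations), batch_size):
        batch = list(stations[start:start + batch_size])
        for station in batch:
            redeploy(station)
        # Stop before touching the next batch if anything failed to restore.
        failed = [s for s in batch if not is_healthy(s)]
        if failed:
            raise RuntimeError(f"stations not restored: {failed}")
        restored.extend(batch)
    return restored
```

Stopping at the first unhealthy batch keeps a failed re-deployment from spreading to stations that have not yet been touched.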

We have raised this issue with Microsoft, to ensure that their automated interventions do not trigger unnecessarily, as happened on Saturday, and are in discussions with their platform engineers.

Additionally, we have developed, and are currently testing, a custom software solution to ensure such an incident cannot recur. It uses a queueing mechanism to prevent all the stations on a server node from being restarted at the same time.
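As a rough illustration of the queueing idea, and not the actual solution under test, the sketch below caps how many restarts run at once and leaves the rest waiting in a queue. The `restart` callable and the `max_parallel` limit are hypothetical; the queue is the one built into Python's standard `ThreadPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def restart_queued(
    stations: Sequence[str],
    restart: Callable[[str], str],  # hypothetical: restarts one station
    max_parallel: int = 2,          # hypothetical concurrency limit
) -> List[str]:
    """Restart every station, with at most max_parallel restarts in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Stations beyond max_parallel wait in the executor's internal
        # queue until a worker becomes free; results keep input order.
        return list(pool.map(restart, stations))
```

Limiting concurrency in this way keeps the number of simultaneous management commands bounded, which is the condition the report identifies as the bottleneck.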

Posted Sep 10, 2025 - 10:07 BST

Resolved

The platform changes were finished around 40 minutes ago. We have been monitoring since then, and everything is now stable. We'll continue to monitor closely over the next few hours and will of course update here if there are any further developments.
Posted Sep 06, 2025 - 17:42 BST

Update

Whilst restarting some stations, we noticed an underlying issue with part of the Azure infrastructure and the way it is managing parts of our platform. We are making changes to this now and will be moving a small number of stations to a new section of the Azure infrastructure. These stations will see an additional brief outage while they are moved.

We're sorry that this is taking so long to fully resolve. The vast majority of stations have been kept on air during this period, although remote access has been interrupted for short periods for some stations.

We'll post another update shortly.
Posted Sep 06, 2025 - 16:44 BST

Update

We have nearly completed the restarts of affected stations. Only the last few stations remain, and these should be completed very soon.

We have prioritised active stations, so if you are currently only evaluating Myriad Cloud you may find that yours takes a little longer, but it will be completed soon.
Posted Sep 06, 2025 - 15:25 BST

Update

We are continuing to work on a fix for this issue.
Posted Sep 06, 2025 - 13:06 BST

Update

This fix is now fully rolled out across the entire cloud platform and we are restarting affected stations so they pick up the fix.
Posted Sep 06, 2025 - 13:06 BST

Identified

The issue has been identified and we are deploying the fix now.
Posted Sep 06, 2025 - 12:58 BST

Investigating

Some customers are having an issue with remote login to their Myriad Cloud Native (Azure-hosted) stations.

We have identified the root cause of the issue and are working to deploy a fix urgently.

We'll update here as soon as we have more information.
Posted Sep 06, 2025 - 12:41 BST
This incident affected: Myriad Cloud (Myriad Cloud Playout (Native/Azure based)).