On the morning of Friday 17 February, our internal monitoring systems alerted us to a large number of encoder disconnects in the UK South streaming region. We immediately raised a ticket with the data centre, set about investigating the source of the fault, and posted an update on the service status page to keep customers informed. By mid-afternoon on Saturday the vast majority of stations were streaming again, and the region has now recovered and is stable.
On Friday, as part of our disaster recovery procedures, we moved quickly to switch the UK South region to a different server group located in another building of the same data centre. Although the overall rate of disconnects dropped significantly, this did not fully resolve the issue.
Given the severity of the issue with this data centre, we are now working to move our few remaining streaming servers away from it over the next few weeks, and we will be in touch directly with each station with more information in due course.
After troubleshooting with the data centre, we discovered bursts of very high traffic and packet throughput that were temporarily congesting the core network, both the links connecting our provider's racks to the internet and those interconnecting the racks themselves. These bursts were causing intermittent packet loss.
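For readers interested in how this kind of fault shows up in measurements, the sketch below illustrates one simple way to sample packet loss between two hosts using the standard Linux `ping` utility. The hostname and probe count are placeholders rather than details of our network, and this is an illustration of the technique, not the exact tooling we used.

```python
import re
import subprocess

PEER = "peer.rack2.example.net"  # placeholder hostname for a host in another rack
PROBES = 100                     # probes per sampling window

def sample_loss(host: str, count: int = PROBES) -> float:
    """Send `count` pings and return the loss percentage ping reports."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", "-q", host],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    # Bursty congestion shows up as sampling windows that alternate between
    # near-zero loss and sudden high-loss spikes.
    print(f"{PEER}: {sample_loss(PEER):.1f}% packet loss over {PROBES} probes")
```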
Because this packet loss affected our internal network as well as external internet connectivity, it imposed a performance penalty on our database servers, which tried to remain in sync with one another but were unable to keep up.
Further analysis determined the root cause to be a Spanning Tree Protocol (STP) fault: one switch in the core network was sending STP reconfigure events at an abnormally high rate of roughly one per minute. Each of these events forces every device on the layer 2 network to re-map the best path to every other device, which can take several seconds and causes packet loss in itself; because of the large number of devices on the core network, each event was also followed immediately by a large burst of broadcast packets, and then by a burst of ARP requests as devices recovered and re-discovered one another.
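The switch logs themselves belong to our provider, but as a sketch of how this failure mode can be spotted early, the script below counts STP topology-change messages per minute in an exported syslog file. The file path, message wording, and timestamp format all vary by vendor and are assumptions here; a healthy network logs such events rarely, so a sustained rate of one per minute stands out immediately.

```python
import re
from collections import Counter
from datetime import datetime

LOG_PATH = "core-switch-syslog.txt"  # assumed export location
# Assumed message wording; the real text varies by switch vendor.
PATTERN = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8}).*STP.*TOPOLOGY[_ ]CHANGE",
                     re.IGNORECASE)

def tcn_counts_per_minute(path: str) -> Counter:
    """Bucket STP topology-change log lines by the minute they occurred in."""
    buckets: Counter = Counter()
    with open(path) as log:
        for line in log:
            m = PATTERN.match(line)
            if m:
                # Classic syslog timestamps omit the year; that is fine for
                # rate analysis within a single capture.
                ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
                buckets[ts.replace(second=0)] += 1
    return buckets

if __name__ == "__main__":
    for minute, count in sorted(tcn_counts_per_minute(LOG_PATH).items()):
        print(minute.strftime("%b %d %H:%M"), count)
```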
Separately, the same STP fault caused a brief outage of the Broadcast.Radio service on Sunday at roughly 19:20, when our database cluster could not replicate data as quickly as it arrived. The cluster fell behind the volume of write requests, producing a high error rate and ultimately causing every node in the cluster to exit simultaneously. Our team worked quickly to mitigate the fault, which required re-deploying the core Broadcast.Radio database, and total downtime was no more than 20 minutes. We apologise for any inconvenience this caused.
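To make that failure mode concrete, the toy model below simulates a replica applying writes more slowly than they arrive, which is roughly what sustained packet loss does to a synchronously replicating cluster. All rates and limits are illustrative rather than measurements from our system; the point is that once apply throughput drops below the write rate, the backlog grows without bound until a node gives up.

```python
import queue
import threading
import time

MAX_BACKLOG = 50        # pending writes tolerated before the node exits
WRITE_INTERVAL = 0.005  # a write arrives every 5 ms
APPLY_INTERVAL = 0.010  # the slowed replica applies one write every 10 ms

def replica(backlog: queue.Queue, stop: threading.Event) -> None:
    """Drain the backlog at a fixed rate, as a loss-degraded replica would."""
    while not stop.is_set():
        try:
            backlog.get(timeout=0.1)
        except queue.Empty:
            continue
        time.sleep(APPLY_INTERVAL)

if __name__ == "__main__":
    backlog: queue.Queue = queue.Queue()
    stop = threading.Event()
    threading.Thread(target=replica, args=(backlog, stop), daemon=True).start()
    for i in range(10_000):
        backlog.put(i)
        if backlog.qsize() > MAX_BACKLOG:
            # Apply rate < write rate, so the backlog only ever grows; the
            # node exits rather than fall indefinitely behind.
            print(f"backlog {backlog.qsize()} after {i} writes; node exiting")
            break
        time.sleep(WRITE_INTERVAL)
    stop.set()
```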
As of 20:00 on Sunday evening, after receiving further assurance from our network provider, we are no longer observing packet loss or streaming disconnects, and we tentatively consider the upstream networking issues resolved.
Nonetheless, given the far-reaching impact of this fault and the very long time the data centre took to locate and mitigate its root cause, we have made the decision to sunset the UK South streaming region. We understand that our customers place a great deal of trust in us to operate their streaming service, and we sincerely apologise that over the last several days we have not met that expectation.