On the morning of Friday 17 February, our internal monitoring systems alerted us to a large number of encoder disconnects in the UK South streaming region. We immediately raised a ticket with the data centre, set about investigating the source of the fault, and posted an update on the service status page to keep customers informed. By mid-afternoon on Saturday the vast majority of stations were streaming again, and the region has now recovered and is stable.
On Friday, as part of our disaster recovery procedures, we moved quickly to switch the UK South region to a different server group located in another building of the same data centre. Although the overall rate of disconnects dropped significantly, this did not fully resolve the issue.
Given the severity of the issue with this data centre, we are now working to move our few remaining streaming servers away from it over the next few weeks, and we will be in touch directly with each station with more information in due course.
After troubleshooting with the data centre, we discovered bursts of very high traffic and packet throughput that were temporarily congesting the core network, both the links connecting our provider's racks to the internet and those interconnecting the racks themselves. These bursts were causing intermittent packet loss.
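For readers interested in how this kind of fault shows up in measurements, the sketch below illustrates one simple way to sample packet loss between two hosts using the standard Linux `ping` utility. The hostname and probe count are placeholders rather than details of our network, and this is an illustration of the technique, not the exact tooling we used.

```python
import re
import subprocess

PEER = "peer.rack2.example.net"  # placeholder hostname for a host in another rack
PROBES = 100                     # probes per sampling window

def sample_loss(host: str, count: int = PROBES) -> float:
    """Send `count` pings and return the loss percentage ping reports."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", "-q", host],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    # Bursty congestion shows up as sampling windows that alternate between
    # near-zero loss and sudden high-loss spikes.
    print(f"{PEER}: {sample_loss(PEER):.1f}% packet loss over {PROBES} probes")
```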
Because this packet loss affected our internal network as well as external internet connectivity, it imposed a performance penalty on our database servers, which tried to remain in sync with one another but were unable to keep up.
Further analysis determined the root cause to be a Spanning Tree Protocol (STP) fault: one switch in the core network was sending STP reconfigure events at an abnormally high rate of roughly one per minute. Each of these events forces every device on the layer 2 network to re-map the best path to every other device, which can take several seconds and causes packet loss in itself; because of the large number of devices on the core network, each event was also followed immediately by a large burst of broadcast packets, and then by a burst of ARP requests as devices recovered and re-discovered one another.
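The switch logs themselves belong to our provider, but as a sketch of how this failure mode can be spotted early, the script below counts STP topology-change messages per minute in an exported syslog file. The file path, message wording, and timestamp format all vary by vendor and are assumptions here; a healthy network logs such events rarely, so a sustained rate of one per minute stands out immediately.

```python
import re
from collections import Counter
from datetime import datetime

LOG_PATH = "core-switch-syslog.txt"  # assumed export location
# Assumed message wording; the real text varies by switch vendor.
PATTERN = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8}).*STP.*TOPOLOGY[_ ]CHANGE",
                     re.IGNORECASE)

def tcn_counts_per_minute(path: str) -> Counter:
    """Bucket STP topology-change log lines by the minute they occurred in."""
    buckets: Counter = Counter()
    with open(path) as log:
        for line in log:
            m = PATTERN.match(line)
            if m:
                # Classic syslog timestamps omit the year; that is fine for
                # rate analysis within a single capture.
                ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S")
                buckets[ts.replace(second=0)] += 1
    return buckets

if __name__ == "__main__":
    for minute, count in sorted(tcn_counts_per_minute(LOG_PATH).items()):
        print(minute.strftime("%b %d %H:%M"), count)
```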
Separately, the same STP fault caused a brief outage of the Broadcast.Radio service on Sunday at roughly 19:20, when our database cluster could not replicate data as quickly as it arrived. The cluster fell behind the volume of write requests, producing a high error rate and ultimately causing every node in the cluster to exit simultaneously. Our team worked quickly to mitigate the fault, which required re-deploying the core Broadcast.Radio database, and total downtime was no more than 20 minutes. We apologise for any inconvenience this caused.
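To make that failure mode concrete, the toy model below simulates a replica applying writes more slowly than they arrive, which is roughly what sustained packet loss does to a synchronously replicating cluster. All rates and limits are illustrative rather than measurements from our system; the point is that once apply throughput drops below the write rate, the backlog grows without bound until a node gives up.

```python
import queue
import threading
import time

MAX_BACKLOG = 50        # pending writes tolerated before the node exits
WRITE_INTERVAL = 0.005  # a write arrives every 5 ms
APPLY_INTERVAL = 0.010  # the slowed replica applies one write every 10 ms

def replica(backlog: queue.Queue, stop: threading.Event) -> None:
    """Drain the backlog at a fixed rate, as a loss-degraded replica would."""
    while not stop.is_set():
        try:
            backlog.get(timeout=0.1)
        except queue.Empty:
            continue
        time.sleep(APPLY_INTERVAL)

if __name__ == "__main__":
    backlog: queue.Queue = queue.Queue()
    stop = threading.Event()
    threading.Thread(target=replica, args=(backlog, stop), daemon=True).start()
    for i in range(10_000):
        backlog.put(i)
        if backlog.qsize() > MAX_BACKLOG:
            # Apply rate < write rate, so the backlog only ever grows; the
            # node exits rather than fall indefinitely behind.
            print(f"backlog {backlog.qsize()} after {i} writes; node exiting")
            break
        time.sleep(WRITE_INTERVAL)
    stop.set()
```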
As of 20:00 on Sunday evening, after receiving further assurance from our network provider, we are no longer observing packet loss or streaming disconnects, and we tentatively consider the upstream networking issues resolved.
Nonetheless, given the far-reaching impact of this fault and the very long time the data centre took to locate and mitigate its root cause, we have made the decision to sunset the UK South streaming region. We understand that our customers place a great deal of trust in us to operate their streaming service, and we sincerely apologise that over the last several days we have not met that expectation.