Status of Heart Internet
We’ve let you down. Over the last few weeks, we’ve had some challenges in keeping your sites, and potentially your customers’ sites, up and running. In order to understand what happened, we want to share some background.
Earlier this year, we conducted an audit of the hardware in our datacenter. We found several pieces of networking equipment that still functioned properly, but we decided to update them so that we would have a more robust and reliable infrastructure in place.
The plan was to start re-designing the network and updating the necessary equipment to provide customers with the most reliable and fastest web hosting possible. The plan was in place, the network was being re-designed, and things were going great. Until they weren’t.
5th November incident
Instead of breaking down each and every incident that’s happened recently, we want to focus on what happened on 5th November. This example is the most extreme, and is indicative of what’s been happening recently.
At approximately 3am, we were notified of connection issues. Within 30 minutes, we had a team working to diagnose the issue. Due to the network design and the current switches in place, this wasn’t so easy.
Here’s why it was so difficult. Sometimes, a server will send traffic a switch can’t figure out how to handle. As the traffic keeps coming, the switch becomes overwhelmed. When that happens, the switch starts acting erratically, which then cascades to other switches. This cascading effect cripples the network. We have thousands of servers, more than 100 switches, and other hardware and networking gear, any of which could have been causing the issue. With everything effectively down, finding the right switch or server was like hunting for a needle in a haystack.
So at 3:30am, the engineers began troubleshooting, trying to identify the root cause of the issue. Unfortunately, this wasn’t as quick or as easy as it should have been. All of the networking equipment was acting erratically, making it difficult to identify the source of the problem.
At 12:44pm GMT, the team thought they had located the problematic switch. They had a switch on standby, so they changed it out, expecting to see the system start to improve. No such luck. That switch wasn’t the root of the problem.
They continued to work to identify the problem hardware. The team eventually identified three switches that were causing issues. Once these were identified, we quickly swapped them out, which brought everything back online.
So … what are we doing to fix it?
Our initial plan to update our networking equipment is going to continue, just at a faster pace. We’re going to do our best to minimize any downtime during the changes, and if our work is going to affect your services, we’ll write to you and tell you first. When the work is completed, the benefits for you will be enormous. In addition to a more robust network, we are building in advanced telemetry and diagnostic equipment, which will help us identify and isolate problems more quickly.
Another problem … lack of communication
The other area where we fell down as a company was the lack of communication with you. In the middle of an active incident, it’s always difficult to communicate effectively. As we’re trying to locate the issue and work on potential resolutions, we always want to provide the most accurate information, and sometimes getting all the facts together slows us down. That being said, we can and will do better.
We are looking at how to provide more timely and thorough updates. This blog post is the first step. We need to engage in a dialogue about what’s happening, why it’s happening, and what we’re doing to fix it. Expect more communications like this in the future.
In closing, the situation over the last few weeks has been unacceptable to us as a business and unacceptable for our customers. We apologize. We truly believe great things are ahead, and that the more robust and advanced network will once again provide the high level of service you have come to expect from us.
Sara Rego, Heart Internet Director