An explanation

First, and most importantly of all, I want to apologise for the recent disruptions to your services. Over the past week, we’ve been performing essential work with the underlying purpose of improving service for each and every one of you. Unfortunately, the process itself generated a series of unexpected, complex issues that ended up affecting you directly.

This was probably the hardest week in our history, both for ourselves, and more importantly, for you and your own customers. I’d like to thank you for your patience over this period, and to take the time to explain what happened, what went wrong, and what our plans are to make sure the service we provide to you is the best it can be in the future.


What happened?

Over a period of 8 days, we moved servers between our data centres. Our entire shared platform, mail, and our name servers literally moved 75 miles, from the colocation data centre in Derby that has been our home for 7 years, to our newly finished state-of-the-art data centre in Leeds. It was the single largest, most complex task we have ever performed. The sheer amount of data involved was truly enormous and changing constantly. It also meant that many staff needed to be moved off their essential day-to-day tasks in order to help.

We wanted to get the move done as quickly as possible, with a minimum of pain, without taking anything offline…although as you know, things didn’t go exactly to plan. With our new shared platform, we have redundancy at all critical points, so we could move clusters of servers during periods of lowest traffic, that is, between 10PM and 6AM, without interrupting service. Two interconnects were set up between the data centres, allowing us to have hardware at two disparate locations operating as if they were on the same internal and external networks.

The plan

The move was essentially planned as follows:

  1. Set up the infrastructure to take our platform in the new data centre, with the front-end load balancers ready to go.
  2. Move half of a server cluster a night, so each cluster is moved over two consecutive days.
  3. Up until the half-way point, all network requests come in via the old data centre, and if the destination server is in the new data centre, traffic goes down the interconnect, hits the server, goes back up the interconnect, and out.
  4. As we move more servers, traffic across the interconnect increases, up until the half-way point. At that point, we switch the BGP announcement over, so requests come in via the new data centre, and the interconnect is used when requests are directed at servers still in the old DC (there’s a rough sketch of this below).
  5. Repeat until complete. On the final day, switch over nameservers to hardware set up in the new data centre.
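
For the technically curious, here’s a rough sketch of the traffic flow described in steps 3 and 4. It’s a simplified, hypothetical model (the function and the names in it are purely illustrative, not our actual routing configuration), but it shows when a request had to cross the interconnect and when it didn’t:

# A simplified, hypothetical model of how a request was routed during the move.
# The function name, the labels, and the example calls are illustrative only.

def route_request(destination_moved: bool, bgp_switched: bool) -> list[str]:
    """Return the path a request takes, as a list of hops."""
    # Before the half-way point BGP still announces via the old DC;
    # after the switch, inbound traffic arrives at the new DC.
    entry_dc = "new DC" if bgp_switched else "old DC"
    destination_dc = "new DC" if destination_moved else "old DC"

    path = [f"internet -> {entry_dc}"]
    if entry_dc != destination_dc:
        # The request has to cross the interconnect to reach its server,
        # then come back the same way: this is the extra load on the tunnel.
        path.append(f"interconnect -> {destination_dc}")
        path.append("server -> interconnect -> back out")
    else:
        path.append("server -> back out")
    return path


# Day 3 of 8: BGP not yet switched, but this server has already moved.
print(route_request(destination_moved=True, bgp_switched=False))
# After the half-way switch: traffic enters the new DC, server still in the old DC.
print(route_request(destination_moved=False, bgp_switched=True))

The key point is the middle case: whenever the entry data centre and the destination server were on different sides, every request (and its response) crossed the interconnect, which is what drove the traffic on that link.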

The technical details

With lots of redundancy at each point, what could possibly go wrong? Well, you may have noticed that there’s a lot of reliance on the interconnect, and while these connections themselves were all redundant, problems arose from the capacity of this link.

Of course, in the tech world, if something is going to fail, it’s going to do so catastrophically, and fail it did.

We were moving from a simple, flat, ‘old-style’ network to a structured, redundant network (that is, a move to the Spanning Tree Protocol). What we didn’t realise until too late was that the old network was actually inconsistent, so parts of it had STP enabled and parts didn’t. This is essentially the worst state for a network to be in, particularly if you’re going to be making any changes.
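
If you’re wondering why a half-enabled STP is such a bad place to be, here’s a toy simulation of what happens to a single broadcast frame when redundant links aren’t blocked. The four-switch topology is made up for illustration (it isn’t our network), but the behaviour is the point:

# A toy flood simulation, purely to illustrate why redundant Layer 2 links
# without STP blocking are dangerous: one broadcast frame keeps multiplying
# as each switch re-floods it. The topology below is made up, not our network.

def flood(links, start, max_hops=5):
    """Print how many copies of one broadcast frame are in flight at each hop,
    assuming every switch floods a frame out of all ports except the ingress."""
    frames = [(start, None)]                  # (current switch, switch it arrived from)
    for hop in range(1, max_hops + 1):
        next_frames = []
        for switch, came_from in frames:
            for neighbour in links[switch]:
                if neighbour != came_from:    # don't send it straight back where it came from
                    next_frames.append((neighbour, switch))
        print(f"  hop {hop}: {len(next_frames)} frame(s) in flight")
        if not next_frames:
            break
        frames = next_frames

# Four switches with redundant links and no STP: the copies multiply every hop.
looped = {"A": ["B", "C", "D"], "B": ["A", "C", "D"],
          "C": ["A", "B", "D"], "D": ["A", "B", "C"]}
print("Redundant links, nothing blocked (broadcast storm):")
flood(looped, "A")

# With STP the redundant links are logically blocked, leaving a loop-free tree,
# so the flood reaches every switch once and then stops.
tree = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
print("Same switches with STP blocking the redundant links:")
flood(tree, "A")

Run it and you’ll see the frame count grow at every hop in the first case (a broadcast storm), and stop once every switch has been reached in the second. With STP enabled consistently, redundant links are blocked until they’re needed; with it enabled only in places, you get the loops without the protection.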

There were routing problems across the interconnect (or ‘tunnel’, as we typically refer to it), which caused a variety of different problems. These also pushed extra traffic across it, on top of a load that was already higher than expected, and the tunnel became saturated (i.e. we were trying to push more traffic through it than it had capacity for, and we’re already talking in the order of gigabits here). At that point, some traffic is dropped, some goes through slowly, and for Exchange, it causes problems with replication. DNS was also being served through the tunnel, which caused additional problems.
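
To put ‘saturated’ into back-of-the-envelope terms, here’s a tiny sketch. The numbers are made up for illustration (they aren’t measurements from our interconnect), but they show why pushing more traffic at a link than it can carry means queued and dropped traffic rather than everything just being a little slower:

# Back-of-the-envelope maths for a saturated link. The figures are invented
# for illustration; they are not measurements from our interconnect.

def saturation(offered_gbps: float, capacity_gbps: float) -> None:
    """Show roughly what happens when offered traffic meets link capacity."""
    if offered_gbps <= capacity_gbps:
        headroom = capacity_gbps - offered_gbps
        print(f"{offered_gbps} Gbps offered, {capacity_gbps} Gbps capacity: "
              f"no loss, {headroom:.1f} Gbps headroom")
    else:
        excess = offered_gbps - capacity_gbps
        dropped_pct = 100 * excess / offered_gbps
        print(f"{offered_gbps} Gbps offered, {capacity_gbps} Gbps capacity: "
              f"saturated, roughly {dropped_pct:.0f}% of traffic queued or dropped")

saturation(offered_gbps=3.0, capacity_gbps=4.0)   # comfortable
saturation(offered_gbps=5.5, capacity_gbps=4.0)   # what saturation looks like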

Now the real problem here was that once the job was started, we couldn’t stop, regroup, and rethink our strategy. There was traffic going through the tunnel, and it was patchy, but there wasn’t an alternative. The issues started rearing their head on day 3 of 8, and full resolution could only be reached once the migration was fully complete. A lot of work was done behind the scenes by our network team to ameliorate issues as they popped up, but it was mostly firefighting.

Throughout the process, our system administration team and support guys really pulled out all the stops, often putting in 16+ hour shifts, and pitching in even when they were scheduled to be off. All of our staff really care about Heart Internet, and providing the best service possible is of paramount importance. When we get it wrong, it feels like a personal failure.


What we learned & what we’re changing as a result

The most frequent questions we were asked were ‘What’s happening? Why are there no details?’ This is the part that we’re now focusing on improving the most. When we’re in the midst of solving a technical problem, there is a strong desire to fix the issue first and communicate once it’s fixed. Typically, this is an efficient approach for us, because many issues are resolved within a few minutes. Our sysadmin team have the most challenging technical duties here, not only because of the technology and systems in play and the skills and knowledge required, but because of the intense pressure to reach a fast resolution. They are the end of a chain that starts with your customers and visitors contacting you, continues with you contacting our support team, and ends with support requesting updates from our sysadmins. There is a difficult balancing act between providing information and having all hands on deck to fix matters.

We were also asked why we didn’t provide more information upfront. In hindsight, this is definitely something we should have done, but we honestly expected it to cause very few, if any, problems. Our planned status page message therefore didn’t provide enough insight or cover the issues accurately enough.

I think the biggest lesson we learned, other than ‘hope for the best, prepare for the worst’, was to be open, have a stronger process for internal and external communication, and explain everything to you in advance, with greater transparency and more detail where possible.

We’ve always been proactive in making improvements and taking your feedback on board at every stage. We’ve had a lot of great suggestions, and hindsight has also shown that there are significant changes we can make moving forwards. Even though this was a once-in-a-lifetime move, we’ve seen that we can be more transparent and keep you more informed about anything we have planned, regardless of how small it is or how few people are affected.

At the moment, we’re collating ideas from every department to come up with a strategy to improve. There’s a significant focus on enhancing communication (and helping you communicate with your customers more effectively as a result). This includes a lengthy discussion of your feedback and comments regarding the situation, as well as ideas and plans for putting them into practice.

Whilst we get those finalised, we’ve decided on some core points to establish a better process starting now:

Advance communication – If we think you’re going to be affected by maintenance work, we’ll email you, at least a week in advance where possible, and tell you what we expect the impact to be. You won’t have to rely on manually checking an external status page every day.

Transparency – If things go wrong, we’ll give you more detail, and an ETA where we can. ETAs are notoriously difficult for a whole range of reasons that would need a follow-up blog post to explain, but we’ll keep you informed at each stage where we can.

I also have plans for a Heart Internet-specific status page (that is, a non-whitelabelled one) so we can go into a little more detail about issues, especially where they affect Heart Internet-specific systems like Hostpay or the Reseller Control Centre. This might take a little more time, but it’s very important to me that we get this right.

I’m sincerely sorry that we got it wrong, and for the knock-on impact it had on each and every one of you affected. Hindsight is a wonderful thing that we can only learn from, and we’ve collated and digested every single piece of feedback from you in order to ensure we provide the service and information you expect moving forward.


The good news

We wouldn’t perform such a large scale migration if there weren’t some serious benefits. We’ve moved to a fully redundant data centre under our own management. We’re now under the safeguard of a networks team completely dedicated to hosting, and their only job is to make sure that the service we provide is uninterrupted; they can dedicate their time just to us (and therefore you). We no longer have to deal with two distinct teams and network designs when we have issues, so there’s less room for error, and most importantly, you’re entirely in our hands now. If something goes wrong, we’re responsible, and likewise, if everything goes right, that’s us too.


Just to add

Heart Internet means a great deal to everyone who works here, especially me. I’ve dedicated the last 6 years of my life to improving it and putting our customers at the forefront of all of our decisions. I started in first line tech support straight out of university, and I’ve worked my way up the business from the bottom. From my time on the front lines, I know the frustration you face when we get something wrong, or we fail to fix a problem you’re having, because you feel powerless and it has a widespread impact. To those of you who’ve recently joined us: we know you’ve had the worst possible first impression, and we can only apologise and ask you to give us a second chance. It’s the total opposite of our usual service that so many people rely on.

I want to reinforce that we’re dedicated to empowering you and giving you the confidence in us to concentrate on running your business. Every company encounters issues at some point, but no one is more dedicated to improving and fixing our mistakes than we are.

We pride ourselves on having the most knowledgeable and well-regarded customers in the industry, and we’re deeply saddened to have let you down with this when you deserve the best possible experience at all times. We don’t take anything for granted, and we hope we can regain your trust and respect to make it a mutual partnership once more.

I can only apologise again and state that we are committed to making a series of improvements that will directly benefit you and ensure we meet your needs in every way possible. If you have any questions that I haven’t covered, please email me directly at [email protected].
