Update on November 5th incident

Status of Heart Internet

We’ve let you down. Over the last few weeks, we’ve had some challenges in keeping your sites, and potentially your customers’ sites, up and running. In order to understand what happened, we want to share some background.

Earlier this year, we conducted an audit of the hardware in our datacenter. We found several pieces of networking equipment that still functioned properly but that we decided to update, so that we would have a more robust and reliable infrastructure in place.

The plan was to start re-designing the network and updating the necessary equipment to provide customers with the most reliable and fastest web hosting possible. The plan was in place, the network was being re-designed, and things were going great. Until they weren’t.

November 5th incident

Instead of breaking down each and every incident that’s happened recently, we want to focus on what happened on 5th November. This example is the most extreme, and is indicative of what’s been happening recently.

At approximately 3am, we were notified of connection issues. Within 30 minutes, we had a team working to diagnose the issue. Due to the network design and the current switches in place, this wasn’t so easy.

Here’s why it was so difficult. Sometimes, a server will send traffic a switch can’t figure out how to handle. As the traffic keeps coming, the switch becomes overwhelmed. When that happens, the switch starts acting erratically, which then cascades to other switches. This cascading effect cripples the network. There are thousands of servers, more than 100 switches as well as other hardware and networking gear that could have been causing the issue. With everything effectively down, finding the proper switch or server was like hunting for a needle in a haystack.

So at 3:30am, the engineers began troubleshooting, trying to identify the root cause of the issue. Unfortunately, this wasn’t as quick or as easy as it should have been. All of the networking equipment was acting erratically, making it difficult to identify the source of the problem.

At 12:44pm GMT, the team thought they had located the problematic switch. They had a switch on standby, so they changed it out, expecting to see the system start to improve. No such luck. That switch wasn’t the root of the problem.

They continued to work to identify the problem hardware. The team eventually identified three switches that were causing issues. Once they were identified, we were quickly able to swap out the affected switches, which brought everything back online.

So … what are we doing to fix it?

Our initial plan to update our networking equipment is going to continue, just at a faster pace. We’re going to do our best to minimize any downtime during the changes, and if our work is going to cause any impact to your services, we’ll write to you and tell you first. When completed, the benefits will be enormous for you. In addition to a more robust network, we are building in advanced telemetry and diagnostic equipment, which will help us identify and isolate problems more quickly.
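As a rough illustration only (a hypothetical sketch, not a description of our actual tooling): if every switch reported timestamped error events to a central collector, the switches that started misbehaving first would be the natural place to begin looking, rather than searching all of them once everything is acting erratically. The event format, the `first_failures` helper, and the switch names below are all invented for the example.

```python
from datetime import datetime
from typing import NamedTuple


class SwitchEvent(NamedTuple):
    """One hypothetical telemetry record: a switch reporting an error."""
    switch_id: str
    timestamp: datetime


def first_failures(events: list[SwitchEvent], limit: int = 3) -> list[str]:
    """Return up to `limit` switch IDs, ordered by their earliest error event."""
    earliest: dict[str, datetime] = {}
    for event in events:
        seen = earliest.get(event.switch_id)
        # Keep only the first time each switch reported a problem.
        if seen is None or event.timestamp < seen:
            earliest[event.switch_id] = event.timestamp
    ranked = sorted(earliest, key=lambda switch_id: earliest[switch_id])
    return ranked[:limit]


# Invented sample data: the event order is deliberately shuffled.
events = [
    SwitchEvent("switch-c", datetime(2020, 1, 1, 3, 4)),
    SwitchEvent("switch-a", datetime(2020, 1, 1, 2, 58)),
    SwitchEvent("switch-b", datetime(2020, 1, 1, 3, 1)),
    SwitchEvent("switch-a", datetime(2020, 1, 1, 3, 6)),
    SwitchEvent("switch-d", datetime(2020, 1, 1, 3, 9)),
]

print(first_failures(events))
```

The earliest alarm does not prove a root cause, since a healthy switch can report errors caused by a neighbour, but it narrows the search to a handful of candidates instead of more than 100.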

Another problem … lack of communication

The other area where we fell down as a company was the lack of communication with you. In the middle of an active incident, it’s always difficult to communicate effectively. As we’re trying to locate the issue and work on potential resolutions, we always want to provide the most accurate information, and sometimes getting all the facts together slows us down. That being said, we can and will do better.

We are looking at how to provide more timely and thorough updates. This blog is the first step. We need to engage in a dialogue about what’s happening, why it’s happening and what we’re doing to fix it. Expect more communications like this in the future.

In closing, the situation over the last few weeks has been unacceptable to us as a business and unacceptable for our customers. We apologize. We truly believe great things are ahead and the more robust and advanced network will once again provide the high level of service you have come to expect from us.

Thank you,

Sara Rego, Heart Internet Director

Comments

  • Robert Hawkes

    09/11/2018

    Why, if a switch becomes “Overwhelmed”, does it need to be swapped out? Can you not just restart it or does the hardware/software get damaged beyond repair – in which case it is badly designed in the first place?

     
    • Kate Bolin

      13/11/2018

      Hi Robert,

      We swap out the switch so that service can resume faster – rather than you waiting on us to repair the problem, you have a brand new switch working smoothly. We can then repair the switch, and keep it in case of another emergency.

       
  • steve

    09/11/2018

    You would do well to read your own 2014 blog “An explanation”, where exactly the same promises are made. You have learnt nothing in 4 years.

     
  • Davide

    09/11/2018

    Thank you for your explanation but I have had 5 clients that have asked to leave…please tell me what to say to them?

    I have been with you and supported you always, but now it’s getting harder

    Many Thanks
    Davide

     
    • Kate Bolin

      13/11/2018

      Hi Davide,
      Thank you for your continuing support. We know this has been a frustrating time for our Resellers, and we apologise for the effect it is having on your customers. When speaking to them, please give them our apologies and let them know we are resolving our networking issues to prevent this from happening again.

       
  • Ray Spex

    10/11/2018

    Too little, too late. Too many outages, too little investment, not keeping up with industry trends, not leading.

    Have already moved 45 web sites to another better and cheaper host. Expect my cancellation very soon. Bye bye.

     
  • Neil Smith

    10/11/2018

    Sara thank you for this.
    Unfortunately it doesn’t really provide me with much comfort or re-assurance and I’ll tell you why. If you click on the links to the related articles below your post, the same story is repeated again and again after every major incident. “Equipment failure, we are upgrading, our communication was poor – we will improve!” – that is literally a summary of what has been said in every post following a failure for the last 2 years.
    I have no doubt your team worked hard to resolve the problem on the 5th – I’ve been in this business for 20 years and I know stuff breaks regardless of how resilient you make your network. The issue I have is the regularity at which stuff breaks at Heart and the lack of response to that. There has been a downward trend for 2 years and it’s getting worse.
    Papering over the previous failures in the last few weeks doesn’t help restore my faith either – what did happen there?
    Sara in business if you continue to do the same thing, nothing will change and from what you have said above I don’t think you are planning on doing anything different. There is little evidence of any passion which the creators of Heart put in and the technical explanation of what happened is poor.
    I believe there are key issues in the management at Heart that need to be addressed and it looks like it all seems to go back to the changes that must have been made following the sale to GoDaddy. My experiences with GoDaddy are very poor – they have unexplained outages, communication problems, make transferring domains away from them difficult and are terribly expensive. The marketing is great – the service is shocking. Heart has become part of that and it seems the problems at GoDaddy are infectious.
    After 8 years as a Heart re-seller I’m trying to find a reason to give you another chance Sara and I am really struggling.

     
    • Kate Bolin

      13/11/2018

      Hi Neil,

      We’re sorry to hear this. It’s true that we have made promises in the past to improve communication, and we appreciate that we haven’t always followed through on them. This has changed in the past year, and we are taking proactive steps to ensure that there are procedures in place for fast and effective communication.

      We know we have failed you in the past. But we are working on making this better.

       
  • Chris

    12/11/2018

    Yes, but this isn’t a one off, is it? For months your service goes down every few weeks for about half a day, with no explanation. It comes back later in the day, so I just put up with it as it’s a hassle to swap providers. This was the worst event in a long line.
    Customers aren’t striving for a more robust network, like you say, they’re looking for a network – plain and simple, that actually works. That’s a pretty basic requirement of a company selling internet services.

     
    • Kate Bolin

      13/11/2018

      Hi Chris,
      You’re right – and we do want to provide a network that works. Our work to make this network more robust will give you what you need, and we are sorry that we were not able to fix this situation before the outage happened.

       
  • Chris

    12/11/2018

    Have you disabled comments on this?

     
    • Kate Bolin

      13/11/2018

      No – we have our comments moderated due to spam. It took us a bit of time to get through the spam and then reply.

       
  • Long Standing Customer

    12/11/2018

    Sorry. Too little too late for me. Migration is taking place. Your apologies feel forced by the roaring dissent aimed at your decline in service over the past few years, made most apparent by the recent 18-hour outage. Not having a comms plan for an outage of this scale is almost as ridiculous as the amount of time you have cost me and my busy customers, the money I’m having to offer in compensation, and now the time it’ll take to migrate all my sites and services to another provider.

    If you consider the scale of earnings lost (99.9% uptime guaranteed?) from over 18 hours of downtime, then I guess you could look to offer some kind of compensation, but it’s been clear that HI consider that a step too far. So a blog post will do, eh?

     
  • Joe

    13/11/2018

    You’re right, your communication was, and always has been, well below industry standard, never mind good.

    You have customers who are highly skilled technically, we build websites for a living, and many of us have been doing it and dealing with hosting companies for decades.

    Single line updates saying things like “We’re working on the problem” just don’t cut it.
    In this day and age, when communication is frictionless and real time, your customers expect actual explanations and actual details about what the problem is and what you’re doing to fix it.

    Aiming your blog posts and updates at your least technically skilled customers alienates the people who are your bread and butter, people like us, who host hundreds of sites on your infrastructure. It makes us think that your team is as technically skilled as the language in your updates.
    If, incidentally, you’d like to see some best practice when it comes to post incident analysis, you could learn a lot from this https://blog.github.com/2018-10-30-oct21-post-incident-analysis/

    Although this blog post is appreciated, it doesn’t get anywhere near reassuring me that we’re not going to have another incident where our team have to spend an entire day telling irate customers that we have no information for them, instead of doing what we’re paid to do.

    We’re software people, we rely on network people like you, and you’ve let us down.

     
  • 13/11/2018

    13 November 2018. Our VPS has been down all day today. We were first told it’s our server, so we spent hours trying to figure out what it could be, but it was challenging as our connectivity to the server was patchy. As I write this I now have no access at all to the server. Support have now said the network team is looking into it. heartstatus.uk has not been updated and I am still not sure what the problem is. Meanwhile I have customers wondering why I am not moving them. If I knew our server was indeed the problem I would move them and do the DNS work, but I am stuck in a situation where I have no concrete information from Heart on whether the issue is the server or the network. If the issue is the network, I would like to see the heartstatus.uk page updated regularly. I am currently wasting my customers’ time and money by waiting.

     
  • Andy Gosling

    13/11/2018

    I, and my clients, lived through the debacle that was the “data centre move”. We lived through the time when a UPS engineer managed to switch the power off to the data centre. Along with a myriad of other more minor disasters.

    I don’t get any of this with the other hosts I deal with.

    I’ve been with Heart Internet since 2005…. I don’t see myself lasting much longer though. It’s not really getting any better, is it, Heart? I’ve long heard your promises of better communication and more reliable systems – who else went to the one and only reseller conference in London which was just after another hosting disaster? Nothing has changed since those empty promises.

     
  • Jazz

    14/11/2018

    To be honest, I’m getting déjà vu reading this. Every time you apologise for the problem, the total lack of communication, and promise to improve both. As people who work in this industry, we know that this isn’t acceptable in any SLA situation. You make no attempt to offer compensation via business or financially, only some after-the-fact comms to basically apologise and hope no one actually has a problem, despite the fact that most resellers have customers they actually have to deal with correctly, unlike yourselves, and that those customers may not want to leave either. Your business model clearly has the risk factor set to 0 that any resellers would leave, even if you force the resellers’ customers to leave the reseller. I think this is part of your outlook being wrong fundamentally, and if GoDaddy bought you before it all started then that explains a lot.

     
  • 15/11/2018

    The blog post from the director is welcome. However, because of Heart’s inactions, I have had to spend a whole weekend hurriedly moving 40-odd exchange accounts away from Heart and reconfiguring the users’ laptops and phones. I did this to avoid losing them as a customer. I had to spend £400 on migration licences.

    What gets me is the attitude of your frankly impenetrable customer service. Over the summer, we had several outages of your exchange platform. I could get absolutely no explanation of what you’ve been doing and how we might be affected. Just scripted, content-free responses. Then last week with the 18-hour outage, your customer service people apparently had ‘no access’ to the engineer teams, and were unable to offer even the name of a manager. I called Sales, and they were equally clueless.

    You need to empower your staff with the ability to make decisions.

    And just because all you were able to offer was 3 months of reseller service as compensation, without an explanation as to how you were able to reach that figure, we are going to be removing all our services from Heart, and have endeavoured to persuade as many people as possible to avoid your company. What you have to realise is that it’s not just the unacceptable outage, it was the obfuscation and downright lies that seemed to surround it.

    Oh, and we still haven’t had an explanation of the Exchange outages.

     
  • Mark R

    16/11/2018

    Is there not any sort of error log/diagnostic system that you can implement to identify which switch fails first?

     
  • 18/11/2018

    The fact that you have offered ZERO compensation to your loyal customers (many of whom, like myself, work as developers and/or network/hardware specialists) for your MANY server outages and failures in service is absolutely abhorrent – particularly the most recent.

    I have requested compensation and received NONE – so much for owning your mistakes and saying sorry to a loyal customer since 2005 onwards.

    Your apologies and reassurances mean absolutely nothing because this WILL happen again (if previous patterns are any indicators).

    The fact that many of your customers are having to pay out of their own pocket to compensate clients for lost sales/unavailable services is DISGUSTING beyond belief.

    Surely there has to be some legal mechanism where Heart Internet can be forced to compensate their customers by an ombudsman or similar?

     
  • Pip

    19/11/2018

    It might seem simplistic, but can you not turn it off and on again to see what’s up, when you know you’re looking for a faulty switch? I mean, surely there must be some kind of diagnostic software to identify the first thing that goes wrong.
    However, this is too big an issue after too many other issues so I’m gone after 15 years of saying how good you are.

    I just need to migrate the sites and sort out the DNS forwarding until I can afford to migrate all the domain names.

    Sorry. Since GoDaddy took over there’s been a MASSIVE drop in service level. I don’t know if it’s linked, but it’s the truth.

     
  • Mart

    19/11/2018

    Words are easy, but it’s action that counts. I don’t see anything that suggests the bad direction Heart has moved in is going to be reversed. We’ve been with Heart over 10 years and for the first 8 of those years we were generally happy, but things have been going downhill for a while.
    This all seems to point at bad management at the higher levels of the company. Bean counters trying to save some money and thinking it won’t impact services, or if it does, customers won’t walk because it’s too much work to do so. I don’t see any announcements of management casualties after the latest meltdown, and with the same people in charge, it seems the same course will continue. We’ve taken the decision to move away – the outages have caused lasting damage to our business, we’ll lose customers because of it, and telling them that our host says things are going to change isn’t going to keep them.

     
  • Jonathan Beadle

    20/11/2018

    Enough is enough! We have used Heart Internet for years. Recently you put your prices up to charge per package per month, costing our company thousands. We are sick of the technical problems and increasing costs.

    We have made the decision to move our 3300 domain names and hosting packages away from Heart Internet to a new provider.

    Good riddance.

     
  • Jason m

    21/11/2018

    It’s great that you’ve taken the time to provide some level of explanation but sadly this falls so very short of being acceptable.
    Lack of communication is one of the key issues here and as others have rightly pointed out, please review other incident reports where communication to resellers was promised to be paramount, yet, each time there is a problem the same old scripted responses are pushed out. Like other resellers, we have lost clients (actually lost them as they have no confidence in our choices due to the poor hosting options we have selected for them!)
    We raised tickets regarding this problem on the 5th of November yet the response we got back made no reference to the ‘known’ issue so either something was amiss within the support centre or someone is lying!
    Is there no connection between the outages suffered in October and this incident? We’re still waiting on some level of compensation by the way, and the usual ‘we don’t have an SLA’ rubbish no longer cuts it. We have lost revenue due to this outage, lost ongoing custom, and I know for a fact that emails that were sent to me for new work did NOT arrive (they weren’t queued up as we were told), so all in all it looks like we’ve lost out on a few £thousands here.
    As a result, we’re migrating clients away, which I can see through Twitter that others are doing too. This could explain why the service appears to have improved: less strain on the network due to the mass exodus.

    What is really annoying though is that not only do we have to keep creating support tickets to make anyone aware of the problems we’re facing (you can’t even implement something as basic as UpTimeRobot to monitor a small selection of sites on the servers) but we now have to pay even more money for something that is actually worse than it used to be.

     
  • Rob

    21/11/2018

    We have moved away following this latest incident – comms were poor and service terrible, I tried to phone up and only person available was in sales who didn’t know what was going on, you can imagine how happy this made me. What compensation are you offering to customers given how badly you feel about this?

     
  • Will

    26/11/2018

    …and it’s bye bye from myself after 10 years as a Reseller. The service has been declining over the last 3 – 4 years, and with ever increasing speed since the disaster that is GoDaddy took over HI. I am just finalising the move of the last few clients over to the new hosting and then HI can have my Bye Bye ticket and account termination.

     
  • Jim

    03/12/2018

    I just phoned to check on the refund policy for reseller hosting. I pay a year in advance, as many will do. There is no refund policy (which I suppose I should have been aware of, but wasn’t). There is simply no refund if one wishes to terminate a reseller account early.

    This adds to my disappointment with HEARTINTERNET. I have been with them since the start, and like others who have posted here have sensed a lack of passion behind the service being provided. The outage I can understand (bad stuff happens) but the lack of communications I cannot understand, the lack of any compensation I cannot understand, and now I realise there is no refund if you have paid a year in advance.

    In my opinion, a company that trades on reliability should be showing its own self-confidence by offering a refund on termination of service, even if it is linked with an early-leaving penalty fee (for example the difference between monthly and yearly payments). HEARTINTERNET has nothing here…

    I am very disappointed

     
    • Darren Busby

      04/12/2018

      Compensation? No. A discount on monthly fees going forward? No. And you want us to stay as loyal customers? OBVIOUSLY NOT. Sort something out. I paid out four-figure amounts to my customers because their business suffered. So what about MY business? I’m trying to sell something that’s unreliable and actually costing ME money… When are WE going to see some recompense? Or shall I just follow the masses and migrate……….

       