Important update about Heart Internet's downtime

Dear customers,

As you may be aware, we recently suffered the worst single incident in our history due to a power outage at our Leeds data centre on Wednesday afternoon.

Emergency maintenance work was being carried out on the load transfer module, which feeds power from our external energy supplies to the data centre hall that holds the majority of our servers. The data centre has two dual-feed uninterruptible power supplies, both backed by diesel generators in case of National Grid outages.

Unfortunately, a safety mechanism within the device triggered incorrectly and resulted in a power outage of less than 9 minutes. This caused approximately 1,500 physical servers to be hard rebooted. Beyond a fire, this is the worst possible event that a hosting company can face. A full post mortem is currently being carried out to determine how power was lost on both supplies despite our working with the external engineer from the hardware manufacturer.

What happens when servers hard reboot?

Web servers and virtual servers typically perform database transactions at a very high rate, meaning that the risk of database or file system corruption is quite high when a hard reboot occurs.

Following the restoration of power, our first priority was to get our primary infrastructure boxes back online, then our managed and unmanaged platforms. Our managed platforms are built to be resilient, so although we lost a number of servers in the reboot, the majority of our platforms came up cleanly. We faced some issues with our Premium Hosting load balancers, which needed repairing, so some customer sites were off for longer than we would have hoped. We are adding additional redundant load balancers and modifying the failover procedure over the next 7 days as an extra precaution for us and our customers.

On our shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs and the NAS drives themselves contain 8+ disk RAID 10 arrays. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.

In a single case, the cluster containing web 75-79 (representing just under 2% of our entire shared platform), both NAS drives failed to come back up. Following our disaster recovery procedure, we commenced attempts to restore the drives, whilst simultaneously building new NAS drives should they be required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so we prioritised attempts to repair the file system.

Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, and we were forced to proceed to the next rung of our disaster recovery procedure. The further we step into the disaster recovery process, the greater the recovery time, and here we were looking at a total restore of 4TB from on-site backups to new NAS drives. (For your information, the steps following that are to restore from offsite backup and, finally, from tape backup, although we did not need to enact these steps.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. We restored sites to the new NAS drives alphabetically in a read-only state, and the restoration was completed late on Sunday afternoon.
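To make the escalation order concrete, here is a minimal sketch in Python of a recovery ladder that is walked from the fastest step to the slowest. It is an illustration only, not Heart Internet's actual tooling: the step names paraphrase the steps described above, and the function name and the outcome table are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Recovery steps in the order described above, fastest first.
RECOVERY_LADDER = (
    "repair existing NAS file system",
    "restore from on-site backup",
    "restore from offsite backup",
    "restore from tape backup",
)


def first_successful_step(
    steps: Sequence[str], attempt: Callable[[str], bool]
) -> Optional[str]:
    """Try each step in order and return the first one that succeeds."""
    for step in steps:
        if attempt(step):
            return step
    return None  # every step failed


# Hypothetical outcomes mirroring this incident: the repair did not give a
# usable system, and the on-site restore did.
incident_outcomes = {
    "repair existing NAS file system": False,
    "restore from on-site backup": True,
}
step_used = first_successful_step(
    RECOVERY_LADDER, lambda step: incident_outcomes.get(step, False)
)
print(step_used)
```

Each later rung is only reached when the one before it fails, which is why recovery time grows the further down the ladder an incident goes.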

A full shared cluster restore from backups to new NAS is a critical incident for us, and we routinely train our engineers on disaster recovery steps. Our disaster recovery process functioned correctly, but because the event did not occur in isolation, we were unable to offer the level of individual service that we really wanted to, and that you would expect from us (e.g. individual site migration during restoration).

Given the magnitude of this event, we are currently investigating plans to split our platform and infrastructure servers across two data centre halls, which would allow us to continue running in the event of complete power loss to one. This added reliability is an extra step that we feel is necessary to put in place to ensure that this never happens again for our customers.

VPS and Dedicated Servers

For our unmanaged platforms (VPS and Dedicated Servers), the damage was more severe, as by default these servers are not redundant or backed up. In particular, one type of VPS was more susceptible to data corruption in the event of a power loss due to the type of caching the host servers use. We have remedied this issue on all re-built VPS involved in the outage, and no active or newly built VPS now suffer from this issue.

We did lose two KVM hosts (the host servers that hold VPS: approximately 60-80 servers per VPS KVM host, and 6-12 servers per Hybrid KVM host). The relatively good news was that the underlying VPS data was not damaged. However, we also lost two KVM network switches, which needed to be swapped out, and this resulted in intermittent network performance on other VPS during the incident.

To bring the VPS back online, the KVM hosts needed to have replacements built and VPS data copied from each before being brought back online. For every other VPS, the host servers were back up and running within 2 hours, but in many cases, the file systems or databases of the virtual machines on those servers were damaged by the power loss. For these VPS, by far the quickest course of action for customers to get back up and running immediately was a rebuild and restore from backups (either offsite or via our backup service).

However, we realised quickly that many of the affected VPS customers did not have any backups (whether with us or elsewhere), and that the only copy of the server’s data was held in a partially corrupted form on our KVM hosts, so we took steps to attempt to get customers back online. For every affected VPS, we ran an automated fsck (file system check) in an effort to bring the servers back online without manual intervention. This would not, however, fix issues with MySQL, which were the most common problem because of the high transaction rate. Tables left open during a power loss are likely to contain corrupted data, so we provided a do-it-yourself guide to help customers get MySQL into a working state.
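For readers on unmanaged servers, the two checks mentioned above can be sketched as follows. This is an assumption-laden illustration, not the commands or the guide used during the incident: the post does not say which options were run, the device path is hypothetical, and `mysqlcheck` will normally also need credentials. The sketch only builds and prints the commands so they can be reviewed first.

```python
import shlex
import subprocess
from typing import List


def fsck_command(device: str) -> List[str]:
    # -y answers "yes" to repair prompts so the check can run unattended.
    # The file system must not be mounted while this runs.
    return ["fsck", "-y", device]


def mysqlcheck_command() -> List[str]:
    # Check every table in every database and attempt a repair where the
    # storage engine supports it (InnoDB tables need a different approach).
    return ["mysqlcheck", "--all-databases", "--check", "--auto-repair"]


def describe(command: List[str]) -> str:
    """Render a command as a shell-safe string for review before running."""
    return shlex.join(command)


def execute(command: List[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(command, check=False).returncode


# Hypothetical device path for one VPS volume; nothing is executed here.
print(describe(fsck_command("/dev/vg0/vps-example")))
print(describe(mysqlcheck_command()))
```

Neither check replaces a backup: both can only work with whatever data survived the power loss.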

We provided the option for us to attempt a repair, which typically takes 2-3 hours per server with an expected success rate of approximately 20%. We currently have a backlog of servers we have agreed to attempt to recover, but given the time per investigation, this is likely to take most of the week. This is roughly equivalent to the total loss of our NAS pair and is where disaster recovery steps (server rebuild and backup restoration) should be followed.
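The figures above (2-3 hours per attempt, roughly 20% success) imply a high cost per recovered server. The short Python sketch below works through that arithmetic. The backlog size of 40 servers is a made-up example, since the post does not state the real number.

```python
from typing import Tuple


def repair_estimate(
    servers: int, hours_per_attempt: float, success_rate: float
) -> Tuple[float, float, float]:
    """Return (total hours, expected servers recovered, hours per recovery)."""
    total_hours = servers * hours_per_attempt
    expected_recovered = servers * success_rate
    hours_per_recovery = hours_per_attempt / success_rate
    return total_hours, expected_recovered, hours_per_recovery


# Hypothetical backlog of 40 servers, using the low and high time estimates.
for hours in (2.0, 3.0):
    total, recovered, per_recovery = repair_estimate(40, hours, 0.20)
    print(f"{hours:.0f}h/attempt: {total:.0f}h total, "
          f"~{recovered:.0f} recovered, ~{per_recovery:.0f}h per recovery")
```

At a 20% success rate, each recovered server costs roughly 10 to 15 engineer-hours, which is why a rebuild and restore from backup is the faster route whenever a backup exists.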

As these servers are unmanaged, there is no disaster recovery process in place by default. I know this isn’t the answer many of you want to hear, and most of all we want to ensure that this can never happen to you again. All VPS hosts are now set to be far more resilient in the event of a sudden power loss.

Support and Communications

During this incident, we have worked our hardest to ensure that our entire customer base was kept informed of our progress through our status page.

Given the scale of the issue, the load on our Customer Services team was far in excess of normal levels. On a standard day, we handle approximately 800 support tickets, which can rise to 1600 during a fairly major incident. At absolute capacity, we can handle approximately 2000 new tickets per day.

This event was unprecedented, so during and following the incident we received in excess of 5000 new support tickets every day (excluding old tickets that were re-opened), and the ticket complexity was far higher than usual. Our admin system was not set up to handle this number of requests (being poll heavy to give our team quick updates on our ticket queue). This heavily impacted the performance of our control panel and ticketing system until we made alterations to make it far less resource intensive.
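The ticket figures above can be turned into a rough backlog estimate. The Python sketch below uses the numbers from the post (800 tickets on a standard day, capacity of about 2000, more than 5000 received per day) and treats 5000 as a lower bound. The three-day incident length is an assumption for illustration, and the model ignores ticket complexity and re-opened tickets.

```python
NORMAL_INFLOW = 800      # tickets per day on a standard day
DAILY_CAPACITY = 2000    # tickets per day at absolute capacity
INCIDENT_INFLOW = 5000   # tickets per day during the incident (lower bound)


def backlog_after(incident_days: int) -> int:
    """Unanswered tickets accumulated while inflow exceeds capacity."""
    return (INCIDENT_INFLOW - DAILY_CAPACITY) * incident_days


def days_to_clear(backlog: int) -> float:
    """Days to clear the backlog once inflow is back to normal levels."""
    spare_capacity = DAILY_CAPACITY - NORMAL_INFLOW
    return backlog / spare_capacity


backlog = backlog_after(3)  # assumed: three days at incident volume
print(backlog, days_to_clear(backlog))
```

Under these assumptions the backlog grows by at least 3000 tickets per day, and every day at incident volume adds about two and a half days of catch-up afterwards.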

After this, we took immediate steps to ameliorate the incredible support load via automated updates to affected customers, but most of the tickets required in-depth investigation and server repairs that require a high level of technical capability, so could only be addressed by our second line and sysadmin staff. It will take some time to clear our entire ticket backlog and restore normal ticket SLAs.

We had planned to go live with a brand new Heart Internet customer specific status page on the day of the outage, as it would allow us to provide greater detail for direct customers without the requirement that messages be white labelled and generic.

We did not push this live during the incident as we needed all hands on to fix the live issues, but we have just made it live at status.heartinternet.uk (it will later also be available at https://heartstatus.uk using external DNS). The service allows for subscription via email, SMS, and RSS, so you will be kept up-to-date during any major incident. Past events are also archived and remain fully visible. We will also use this page to inform you of any changes to the platform or scheduled work.

 

Most of all we’d like to apologise to you, and to your customers. We know as much as anyone how important staying online is to your business. The best thing we can do to regain your trust is to offer good, uninterrupted service long into the future, and that is now our utmost priority.

Comments

  • 15/02/2016

    Whilst last week was pretty bad it could have been far worse but for the diligence of your team.

    The improved communication (introduced a couple of years ago) enabled us to support those of our customers whose websites are hosted on your shared platform.

    Your practice of randomly allocating new hosting accounts to different shared servers also helped reduce the impact on our business and customer base.

    Having recently survived the Joomla! 0-day Hack, during which time we received fantastic patience and support from your Technical Support Team, we are more than happy to reciprocate in kind!

    You are right to focus on improving systems and procedures moving forward. We have commenced the exact same process here following the 0 Day Hack.

    They say ‘every cloud has a silver lining’, and it is true!

    ‘It never rains but it pours’ is another old adage. And the recent floods here in the North West have highlighted the need to focus on becoming more RESILIENT in the face of major set backs.

    It seems Heart Internet is doing just that. Well done!

     
  • Kristina Harmsworth

    15/02/2016

    You guys have always been upfront with us about EVERYTHING and once again you do not disappoint. I have no problems with outages as long as the customer services are good. Yours always have been. 🙂
    Having worked in customer service I know how important that part of a service company can be.
    I came back to you on Sunday knowing full well about the problems and was still amazed that I managed to get up and going by this morning!
    I will always tell everyone that you guys are the best!
    Thanks for the status update emails that is a great idea.

     
  • Davide Broccoli

    15/02/2016

    Great post HI Like I have now been poached by two companies that want my business, I politely told them where to go and told them I would not leave you…. Keep up the good work…… An open day would be a good idea so we can see and report back how good you are…

    Many Thanks
    Davide

     
    • Michael Shillingford

      15/02/2016

      Much appreciated Davide – the open day is something a few customers have suggested to us. We’ll have some chats internally and see what we can come up with.

       
  • Keith stoddart

    15/02/2016

    These things happen. It was worrying but you seemed to have it under control. Well done for handling it the way you did.

     
  • Sandy Donald

    15/02/2016

    It is very reassuring to hear a clear and fullsome explanation and an acknowledgement that a real live major incident shows that previous contingency plans could have been better. Good communication with your customers is very important and it is evident that you fully understand this. Too many times, service providers fail to maintain and develop communications with affected customers during a major incident and this usually leads to a downward spiral of ever increasing pressure on communication.

    Although I have a very small IT support business I learned a lot of my trade in an enterprise environment of global proportions and I know that there is no such thing as a fool-proof contingency plan and that “all hands to the pumps” is not the answer when tens of thousands of customers are affected and want some sort of “answer”

     
  • Alex Tovey

    15/02/2016

    I am very disappointed at your update. You seem to be getting worse and worse on downtime. Affecting not only OUR company and our customers in which we have lost faith in you. The fact you do not give us compensation or an opt-out of our contracts.
    As all these comments seem to be from different people trying to save your reputation but as from myself and my customers your reputation has been destroyed.
    We are wanting compensation or something as a good will gesture to make us all feel safe at staying.

     
    • Craig Cotter

      16/02/2016

      Hi Alex,

      One of the biggest reasons for us wanting to move to a new status page is that it keeps us honest. Historic downtime and degraded service is recorded for posterity, and problematic servers will be far more apparent to both ourselves and you, our customers. We implemented a new benchmarking system that measures and records performance across our entire shared platform, that will point out any servers that are performing comparatively worse versus the rest of the platform. We’ll then be able to prioritise the replacement of problematic hardware.

      For any compensation relating to downtime, you should receive an email shortly.

       
    • Peter Gordon

      16/02/2016

      Completely with Alex Tovey, HEART “You seem to be getting worse and worse on downtime” – What exactly do you expect us re-sellers to tell our clients “AGAIN” – For me personally compensation is not what I am seeking – just more stability and, to be frank, better support and honest support – The outage was specified as DDOS Attack, a power outage would certainly be a step in addressing this, MY HONESTY

       
      • Craig Cotter

        16/02/2016

        Hi Pete,

        We agree – the old status page was not sufficient for comms during an incident, as it requires active participation (visiting a site) from Resellers, rather than us telling you directly. Our new status page offers subscription options and we’re currently considering the option of automatically subscribing all Resellers to the service (or providing an opt-in/opt-out).

        There was a small scale DDOS at the point of outage, and that was the last status message we were able to post via our normal interface. When we lost power, we lost connectivity and access to the standard tools to update the status page. It was necessary to update the page via FTP using mobile Internet before we were back online, but this meant that the only message on the status page for the first few minutes of the incident was the DDOS. This issue will not be apparent in future. We also plan to publicly display our platform uptime (rather than appearing to hide behind a status page that only displays active issues).

         
        • Lee

          16/02/2016

          Craig,

          I have been requesting this feature for the past 5 years or so using the reseller feedback form on how Heart Internet could improve services. It’s something you have been aware of for quite some time and only now after experiencing a major outage are you implementing it but not on the white label webhostingstatus.com page, just for your direct customers. What about resellers customers? And what about resellers themselves, do they not deserve to provide their customers with the same level of service or are their customers still expected to manually log on to a website for updates and raise tickets through their support control panels?

          I’m very disappointed that you have been reactive regarding this feature and not proactive/preventative, several of our customers have migrated away due to this outage as they cannot be left in the dark waiting for services to be restored, this is very frustrating!

           
  • Les

    15/02/2016

    I have been a reseller with HI for many years, I personally would like to thank all the staff at HI for their professionalism, dedication to service and for working all hours to bring services on line.

    None of my clients suffered any data loss and I am staying here.

    Thank you Heart Internet

     
  • shaly

    15/02/2016

    Well, my hosting service is disabled and database is deleted, very poor handling of support. In the mean time outages. Our website is down for 2 weeks, no help offered

     
  • 15/02/2016

    Hi

    I’d like to put on record my thanks for the efforts of everyone at Heart to recover from what was an extraordinary event.

    I think the efforts of the team to restore services under very difficult and stressful conditions need to be acknowledged, and indeed, in my view, given the size and scale of the outage, their skills and training proved to be an asset, which reinforces my view of Heart’s commitment to the service it provides.

     
  • Andy

    15/02/2016

    Just goes to show that if it can happen it will so plan for the very best, defend against the worst imaginable. Great work guys on getting us back online so quickly after such a huge meltdown, shows that you have an amazingly talented team at HI. I’m here to stay 😀

     
  • Emmanuel

    15/02/2016

    Having been with you for over five years now I have not experienced a fault of this magnitude with your systems. Every time that I have reached out to your support team on other issues I have always found a good experience. Your support is great and your products are easy to understand. Keep up the good work and this fault does not make me reconsider my loyalty because you rock!!

     
  • Jean-Marie & Isabel, Reseller

    15/02/2016

    Thank you to all of you, the H.I. Team! Your hard work, your transparency and your great level of communication are remarkable! We trust in your team and in your development!

     
  • Jonathan

    15/02/2016

    Thank you.

     
  • Johann Joubert

    16/02/2016

    Thank you for this article and thank you for the effort to restore everything ASAP. Last week was a particularly bad one for me as I was in hospital with a Gallbladder removal.. in the midst of which this disaster happened. Luckily my clients were understanding and I haven’t lost a single one of them. This just goes to show that if your level of service and customer care is of a very high standard you tend to get through the odd disaster unscathed.
    Regards
    Loyal reseller since 2009

     
  • Dave Woodard

    16/02/2016

    Although I praise you for your blog and the facts that caused this issue, I am a little underwhelmed with your forward thinking. To put all of your eggs in one basket (Leeds) shows a lack of forward thinking. Also, the service you supply is a service. We pay for that service…. but I see no offer to resellers for the time spent ringing customers and trying to reassure them that the issue would be sorted. I do think that you need to look at a way of telling your resellers “oops, we messed up” and either reimbursing 1 month’s hosting payment or offering some form of compensation.

     
    • Craig Cotter

      16/02/2016

      Hi David,

      We’ll be contacting all affected customers individually – as for the points about single DC hall – agreed, our expansion plans do include increased resiliency via use of multiple Data centres.

       
      • Dave Woodard

        18/02/2016

        Thanks Craig. I understand and have seen this sort of thing 1st hand, so i do praise you and your techs for a good job. but again reaffirm that Heart will grow, and you need to look to the future. as customers will wear this once, maybe twice. but i remember when i 1st came to heart from streamline, they were down for 6 weeks, and the service was intermittent and this is why i left…. please don’t let this be heart….

         
  • 16/02/2016

    I do understand that this was a catastrophic occurrence, but the incident did go on for an extremely long time. I understand that you will have suffered in terms of reputation, but so did we. I have been a reseller for a number of years and moved to Heart after having bad experiences with other hosting companies. I have been very pleased with your services, but am worried that over the past few months you have had a DDOS attack and this major breakdown. Both incidents have caused concern for my customers and damaged my reputation. I have just lost 3 customers as a result who are moving all their services to a new web development company who contacted them and poached them. For me this is not just loss of hosting income, which is relatively small, but SEO, PPC and email marketing services. This means a loss in recurring income that I have enjoyed from them for several years. I am fighting to reassure 6 other customers that they should give me another chance.
    The reason I use a hosting company to manage the hosting of my websites is that I do not want to be a server manager and have to provide 24/7 support. That is also the reason I do not have a VPS or dedicated server. Therefore, the hosting company we choose is very important. For me the key worries after last week are:
    • The contingency plan to overcome this
    • The time the problem lasted – 4 days is a long time for business websites.
    • The very poor communication.
    • All avenues of communication being at best difficult often non-existent.
    • No phone access, it was constantly blocked, I tried calling but the phone just cut off.
    • Ticket system got clogged up and was therefore useless
    • The https://www.webhostingstatus.com/ was updated so infrequently that it left me in the dark. How is https://status.heartinternet.uk/ going to be better?
    • Because I could not give sensible answers several customers were asking ME if I know what I was doing.
    • I now have to rebuild confidence with my customers, some of whom lost sales when ecommerce sites were down for a prolonged period. What are Heart going to do to help me rebuild these relationships? Saying sorry is not enough for them. They want concrete reassurances; several have told me that my apologies are cheap and are hollow because I could not guarantee Heart Internet servers.
    I need some ammunition please to save my customers and not apologies.

     
    • Craig Cotter

      16/02/2016

      Hi Gordon,
       

      First of all, my apologies – feeling helpless but being held responsible during times of crisis is something I want to absolutely avoid for all of our Resellers, you are our core customers and the reason we exist today. To reply to your points of concern:
       

      • The contingency plan to overcome this
       

      In the short term, we’re conducting an immediate investigation into how a piece of low risk maintenance work had the potential to cause a catastrophic power loss.

      For unmanaged customers, we’ll be taking steps to help them put a disaster recovery plan in place.

      In the medium term, we already have backup nameservers in a German datacentre, and we plan to expand the system to be fully redundant.

      In the medium to long term, we hope to split our core infrastructure, and later hopefully platform servers across multiple data centre halls (and later data centres). This would allow us to continue operations during the complete loss of, or loss of access to a data centre.

      In the long term, we have access to multiple data centres in France, Germany, the US and the UK. We hope to allow customers to provision to the data centre of their choosing.
       

      • The time the problem lasted – 4 days is a long time for business websites.
       

      Web 75-79 was a worst case scenario restore, something we’ve had to deal with fewer than 6 times in our history. Had it occurred in isolation, we would have been able to get customer sites back up and running slightly faster (and get priority sites back up and running using other clusters).
       

      • The very poor communication.
       

      Given the scale of the issue, we had to rely primarily on communication out to our entire customer base. We endeavour to always operate with a support surplus so we can deal with major incidents, but the contact rate was over 2.5x that of previous incidents. In situations such as this we rely on our status page for primary comms. Given that our existing status page was designed to be white-labelled for Resellers, we were put in the difficult situation that updates needed to be intentionally vague and could not reference our internal systems in any meaningful way. The new, Heart-branded page (that will sit alongside the existing one) will allow us to provide far greater detail without risk of breach of whitelabelling.
       

      • All avenues of communication being at best difficult often non-existent.
       

      We’ve taken steps to ensure that our ticketing system remains accessible during times of very high load. We intend to publicise expected wait times for tickets during times of crisis, as we want to set realistic expectations. We also plan to create support guides/boilerplates for possible disaster scenarios. This would allow us to get an FAQ and fix guide out within an hour, rather than putting it together during the incident. The biggest hit to our support teams was on our unmanaged platforms, and the best measures we can take here are preventative for the future. We want to ensure that everyone has a plan of action in case of data loss, and we’ll be offering customers assistance in setting up regular, scheduled backups for their business critical data, to protect them in every eventuality (server crash, act of god, power loss).
       

      • No phone access, it was constantly blocked, I tried calling but the phone just cut off.
       

      We currently only offer accounts and billing support via telephone which was added late last year. We do want to make it easier for you to contact us but there is no way we could roll out phone support with our existing support team. When dealing with an extremely high volume of contacts, automation allows us to ensure that a far greater percentage of our customers receive a meaningful update.
       

      • Ticket system got clogged up and was therefore useless
       

      Covered this one above – ticketing should be far more responsive in the case of very high load in future.
       

      • The https://www.webhostingstatus.com/ was updated so infrequently that it left me in the dark. How is https://status.heartinternet.uk/ going to be better?
       

      The tools for updating the new status page are considerably better (the existing status page could only be updated via FTP when internal tools were unavailable), and historic updates are saved. It also allows us to mark individual services as degraded, and you can now subscribe to see scheduled maintenance. Beyond this, what sort of detail would you like to see? For the lost shared cluster we provided ETAs where available, but we were initially mistaken in the belief that we could bring the existing NAS drives back up without a rebuild. We didn’t make it clear that the expected fix time had shifted from several hours to several days until we had begun work on the backup restore.
       

      • Because I could not give sensible answers several customers were asking ME if I know what I was doing.
       

      Hopefully, more system-specific status updates will allow you to pick and choose how you inform your customers in future.
       

      • I now have to rebuild confidence with my customers, some of whom lost sales when ecommerce sites were down for a prolonged period. What are Heart going to do to help me rebuild these relationships? Saying sorry is not enough for them. They want concrete reassurances; several have told me that my apologies are cheap and are hollow because I could not guarantee Heart Internet servers.

       

      I hope my points above go some way to providing you with the reassurance you need. We will be further strengthening our teams to ensure that we can deal with more simultaneous issues in the future.

       
  • James

    16/02/2016

    It seems that I am the only one who is extremely dissatisfied.

    Casting my mind back to 2014, when we suffered the last major outage, we were promised afterwards “advance communication” and “transparency”. Neither of these happened at all.

    Granted, HI wasn’t to know the power would go down – it sounds like nobody could have predicted that, but “transparency”? There was none of that at all.

    I know that I, and many other customers and resellers of HI, have felt stonewalled and ignored. This is largely down to failure to provide us with details, timeframes, and explanations about what’s actually going on. I spent over 4 days worrying that my clients were going to suffer data loss, and despite asking that question directly over and over again, via support tickets, Twitter and Facebook, I was given no response. I can appreciate how busy you all must have been, and under a great deal of pressure, but I know a lot of people felt completely isolated – and that’s what’s really upset your customers more than anything.

    Despite all your talk in 2014 of “what we’ve learned”, you failed to put in to practice anything you had learned at all. So much so, that you only deliver the system status thing mentioned in 2014 yesterday. That hardly sounds like it was a priority for you.

    I’m not suggesting that this hasn’t been a nightmare for you guys as well – I’m just giving you honest feedback in the hope that you take it on board for the future. Cheers.

     
  • Derek Cook

    16/02/2016

    I am a Reseller and had many of my clients complaining so I really appreciate the explanation and up-front detail which I can share with them. It all makes sense to me and if I or my clients thought it was frustrating and hard I am sure it must have been a day from hell for Heart with lots of “headless chickens” and stressed tired people working flat-out. So well done for getting it sorted as I am sure other companies would have taken weeks to recover. The great thing to do after a disaster is to learn and correct. Kudos from me!

     
  • Pip

    16/02/2016

    Hi!

    I’ve been with you guys about thirteen years now, and this is the first major outage in all that time. I couldn’t be happier with the service! And thank you for the very full explanation – I’ll be showing this to my customers as well to reassure them that the site should never be unavailable again.

    However, I saw some mention of a DDOS. Although this wasn’t involved in the power loss, was it a thing? Is it still a thing? I am running slower than expected.

    Also, finally, were any emails lost whilst the service was down?

    Thanks!

     
  • SUP

    16/02/2016

    Thanks for sorting out the issue and letting us know of the progress throughout the outage. It was a hard time for everybody, and I guess even more so for yourselves.

     
  • Darren

    16/02/2016

    Thanks for the detailed response. I don’t think anyone doubts you did everything you could. I have 2 questions. Firstly, I’m part of that 2% affected (web75-79) but my data still hasn’t been recovered. These were new sites that now have zero files on the server. You mention this was completed Sunday afternoon, but it’s Tuesday now and still no files.

    Secondly, when the outage first happened, your support page said it was due to a DDoS attack. Can you tell us why this information was given, and how you could be so mistaken?

     
    • Craig Cotter

      16/02/2016

      Hi Darren.

      Web75-79 were restored from the most recent backup. Depending on the site, the backup would have been a minimum of 1 day and maximum of 7 days old. If the site was under 7 days old, there is a chance that it had not yet been backed up for the first time.

      As for the DDoS, there was a small-scale DDoS against one of our web servers just prior to the outage. We lost all connectivity and external access when the data centre power was lost, so the status had to be updated manually via FTP over mobile, which took a few minutes. As a result, it took longer than we would have liked to replace the old status (DDoS) with the updated one.
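The backup window described in this reply can be sketched in a few lines. This is only an illustration, assuming a hypothetical 7-day backup cycle inferred from the "minimum of 1 day and maximum of 7 days old" figure; the function name and the dates are invented and do not describe Heart Internet's actual tooling.

```python
from datetime import date, timedelta

# Assumption: every site gets one backup per 7-day cycle.
BACKUP_CYCLE = timedelta(days=7)


def may_lack_backup(site_created: date, outage: date) -> bool:
    """Return True if the site was younger than one backup cycle at the
    time of the outage, so its first backup may not have run yet."""
    return outage - site_created < BACKUP_CYCLE


outage_day = date(2016, 2, 10)  # assumed date, for illustration only

print(may_lack_backup(date(2016, 2, 6), outage_day))  # site 4 days old
print(may_lack_backup(date(2016, 1, 1), outage_day))  # site over a month old
```

A site created within the last cycle is the case Darren describes: there may simply be nothing to restore from.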

       
  • Jai

    16/02/2016

    Great! Just got below reply from your customer service department.

    “Thanks for contacting us.

    I am very sorry but that there is no ‘Service Level Agreement’ on a standard account with a shared server package. This is in the terms and conditions that you agreed to.”

    This is not the first time we have had a bad experience with you. Also, this letter is good but it’s not the solution and it certainly doesn’t protect us from a similar disaster. If renewal of a £5 product failed we would be blocked from accessing the cPanel, so how would you expect us to live with almost 5 days of downtime?

     
    • Craig Cotter

      16/02/2016

      Hi Jai,

      We plan to make alterations to our billing system and account lockout procedure that will allow customers to log in and manage their account even while there are outstanding invoices on the account. This will give customers far more leeway if any systematic payment issues occur. It would also ensure that customers are not locked out for long periods during times of crisis.

       
      • Jai

        16/02/2016

        Our business incurred reputational damage due to the site being down. I did raise a ticket, 1602150330. How can we get compensated?

         
        • Craig Cotter

          16/02/2016

          Hi Jai,

          For every customer severely impacted with the downtime we’ll be contacting you individually.

           
  • Jai

    16/02/2016

    I wondered why I am only reading the comments below which are only positive about HI but now I got the answer. They are moderated and I am sure this won’t make it to the list.

     
  • Jump Digital Ltd

    16/02/2016

    We’re a long time Heart customer and there’s a reason we recommend your hosting to all our clients. We’ve always received phenomenal customer support from you and your team and this post is yet another example of going the extra distance to keep us informed and up to date; allowing us to do the same for our clients.

    During this entire power outage we needed a site migrated from one account to another and while we didn’t expect it to happen until everything was up and running properly again, your team went above and beyond and finished off our support ticket/migration without incident.

    Thank you all for continuing to offer excellent service, even in the midst of chaos!

     
  • Mark

    16/02/2016

    May I suggest you look at something like VMware Site Recovery Manager? This is designed for the situation you just found yourself in. Also, VMware Fault Tolerance ensures the continuous uptime of business-critical services when power fails. (and no I don’t work for them!)

     
  • Dee

    16/02/2016

    I can only say well done. As with everything, lessons can be learnt and I am sure Heart will get stronger. As a reseller of 10 years now I still have faith in you, although I must admit this outage did make me nervous indeed.

    All the same thank you from me and all my customers … WN6 Creative, Wigan

     
  • Karl

    16/02/2016

    Nice post. Thank you for the full update and good work for getting it all sorted out. The new customer-specific status page is a brilliant idea, especially with the email and SMS features. I am really pleased to see it finally added.

     
  • Andy - Oxford WebHosting

    16/02/2016

    Thanks so much for explaining exactly what happened so that I have something intelligent to say to my hosting customers. Also for getting things back up and running without too much fuss. I think that I only had a day with things not working properly. Sounds like one of those ‘one in a million’ kind of things… and we all survived it! Well done guys, this is why you’re the best around. You make MY company look good too! Win 🙂

     
  • 16/02/2016

    Will the recovery issues described from the power failure affect the download of backups from our hosted site?

    I know this is small beer in your sphere but we are experiencing unwanted AWS crawlers on our site and need to change the WordPress config files but need a clean restore file if I need to restore.

    Many thanks

     
  • 16/02/2016

    Thanks for keeping us updated about this nightmare of an issue. I have always found Heart Internet a reliable company to work with and their customer support has always been second to none. It’s good that you have taken the time now to fully explain what happened. As a reseller I found the update page to be a useful resource to share with my customers to explain what was happening.

    Fortunately for me most of my clients’ websites were back up and running again within hours, and most didn’t notice there was an issue. I am glad Heart Internet made a priority of recovering the email servers first as this would have caused my clients the most problems. Only one site was down for a longer period – 4 days, as it was on Web77. I am lucky that this particular client hasn’t seen it as an issue.

    On reflection however, I think you do need to address the issue of liability here. We all know disasters occur and recovery can take time. Someone earlier mentioned flooding in their reply. If this issue was caused by a flooding incident for example, there would be insurance against that type of risk. Heart Internet would be recompensed for the disaster and their recovery and restoration costs covered.

    What happens with websites hosted by Heart Internet though? If one of my clients was running an eCommerce business and it was their primary income stream, could they seek recompense from me for loss of business? As the designer and hosting provider for their website, am I liable for the infrastructure that supports it? If my client sues me for damages, can I seek compensation from Heart Internet? Does Heart Internet have insurance for this disaster?

    I am aware that there are insurance policies out there that cover cyber risks and data breaches but this event was not caused by such a breach. Perhaps Heart Internet should ensure resellers are aware of the limit of their liabilities and maybe provide a pointer to some insurance policies that will cover the risks to us of websites hosted with them going down in future.

     
  • Karan

    16/02/2016

    Found service to be appalling. Issues have been caused by Heart Internet safety procedures failing and we have been left with a non-functional website. Response to support tickets has been nothing short of atrocious and when we do get a response it is unhelpful and does not provide a resolution.
    First problem we have experienced with Heart Internet and it’s left us without an operating business.
    Getting customers’ websites operational has definitely not been the priority. Heart Internet is going to be losing me as a customer.

     
    • Craig Cotter

      16/02/2016

      Hi Karan,

      Do you have a managed (shared/reseller/premium hosting) or unmanaged (VPS/dedicated) product with us? All managed platforms should be back up and running at close to full strength now. If you have a VPS that suffered corruption to MySQL or its file system, our recommended course of action is to rebuild the server and restore from your latest backups. It is possible that the server may be recoverable (by making manual edits to the MySQL db that will get it into a readable format) but the chance of this is not high, and there is currently a large backlog of VPS tickets to be handled. I’d rather be honest with you than give you false hope – a restore from backup is the quickest route to getting back online.

       
  • Peter

    16/02/2016

    Offsite DNS would be wise so that email destined for your hosted domain will get gracefully queued rather than bounced as unable to lookup.

     
  • 16/02/2016

    This was just another reason to add to the many of why I am in the process of moving my clients away from Heart Internet.

    I’ve been told repeatedly that slowness is down to the plugins on my website.
    Funnily enough, though, when I moved my main site to a new host (exact clone of site), it loaded in 1.4 seconds instead of 68 seconds.

    So I moved another. Same result!

     
    • Craig Cotter

      16/02/2016

      68 seconds sounds like your website was trying an internal or external call (possibly a loopback/cURL request to the host web server) that was failing, and the rest of your site wasn’t loading in parallel – the rest of the site didn’t load until the first request died due to timeout.
      If it’s a WordPress or other CMS site and you have a number of plugins, it’s quite likely that one of them was attempting a blocked request that isn’t blocked on the other host. If your site needs loopbacks we can enable this on a per-case basis.
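The effect described in this reply, where one failing request holds up the rest of a page until it times out, can be shown with a small simulation. It is a sketch with made-up numbers: the latencies, the 60-second timeout and the function name are all hypothetical, and no real HTTP requests are made.

```python
def page_load_seconds(call_latencies, timeout):
    """Total time for calls made one after another, each capped by a timeout.

    A call that never answers is represented by float("inf"); it costs
    exactly `timeout` seconds before the page moves on to the next call.
    """
    return sum(min(latency, timeout) for latency in call_latencies)


# Hypothetical page: one loopback call followed by two ordinary steps.
blocked = [float("inf"), 0.9, 0.5]  # loopback request never answers
allowed = [0.2, 0.9, 0.5]           # same page when the loopback succeeds

print(page_load_seconds(blocked, timeout=60))  # dominated by the timeout
print(page_load_seconds(allowed, timeout=60))  # ordinary load time
```

Because the calls run in sequence, the whole page inherits the timeout of the one request that cannot complete, which is consistent with a load time of about a minute on one host and a second or two on another.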

       
      • 16/02/2016

        Craig – I can’t believe what I’m reading here re your comment on loopbacks! A few months ago we were using WooCommerce / WordPress for an important new client commission. After long hours of trying to figure out why a particular membership plugin wasn’t working we narrowed it down to a restriction on loopbacks. We requested loopbacks enabled and were told that it was a “security issue” and “no can do”.

        In fact WooCommerce support responsible for the plugin rubbished this explanation and suggested that our hosting provider was failing us.

        In the end we went to additional trouble and expense to migrate the site to a new host – and this solved the problem.

        (I appreciate that the current issues are being addressed but I’ve spent the day talking to Rackspace and other providers … some of whom are obviously looking to convert new customers off the back of your troubles …. and we may, I fear, be one of them)

         
        • Craig Cotter

          16/02/2016

          The reason for the restriction on loopbacks is actually poor coding on customer sites. A huge number of customers have set up poor or infinite loops on their sites, which we’re blocking by default. If we enabled loopbacks across the platform, it would generate an enormous load across our shared servers, as many customer sites would start looping back on themselves in an infinite cycle. Unfortunately (excluding WooCommerce here as it’s quite a decent piece of software) there are a lot of plugins that recursively loop back unnecessarily due to lazy coding and hit servers harder by a factor of 5-10. If customers have a sensible reason to enable loopbacks (WooCommerce is completely legitimate) we’ll enable them, although we may reject some requests. I’ll have this codified so there’s little opportunity for confusion in future. If we could turn on loopbacks with zero impact on our platform, we’d do it right away. It tends to be primarily our older servers (with older sites on) that contain the majority of the ‘bad’ loopbacks, and we can’t turn it on for some servers but not others – uniformity is very important for a platform of this size.

           
      • Rich

        16/02/2016

        Craig – I have asked for loopback enabled and been repeatedly denied – I am MOVING AWAY! I have lost a significant new client which was worth £10k pa to me, due to this, as they were trying to email me and getting bounce backs due to lack of availability 🙁

         
        • Craig Cotter

          16/02/2016

          That’s our loss and I’m sorry to see you go, but I understand.

           
  • John

    16/02/2016

    Most of it has been said by other people but I would like to say the following.
    Well done and thank you for a comprehensive explanation and full support over the incident.

     
  • DanX

    16/02/2016

    Obviously a horrible week, but a very good response. Thanks for the info. I directed several of my customers to your status update pages, which they could then check later for further updates. Good blog above.

     
  • 16/02/2016

    I appreciate the thorough followup

     
  • 16/02/2016

    I appreciate your transparency about the downtime incident that took place recently. We are not seeking compensation from Heart Internet even though this latest incident has caused us to lose customers and even to receive an email threatening to take matters to court. Having spent a whole decade as a loyal reseller … better to call us business partners … of Heart Internet, we have grown with you and even introduced many clients to Heart Internet. All we are requesting is that you treat your resellers, your business partners as your equals with the same degree of speed and server performance as your own servers have. To put resellers on the same shared hosting with other clients is not fair or acceptable. We all know that Hostpay is extremely slow on a shared server when resellers are paying the equivalent of a monthly dedicated server. Also on another related matter, it is not fair that resellers / business partners are being made to pay £63.74 monthly for a basic dedicated server when it is available on HI for £49.00 monthly. Where is the fair and logical pricing mechanism in that? An equitable business relationship must be based on fairness. For proper business partner retention to continue these matters need to be addressed. Many thanks to your excellent support team.

     
    • Craig Cotter

      16/02/2016

      Hi Omar,

      Thanks for your comments. Reseller websites make up the majority of our shared hosting platform – separating Reseller sites would actually serve to speed up non-reseller sites rather than the other way around. Resellers do get a 25% discount on all of our new dedicated servers as well as the old ones – so you can get a bottom spec server for £36.75 p/m + setup.

      Hostpay should work on our Premium hosting platform, and we are currently looking at proper Reseller pricing for that platform that would allow you to host important sites (including Hostpay) relatively inexpensively.
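The reseller price quoted in this reply follows from the 25% discount. The sketch below assumes the £49.00 monthly price mentioned in the comment above is the list price of the bottom-spec server; setup fees are left out.

```python
from decimal import Decimal

list_price = Decimal("49.00")        # assumed list price, from the comment above
reseller_discount = Decimal("0.25")  # the 25% reseller discount

reseller_price = list_price * (1 - reseller_discount)
print(f"£{reseller_price:.2f} per month")  # £36.75 per month
```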

       
  • John

    16/02/2016

    None of this remotely helps in the lost sales that have directly affected our business. Shutting the stable door after horse has bolted springs to mind. Look for alternatives.

     
  • James Simpson

    16/02/2016

    May I suggest a secondary DC which can take over in case of something similar?

     
    • Craig Cotter

      16/02/2016

      Hi James,

      We do have access to 2 additional datacentre halls in Leeds, and a number of datacentres in France, Germany and the US. We’re currently working out the best way to make Heart Internet more resilient in the future.

       
  • Jamie Snape

    16/02/2016

    Whilst it’s clear you are taking big steps to ensure this kind of event doesn’t happen again, it’s sad to see you feel you communicated well during the event.

    Resellers were left with no information to give their clients, no ETAs for restoration. Angry clients were calling over and over and we could tell them nothing.

    Seems you haven’t made the biggest change needed, that being to your communication.

    Finally could I suggest your own website is hosted elsewhere so your ticket system still works in the event your own servers go down again.

     
    • Craig Cotter

      16/02/2016

      Hi Jamie,

      I think the amount of comms we put out was close to sufficient (although it still fell short in some areas), but they were not directed well enough. Our immediate push is to ensure that all comms are directed (rather than requiring customers to seek them out). We’re also preparing some communications templates detailing the best course of action to get back online depending on platform; this way we can immediately send them in case of disaster, rather than spend time writing them.

       
  • 16/02/2016

    To me it seems obvious for server manufacturers to just build in a small UPS into each server as a matter of course. Small battery offering enough power to facilitate a full shut down. Obviously I would say that, as a web75 sufferer.

     
  • 16/02/2016

    Thank you for this incredibly detailed and technical review of what happened.
    We appreciate the work you do to keep our sites running.

    Thanks
    Calum

     
  • Mark

    16/02/2016

    Just wanted to say well done guys. Always going to recommend you

     
  • 16/02/2016

    If only nothing in life ever went wrong but it does! Thanks for all your hard work it is very much appreciated

     
  • Angus

    16/02/2016

    We spent around 9 hours trying to figure out what went wrong with one of our client databases; we thought at first the site had been hacked. We lost an entire table from one of the WordPress sites, and as you would have it, it was the options table. Whilst I appreciate that these things happen, and I would expect no less than ALL HANDS ON in such an incident, given that this is your business, it is a pretty important incident that should never have happened.

    Given the time we had to spend getting our clients back online, I can not join in on the praise, this has caused reputational damage to ourselves and in turn Heart.

    In this digital age, email and communications are important to business, and downtime costs are high, and for businesses like ourselves supporting customers, that cost can not be recovered due to this not being the fault of the customer.

     
    • Craig Cotter

      16/02/2016

      Hi Angus,

      First of all, I’m glad to hear your client is back online, that’s the most important thing. Given the size of our unmanaged platform, we don’t have the number of highly technical staff required to respond in a meaningful way to this volume of server tickets. With the additional measures we’ve put in place, we should never have to deal with an issue on this scale again, but this does not decrease our desire to ensure that every customer is prepared for every eventuality. We’re planning on putting together a disaster recovery checklist/centre for everyone on VPS, Hybrid and dedis that ensures that everyone can get back online quickly.

       
  • Ryan

    16/02/2016

    I appreciate the efforts you have gone to to resolve the issues you faced however we are quite a few days on now and it has cost us many hours of work trying to rectify the issues on our VPS’s and hybrid that were no fault of ours.

    We still have open tickets that have not been answered in several days and a clients site that is not fully functional due to one of the faults.
    We have lost one client due to the issues and could lose others if the issues are not dealt with swiftly.
    I think due to the amount of extra hours we have had to put in and the costs incurred to this and the loss of business then some sort of compensation should be given.

     
    • Craig Cotter

      16/02/2016

      Hi Ryan,

      If you’re a VPS customer, we should have emailed you directly on this matter.

       
  • Mat O'Marah

    16/02/2016

    I know this was an incident that could never have been predicted or planned for, but it has really highlighted how you hadn’t put in place the promised changes that you learnt in 2014. Have you ever listed any Planned Maintenance on the web hosting status page?

    Finally we have the new status page, that I notice seems to be getting a LOT of use this week. This is good, finally we can get updates *sent* to us when things happen, or it appears are even planned. This is a real step forward … but it shouldn’t have taken this long since it was promised so long ago.

    Also, I would like to see you proactively communicate with customers! We are always having to come to your websites, your status pages, your facebook page to find out what is going on. I believe it wasn’t until today the first email was sent out about this incident to customers.

    I know from my own experiences that when a disaster happens it is automatically all hands to fix. But your DR plan also needs to include a small number of individuals being assigned to communication: to ensure the promised 20-minute updates happen, to ensure social media is updated, and to ensure customers are emailed.

    I have lost customers because of this latest incident, and will now start to evaluate my own hosting provider / partner, and whether or not I stay with HI. Once all technical issues are resolved over the next few days, how you handle this and rebuild the relationship with us is critical!!

    Oh, but I have to give credit to your Facebook guy – done a great job at interacting with customers.

    Mat

     
    • Craig Cotter

      16/02/2016

      Hi Mat,

      I replied to another customer on this subject, but we’re currently looking at the option of automatically subscribing all of our resellers to status updates by default (or offering an easy opt-in). We want to discuss this with you, and Michael (our Facebook guy) is looking at kicking that discussion off in the next few days. It’s important that we move quickly following previous delays, it’s all well and good for us to put out updates, but if they don’t reach the right people, then it’s all for naught.

       
  • Mik Smiff

    16/02/2016

    Just a note to say that it is the first time I recall receiving an apology from a service provider for website or broadband outage since I loaded my first website in 1996.

     
  • André

    16/02/2016

    I’ve been with Heart Internet now for a few years and always found their support exemplary, to that end, I commend your efforts and transparency. Hopefully your customers will remember the good days and put the bad down to an educational opportunity, we were unfortunate to be in the UnManaged brigade, I have taken this as a “lesson learned” for us and consequently increased our own backup regime and services so we can respond along side Heart to keep our customers happy. Again, well done and keep up the good work.

     
  • 16/02/2016

    2 DATA HALLS
    Dear Craig – I have been a customer of Heart for over 10 years and in January, at some cost, I moved everything to Premium Hosting. I am still unsure what to do. My customers are asking me to move. If I can tell them you are in at least 2 data halls (hopefully some miles apart) I might stay with you. When will you know the outcome of your investigation please?

     
  • Kecin

    16/02/2016

    I’ve worked in IT for many years and when things like this go wrong they really go wrong! This is a risk that faces most companies (although I’d like to know how the safety system managed to override your UPS!)! What counts is how you recover and, given what a complex task this was, you have proved that you have the right technical expertise and a great team. I guess few companies experience this type of event so you should now have the experience to a) make sure it doesn’t happen again & b) have world-class recovery skills next time. What doesn’t kill you makes you stronger!

     
  • Tobi

    16/02/2016

    Outages happen and as always the Heart Internet team have been fantastic in handling this! Thank you for your up-front honesty and the level of detail provided in explaining why this issue occurred. Good luck with the continued recovery activities. I will be remaining a customer as I have been for the past 8 years.

     
  • Paul Scott

    16/02/2016

    Sounds like everything that could go wrong did go wrong. I think you all did a great job given the circumstances. I know that the knowledge gained from this event will make my sites more secure in the future.

     
  • Ian Gardner

    16/02/2016

    I have experienced many power failures on non-virtualised servers over the years and never had problems. The modern journal file systems recover themselves reliably and at most you lose the last uncommitted write.

    Is there something about virtualising servers that is bypassing the file systems, eg some kind of block level replication, that caused such chaos?

    As for MySQL, I have always been suspicious of a database that started life as a non-transactional single user db, postgres has always been absolutely bullet proof for me.

     
  • Adrian

    16/02/2016

    Thank you HI for a thorough explanation and (let’s all be honest, commenters), an apology that offers some form of recompense to those most affected.
    As a web76 customer, yes, it was a nightmare waiting for things to be sorted and communication left a lot to be desired. But Craig does look to be outlining measures that will address the issues everyone has suffered. I’m prepared to accept that HI will learn from this experience and move forward with them. I’m of the opinion that with these new measures, I’ll probably be safer staying with HI than chancing a similar scenario with an alternative provider who hasn’t learned the hard way and doesn’t offer some of the features and benefits HI soon will.
    BUT, that faith has to be rewarded, Craig – I’ll stick with you but I don’t want to be in this situation again any time soon…
    Thanks for the direct contact re. my web76 issues and keep up the good work – lots of things in life do go wrong but it’s the way you react, learn and move forward that makes the difference.

     
  • Terence Malaher

    16/02/2016

    Dear Craig, thanks for the note received in my inbox ref your problems and possible inconvenience to viewers of my The Testament of Truth web site having very important info for everyone globally.

    It is nice to see the main pages are still OK and I will need to look further (I assume) for I am not really conversant with what has occurred other than a power outage.

    Forgiveness is the ‘order’ of the day as is the paying off of ones karmic dues and I hope that other customers can hang in there because systems failures can happen to anyone and I see you chaps are simply doing your very best –

    Thanks, and as a customer all I needed was to be advised of the problems as you have done – being that I need to check my site and re-upload IT if needed –

    Keep UP the good service – Terence

     
  • Andrea

    16/02/2016

    Thank you for being honest and upfront about what happened. You are a very serious company and you are clearly aware that a crystalline level of communication with your clients is paramount. I’m very happy to have my website hosted by you and I will continue to do so in the future.

     
  • Daniel Baverstock

    16/02/2016

    Being a reseller, I like many others have been looking at alternative hosting companies, in between dealing with clients throwing their toys out the pram big time… A really difficult week ensued, as we do put the customer first, and have been doing all we can to maintain and retain their business.

    I find myself in a particular dilemma, as over the years 10+, I have found HI to be a sound platform, and the hosting has quite simply just worked (most of the time), even support and status has been very acceptable, up until last week….

    I’m not looking for compensation, as a few quid is irrelevant in the bigger picture… But looking at what other hosts offer (Free Offsite backups – some 4 times a day etc) I can’t help thinking that something like this at HI, would perhaps have solved many of the issues after the current downtime for all concerned, including HI themselves.

    Perhaps HI would consider offering their Resellers something along these lines… Our customers, who are also by default HI customers, really need a little TLC, a little security, a little confidence that lessons have been learnt, and most importantly, that all is being done to ensure this does not happen again.

    These kind of scenarios are a double edged sword for all concerned, and I can only imagine the pressure your tech guys have been under over the last week. We have all had to be on our best diplomatic behaviour, and maintain some karma.

    Hopefully the hard work has been done, and the dust is starting to settle, if we could go back to clients, with an improved armoury of service, and a few more shiny strings to our bow, I’m sure the past week or so, will be yesterday’s news.

     
  • Vernon

    17/02/2016

    The Universal law of Perversity is clearly alive and well in Leeds!

    My own experiences as an engineer show that there will always be incidents that even the best prepared folk will not have anticipated.

    For myself, it was an unfortunate coincidence that I was trying by phone to resolve an unrelated email problem for one of my clients when your servers went down taking all emails with it!

    Fortunately, I realised that the problem must lie with Heart and advised the client to wait 24 hours. This solved the problem – so thanks for working on it so fast and well.

    In your message you specifically mention Web 75-79. How can we tell on which server(s) our sites are hosted? I and my clients each have sites on your shared platforms.

     
  • Andres

    17/02/2016

    Hi guys, we have tried to manage clients’ anxieties as much as we can, but we were wondering how close you are to being back to normal. We are still experiencing internal server errors on some sites that connect to databases, and whilst we are patiently waiting, our clients seem to be running out of patience.
    Keep us posted and hope we can get some updates on critical tickets soon.

     
  • Dave

    17/02/2016

    “As these servers are unmanaged, there is no disaster recovery process in place by default. I know this isn’t the answer many of you want to hear, and most of all we want to ensure that this can never happen to you again. All VPS hosts are now set to be far more resilient in the event of a sudden power loss.”

    How are they now more resilient? It’s still an unmanaged service.

     
    • Craig Cotter

      17/02/2016

      They are far more likely to survive a reboot of the host server with a much lower chance of data corruption.

       
  • 17/02/2016

    This is just so frustrating. Like several others on here who have commented about slow server performance prior to the power outage I have also been told ‘it’s my site that is the problem’

    Despite many tickets being raised and evidence given, Heart Internet are unable to identify or admit a problem. The problem is fairly clear – too many packages on the server which is overloading resources. It wasn’t a plugin that was causing a problem, nor was it a loopback either, it was sheer load on the server. This was proved by me replicating the package on another Heart server and the speed was tremendously faster. Now, as a result of Heart continuously telling me that there is nothing they can do and a shared platform ‘is what it is’ I’ve had no choice but to move the affected package to a new provider. The speed now is hugely improved. Prior to the power outage, we would enjoy page loads of approx 2 seconds for the WordPress dashboard to load, after the power outage it took way over 10-15 seconds (sometimes it would disconnect)

    Also, I think it’s pretty bad that Heart are suggesting this is their biggest issue in history. Perhaps there is a new team in place, as we certainly haven’t forgotten the issues from 2014 when the datacentre move knocked off the email network for several days.

     
    • Craig Cotter

      17/02/2016

      Hi William,

      Thanks for the update. We’re running benchmarks across our whole shared platform right now to see if any (and how many) servers are performing below standard. This will tell us if there are any network or database connectivity issues, I’m working with our sysadmin team on that right now. I’ll drop you an email with an update when I have one.

       
  • Zach Ashton

    17/02/2016

    I just want to commend the people over at Heart Internet. They fixed the issue and helped us endure as little downtime as possible. Thank you guys a billion!

     
  • Geoff Bryan

    17/02/2016

    Many thanks for the full explanation – it is much appreciated and I know that your team worked tirelessly to get things resolved.

    Although none of my sites or servers were directly affected beyond the initial couple of hours, one major effect of the outage could have easily been prevented. For the whole period while your infrastructure servers were down there appeared to be no name servers available, so whilst I could RDP into my dedicated and hybrid servers using their IP addresses, and both were running fine, none of the sites running on them could be resolved from a URL.

    In addition to this, I have a number of customers using a third party Hosted Exchange email service where, in the absence of a running name server, no incoming mail was able to find the correct MX mail server.

    A simple change to have the 2 default name servers in different data centres would have prevented all these extra issues and I’m pleased that that is one option you are planning to implement.

     
  • Manninagh Dooie

    17/02/2016

    You have handled a disaster which could have had a far worse outcome with absolute professionalism and above all Openness and Honesty, which is a refreshing and encouraging approach in these days of spin and “double-speak”. Thank you for this blog and your effective communications.

     
  • Jon Perkins

    17/02/2016

    My VPS has several SQL Server databases and two of these were corrupted by the sudden shutdown. Fortunately I wrote an automated backup process of my own for each database so I was able to restore the databases without too much data loss (nothing that caused any major issues). It did lose me the entire afternoon and working until 11:30 that night to fix and be sure that everything else is sound, but I got there.

    Certainly this should never have happened in the first place, but a sincere “well done” to you all for getting through the massive exercise in logistics and triage that this would have entailed. I realise that others are affected much more severely than I was so they will understandably be left in a different frame of mind but well done to you all for enduring such a rough time on the entire staff. I will be staying with Heart because my overall level of satisfaction is still high and I do recognise that disasters sadly happen from time to time (but rarely I hope).

     
  • Mel Launder

    17/02/2016

    Thank you for your comprehensive update. I didn’t envy your situation at all, although, like yourselves and I am sure others, I found the situation and the wait very frustrating. I don’t have any problems with outages as long as the customer service notifications are frequent, and to date yours always have been. As I had a few websites on the last few restored servers, the wait was longer than I liked.
    I also have noticed that over the last couple of years there has been an increase in outages and this concerns me. However, during this episode I feel there was (and still is) a lack of support and notifications for the Resellers. I feel you need to discuss ways of strengthening the connection and communication between yourselves and Resellers, so that we in turn can try to keep our customers a little happier during these issues.
    Maybe some form of email/text notification if one of our sites is down? With further continued direct reseller email/text updates so we can keep our customers informed.

     
    • Craig Cotter

      17/02/2016

      Hi Mel,

      Subscribe to our new status page at status.heartinternet.uk – you’ll get text or email updates every time we report an incident. We’ll be putting this in the header of our site shortly.

       
  • Mike Barber

    17/02/2016

    Hi,

    Sounds like a nightmare; you all did well to overcome it!
    I seem to remember that just prior to the power problem there was a large DDoS attack. Can this be excluded as a cause of, or contributor to, the power problem?
    If so, how can you be so certain?

    Cheers
    Mike
    Customer

     
  • Phil

    17/02/2016

    Thanks for the explanation and although affected by the outage, it hasn’t impacted on my business as much as it has others so I’m not going to rant and rave about that here. What I would like to raise is the inability of your support staff to react to replies to tickets that you raise yourselves regarding accounts that are taken offline due to bad scripts or hacks etc.

    Why does it take you 4 hours (and counting) to reinstate an account that you took offline at 11:30 this morning when I replied to your ticket within minutes with a fix? My customer is still without his website and I’m looking like a fool thanks to your inadequate monitoring of tickets that you raise yourselves.

    Why would my reply sit in a queue for 4 hours (and still counting) when you raised it with me?

    Why are you unable to track these tickets differently and then act upon them when replies are received?

    I appreciate entirely that your shared hosting service has to be monitored and its integrity maintained, and therefore you would be wrong to let rogue scripts or insecure accounts exist when found, but when I, your customer, respond immediately to the problem raised and you then take a further 4 hours (and still counting) to reply, that can’t be right, can it?

     
    • Craig Cotter

      17/02/2016

      Hi Phil,

      Throughout today we have been prioritising tickets where customer sites are down (through deactivation or DNS), but wait times are still high. If we have not already, we should get to you soon.

       
  • Nigel Revill

    17/02/2016

    Like many resellers here, I’m sure, my mobile and office phone had an extremely busy day. I found the status page updates very useful, as they allowed me to keep customers up to date.

    I have been with HI since around 2003 and have no plans to look elsewhere for reseller accounts even after last week’s events. Yes, some of my customers were extremely annoyed, especially the ones on 75 – 79, but things can go wrong sometimes and I will look after these customers when their renewals come up as I don’t want to lose them.

    I am sure HI will put things into place to avoid things like this happening again and by the sounds of things you already have started.

     
  • 19/02/2016

    Excellent detail on this outage and all round good communication on the Service Status pages, meaning I didn’t need to contact Support when my site was down. On a personal note, I saw good recovery time thank you.

     
  • Gordon Jeffrey

    22/02/2016

    Things happen but I was impressed by your honesty. Well done.

     
  • Chris Power

    23/02/2016

    Stuff happens. A worrying time for everyone, especially yourselves, but you dealt with it. Above all, you kept us informed at every stage. Well done to all concerned!

     
  • Chris Jones

    24/02/2016

    Thanks for the explanation. As a small reseller of 7+ years I have been very happy with the service. This latest incident has only lost us 2 clients but most (of the good ones) seem to be happy to stay with us. Hope it doesn’t happen again though.

     
  • 25/02/2016

    I have about 20 websites with Heart – all shared hosting.

    When I suddenly ‘lost’ the website I was working on, and checked, and found, that I lost others, I suspected that there had been a major problem.

    I did seriously consider phoning or emailing Heart Support, but I had the presence of mind to check the Heart System Status Page first.

    Bingo! Major problem confirmed!

    The last thing I was going to do was then add to the load of your Support Team, realising that all of Heart’s resources would be working flat out to recover the situation.

    I say that ‘stuff’ happens and it is clear that Heart are already planning another server hall, so an added level of resilience.

    I think that Heart have handled this major outage extremely well. They have kept us informed, every step of the way.

    Not good that this major outage happened, but Heart’s response to a major problem, and keeping your customers informed, was ‘excellent’!

     
  • John Bob

    27/02/2016

    I think Heart have done very well to recover from this. Yes, there are always improvements to be made, but I do want to say this to those resellers who aren’t satisfied and are complaining, and it’s quite blunt.

    1) Regardless of who you host your services with, these things happen. Even the likes of Google and Microsoft have had major outages. It is the very nature of IT and there is no single business on this planet that can promise 100% uptime in all scenarios – it is not possible.

    2) For those who are running critical sites or services for your customers, you should think about resilience and DR with respect to Heart, i.e. what happens if they fail. Heart is a component that delivers your service to your customers and, like any other service in the world, be it electricity, water, gas, etc., it can fail; if you cannot deal with a situation like this, that is down to your own lack of planning. I resell an awful lot of services through Heart, but my T&Cs clearly state that services are not guaranteed 100% uptime, and I even go the extra mile of telling every customer that, including during yearly renewals. Some potential customers have walked away because of that, but I’d rather be honest than promise something I can’t guarantee to deliver.

    3) If you feel you could provide better performance, resilience and technical expertise for less than a 4 figure sum per year then you should do it yourself. Remember what you get for your money’s worth. If you provided a similar solution, self hosted, then you’d be paying a 6 – 7 figure sum to achieve the same. Think about the servers, physical location, UPS, generators, air con, connectivity, staff, licensing; the list goes on. It is insanely expensive to achieve the same level of service if you did it yourself.

    4) For those who use VPS or dedicated and have corruptions, it is not Heart’s responsibility to provide backups UNLESS you paid for that option, and even then you should expect data loss as backups are only done at scheduled times, not in real time. It’s quite clear. If you have a PC at home, don’t back it up and it goes pop, then it’s your fault, clear and simple. If you have a need to fail over the service with 0% data corruption then you should have arranged that, be it hosting a 2nd server with another provider and setting up a VPN link OR, the preferred method, hosting your own servers in dedicated data centres where you have failovers to other data centres.

    Whilst a 2nd data hall would be great, surely this will result in price increases for the reseller and all other packages. For those who say this should have been done beforehand, think about striking the balance between cost and the possibility of this happening.

    Heart have done an excellent job of recovering from what is a massive technical challenge. I hope those resellers who are complaining realise this and from now onwards put their own DR plan in place – Heart are not responsible for your business.

    Well done again Heart!

     
  • 27/02/2016

    I have been with Heart for around 10 years or so now, and have never had any major issues; this is the first, and it caused a few problems. But seriously, 10 years and one major problem for the business that caused me a few problems. I only wish every company I deal with had the same record. Hope you get it all sorted and thank you for the updates; for me that’s what makes Heart stand out from the rest – Communication.

     
  • Steven McDonald

    01/03/2016

    A nightmare situation for sure. I think you have proved you did your best here and I am happy that all these procedures are in place and being updated to minimise future events such as this.

     
  • 01/03/2016

    Hi Craig

    Although it’s clearly annoying and unwelcome to have our site go down, I’d just like to record our praise for your communication and efforts to get everything working again so quickly.

    We’ve had nothing but good service and reliability from HI and understand that at times serious things happen. The world isn’t a perfect place and you seem to have good future planning in place to be able to cope better in future, so I’m more than happy with the service.

    You should also be praised for transparency and honesty – I’ve used and continue to use multiple companies for website hosting and can easily say I’ve never had any of them admit mistakes (or even apologise in many instances), let alone blog about them, so well done for taking this step.

    We’ll happily be renewing our contract when it’s due.

    Many thanks
    Andrew (on behalf of the Learning Innovation team at The Open University)

     
  • Ian

    01/03/2016

    Many just expect everything just to “work”. This perhaps highlights the fact of life that many things – especially technology – sometimes don’t work. I have to say this incident, although hard for customers and provider alike, perhaps is a blessing. I spent almost 10 years in the military and one thing is certain. You can run as many “what if” scenarios and training sessions as you like, but it’s only when the &*(% hits the fan that you really know what you’ll do, if your disaster plans work, if your infrastructure is robust, etc etc.

    Well done for keeping us all in the loop and communicating. Many would just have buried their head in the sand and pretended that it was someone else’s problem. Respect and hope things get back to normal for you soon. 🙂

     
  • 10/03/2016

    The joys of working with technology! Glad you handled it well and it was resolved quickly. Luckily I have minimal clients on my VPS so no damage caused. Can’t say I saw an email informing me of the outage tho. Would have been useful to have a heads-up before the phone started ringing with angry customers.
    Anyway, well done for resolving it so quickly.

     
    • Kate Bolin

      11/03/2016

      Hi Gareth,

      Sorry to hear you didn’t get an email informing you. We sent one out on the 16th to all customers.

      Now, however, if you sign up to receive notifications for https://www.heartstatus.uk/, you’ll have a heads-up through email, RSS, or SMS as soon as something happens.

       

Comments are closed.
