Heart Internet would like to apologise to those customers affected by an outage on a portion of our shared and premium hosting platforms.
At 13.00 on Tuesday, 17 January, our systems engineers identified a problem with some of our customers’ websites, caused during a routine software update to a number of core packages on our servers.
During this update, a technical fault led to some websites on some servers becoming unreachable. This was not a fault with the hardware or the data centre, but an issue with the software that affected each server differently.
Because this was routine maintenance, carried out as part of our commitment to continually improve website performance, it took place during normal business hours. The update had also been tested on our staging platform, where it worked without any problems.
Once we restored the servers, there were further connectivity issues between the web servers, NAS drives, and the database servers. Our team attempted to script the recovery to restore service as quickly as possible, but as the work continued, we realised that the restoration could not be automated.
We wanted to be certain there was no data loss or server corruption, so our system administrators personally checked each server and website to confirm that all data and content were present and running correctly. These manual checks caused the extended delay in restoring all the websites.
We are still synchronising some of the NAS drives, which may cause intermittent high load on some servers.
We are closely monitoring connectivity, software, and hardware within the data centre. We are now building into our update process a requirement that all updates are rolled out in stages and at a slower pace, even if they have been tested successfully on staging and live platforms.