On Wednesday 10th February, we suffered an interruption in power to our data centre facility during emergency works to fix a fault. This has caused disruption to customers’ services and our teams are working to resolve remaining issues.
We want to be open and transparent with our customers, so we are sharing the causes of the fault and taking this opportunity to reassure all of our customers of our commitment to them.
Our data centre
Our UK-based data centre facility is one of the most efficient and resilient data centres in Europe. We carry out tests on a regular basis and test our generators every weekend to make sure that everything is in order. Additionally, we have expert technicians staffing our data centre 24/7, every day of the year. We also have two UPS systems serving each data centre hall.
So, what happened?
- On Wednesday, one of our data centre halls suffered a power loss, which affected the facility for less than 9 minutes.
- Each data centre hall has two UPSs (uninterruptible power supplies), which feed into an LTM (load transfer module). The LTM manages the feed of power from the UPSs to the data centre hall, where your servers are housed.
- This piece of hardware automatically switches the power between the two external supplies, should one fail. This is part of our redundancy commitment to you.
- The LTM showed a fault on the primary power supply and was running on its backup. Our teams followed the guidelines and contacted the manufacturer, who sent an expert in this piece of hardware to our facility to investigate and fix the fault.
- The fault code indicated that there was a problem with the voltage monitor. As a safety measure, a fault of this kind automatically shuts down the primary power, even though the power supply itself is likely fine.
- The engineer on site assisted with the fitting of the replacement part. The procedure was then followed to turn off the already disabled switch and to change the part safely. Unfortunately, a safety mechanism in the device triggered incorrectly, which led to the data centre losing power.
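For readers who find it easier to follow in code, the sequence above can be sketched as a very simplified model in Python. This is an illustration only and not the manufacturer's actual logic: the class, field and feed names are hypothetical, and the real device is considerably more complex.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Feed(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


@dataclass
class LoadTransferModule:
    """Toy model of an LTM; all names here are illustrative assumptions."""
    voltage_monitor_ok: bool = True   # a monitor fault disables the primary feed
    backup_available: bool = True
    safety_tripped: bool = False      # protective mechanism that cuts all output

    def active_feed(self) -> Optional[Feed]:
        """Return the feed powering the hall, or None if the hall has no power."""
        if self.safety_tripped:
            return None
        if self.voltage_monitor_ok:
            return Feed.PRIMARY
        if self.backup_available:
            return Feed.BACKUP
        return None


ltm = LoadTransferModule()
normal = ltm.active_feed()            # normal operation: primary feed

ltm.voltage_monitor_ok = False        # voltage monitor fault reported
during_fault = ltm.active_feed()      # LTM falls back to the backup feed

ltm.safety_tripped = True             # safety mechanism triggers during the repair
after_trip = ltm.active_feed()        # no feed: the hall loses power
```

In this sketch the backup feed carries the hall while the primary is disabled, and power is lost only once the safety mechanism trips, which mirrors the order of events described above.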
What was at risk?
The safety of our teams is always our primary concern. Our data centre uses the same power as 150 homes all watching TV, with the heating on and the kettle boiling all at once. While a top priority is to keep your websites and services online, our highest priority is the safety of our engineers. When working with this kind of equipment, we have to ensure that everybody is safe. While we are extremely frustrated that the device triggered incorrectly, a false positive is better than a risk to the life of our engineer. Finally, if the LTM is bypassed whilst it is worked on, there is a risk of a power surge that could irreversibly damage the contents of the data centre.
Will this happen again?
A fault of this nature is almost unheard of, and we will be working with the manufacturer of our data centre hardware to ensure that this cannot happen again. We will, of course, keep you updated on any changes following our reviews.
We would like to apologise to all customers who have been affected. We appreciate your patience and understand your frustrations. Our teams have been working around the clock to resolve all remaining issues, and will continue to do so until everything is restored to our normal high level of service. For regular updates from our support team, please visit: https://www.webhostingstatus.com/