Catastrophic Hardware Failure
Incident Report for BLE
Postmortem

On Sunday the 6th of September a power surge at our Data Centre caused both the system’s main hard drive and its back-up drives to fail across multiple servers. Expert data recovery specialists have worked on retrieving as much back-up data as possible. Their report states that all data up to the 6th of May 2020 has been fully recovered; any information added to the system after that date has been corrupted and is not retrievable.

We can confirm that the catastrophic hard drive failure is not related to any data breach; it was unexpected and outside both our control and the data centre’s.

If you have an account on the system, or you are a manager who has added staff to the system since the 6th of May 2020, you will need to re-register. Any training records or certifications gained within this period will need to be completed again.

If you have any queries about your account or the recent incident, please raise a support ticket either through your account once logged in or via the support ticket button on the portal login screen, and we will endeavour to respond as soon as possible.

Posted Sep 15, 2020 - 12:56 BST

Resolved
The system is back online and emails will be sent out to core users. Marking as resolved.
Posted Sep 14, 2020 - 12:28 BST
Update
The website will be back online Monday (14th) at 12pm (noon).
Posted Sep 11, 2020 - 16:15 BST
Monitoring
Hello,

The investigation into the SSD failures has been completed. Sadly, the results show that the drives cannot be recovered in their current state, and system backups are the best option for getting the system back online.

Note from recovery company:

Our experience is that this failure mode is typically observed when context information, which contains firmware critical to data recovery and is unique to every drive, is not saved correctly by the SSD controller to NAND when a power loss occurs, usually while the drive is actively writing data.

We're actively speaking with our current service provider to investigate the power loss possibility.
Posted Sep 11, 2020 - 16:14 BST
Update
The team working on our drives has come back confirming an issue with our SSDs' controller, which is the component that decides where data goes.
They're requesting until tomorrow to complete diagnostics that need extra resources.

We'll be chasing updates tomorrow to get the results as soon as possible.
Posted Sep 09, 2020 - 12:59 BST
Update
A small update, but an update nonetheless. We're expecting the results for the drives at lunchtime today. Once we have those results we will know which of our two main data set options to pursue, and we will share those details with you along with our timeframe for the site reopening.
Posted Sep 09, 2020 - 08:53 BST
Update
The team has come back confirming the hard drives have failed controller chips and need to go to another team for further work. The drives should arrive at the second location in the next couple of hours, and we should have word of the main results by this evening.

Our team has completed a large amount of recovery work, and we're just waiting on the above details before releasing the system, to ensure what we provide is the best and most recent data possible.

To be clear: no data has been breached. Our understanding is that a power surge broke the drives, and this was out of our control.
We're working late into the night and hope to come back with the results of the checks this evening or early tomorrow morning.
Posted Sep 08, 2020 - 14:04 BST
Update
While our communication line hasn't been open overnight, we have been working through the night to get the system back online.
We are currently awaiting an update on the drives and will be chasing it this morning. At 9am the Development Team and Project Managers met to confirm today's plans and to bring our other systems online. Once we have details about the drives we'll update here.

Getting learners back into the system is very important to us, and we're working non-stop to get this done.
Posted Sep 08, 2020 - 09:27 BST
Update
While the system remains offline and our team reviews the final options for the data backups we hold, including testing the failed drives, the website will show as being down with a link to this status page.

We hope to have results of our data options back as soon as possible.
Posted Sep 07, 2020 - 16:22 BST
Update
While the BLE remains offline, our main websites are now online.

The failed hard drives have been sent to a specialist company to explore further recovery options, as this would allow us to recover data up to the second the machine went offline.

For now, the BLE will remain offline while we make database corrections and evaluate data changes.
Account Managers will be reaching out to everyone in due course, and we aim to update this page as often as possible.
Posted Sep 07, 2020 - 15:18 BST
Update
Our websites, including our main sites, the BLE and other small projects, are currently offline.

We have recently been moving systems from our current provider to our own hardware, and this hardware failure comes at a terrible time. We are in the process of restoring the system but have no ETA to provide.

Short: the system is offline while we work to repair hardware; updates will be posted here as we have them.

Long: our system has been around for years, and although we're always working to rework legacy code, the amount of traffic and the number of systems running at once mean our SSDs work harder than those of a typical project. The system's core runs in what's called "RAID 1": data from one SSD (storage drive) is mirrored onto the other, so if one fails, the other can keep things running without a problem. On Sunday morning both drives failed at the same time, and this caused the whole machine to go offline.
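
For anyone who prefers to see the idea in code, here is a minimal, purely illustrative Python sketch of the RAID 1 mirroring concept (this is a toy model, not our actual storage stack): writes go to both drives, either drive on its own can serve reads, but losing both at the same time leaves backups as the only option.

```python
# Toy model of RAID 1 mirroring: two in-memory "drives" stand in for the SSDs.
class MirroredPair:
    def __init__(self):
        self.drives = [{}, {}]           # two mirrored "SSDs"
        self.healthy = [True, True]

    def write(self, block, data):
        # Every write lands on each healthy drive, keeping identical copies.
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                drive[block] = data

    def fail(self, index):
        # Simulate a drive dropping out of the array.
        self.healthy[index] = False

    def read(self, block):
        # Any surviving mirror can serve reads.
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                return drive.get(block)
        raise IOError("both mirrors failed - only backups can help now")


pair = MirroredPair()
pair.write("record-1", "learner training data")

pair.fail(0)                  # one drive failing is survivable...
print(pair.read("record-1"))  # still returns "learner training data"

pair.fail(1)                  # ...but both failing at once (as on Sunday) is not
# pair.read("record-1")       # would now raise IOError
```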

For the more techy of individuals, our current breakdown log is below:

06/09/2020:
10:00am - With it being a Sunday, we were alerted by clients that our machines were down.
10:22am - With our team unable to access our servers, contact was made with our current machine provider.
10:48am - Our provider confirmed they would send someone within the DC to inspect our machines.
5:35pm - Machines came back online but dropped a couple of minutes later due to an unknown issue.
6:34pm - Attempted to connect to an internal console to correct the SSD issues; the machine came online but dropped shortly after.
7:56pm - Unable to boot the machine with the current drives in place; the SSDs had failed without warning.
9:19pm - The on-site team placed two new hard drives into the machine to begin the rebuild.

07/09/2020:
6:00am - The on-site team picked up the ticket to see whether the old drives could be merged over to save time and some data.
8:45am - Main team now picking up the rebuild tasks and planning rebuild priorities.
Posted Sep 07, 2020 - 09:03 BST
Identified
We're currently working to replace failed hardware hosting our main systems. Details to follow.
Posted Sep 07, 2020 - 00:03 BST
This incident affected: Learning Management System (Website, Email Processing, Shopping Cart Processing) and Psittacus Systems Services (Email Hosting Services, Website Hosting Services).