77TB of Research Data Lost Because of HPE Software Update
Some updates are better avoided.
Kyoto University has lost a massive 77TB of critical research data from its supercomputer because Hewlett Packard Enterprise (HPE) issued a software update that caused a script to malfunction and delete backup data. As a result, days of work are gone, and a significant part of the wiped-out data is lost forever.
Kyoto University lost about 34 million files belonging to 14 research groups, deleted between December 14 and December 16, according to The Stack. GizChina reported that the university could not restore the data of four groups from backup, so that data is gone forever. Initially, specialists from Kyoto thought that the university had lost up to 100TB, but the final tally turned out to be 77TB of data.
HPE pushed an update that broke a script meant to delete log files more than ten days old. Instead of deleting only the old log files stored along with backups in a high-capacity storage system, the script wiped out files in that storage, erasing 77TB of critical research data.
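HPE has not published the script in question, so the following is only a minimal sketch of how an age-based log cleanup can be written defensively, not a reconstruction of what ran at Kyoto. The function names (find_stale_logs, purge_stale_logs), the ".log" suffix, and the dry-run default are all assumptions made for illustration. The idea is to refuse to run when the target directory is empty or resolves to the filesystem root, to touch only files that look like logs, and to delete nothing unless explicitly told to.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_logs(log_dir, max_age_days=10, suffix=".log", now=None):
    """Return log files directly under log_dir older than max_age_days.

    Hypothetical helper: the name, suffix and checks are illustrative only.
    """
    # An empty or unset directory value must never expand to a broad path.
    if not log_dir or not str(log_dir).strip():
        raise ValueError("log_dir is empty; refusing to scan")
    root = Path(log_dir).resolve()
    if root == Path(root.anchor):
        raise ValueError(f"refusing to scan filesystem root: {root}")
    if not root.is_dir():
        raise ValueError(f"not a directory: {root}")

    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY

    stale = []
    for path in sorted(root.iterdir()):
        # Skip symlinks, subdirectories and anything that is not a log file.
        if path.is_symlink() or not path.is_file():
            continue
        if path.suffix != suffix:
            continue
        if path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale


def purge_stale_logs(log_dir, max_age_days=10, dry_run=True):
    """Delete stale logs; with dry_run=True only report what would go."""
    stale = find_stale_logs(log_dir, max_age_days)
    for path in stale:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
    return stale
```

Calling purge_stale_logs with the default dry_run=True prints the candidate files without removing anything, which gives an operator a chance to review the list before a real run.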
HPE admitted that its software update caused the problem and took 100% responsibility.
"From 17:32 on December 14, 2021 to 12:43 on December 16, 2021, due to a defect in the program that backs up the storage of the supercomputer system (manufactured by Japan Hewlett Packard), the supercomputer system [malfunctioned]," a statement by HPE translated by Google reads. "As a result, an accident occurred in which some data of the high-capacity storage (/LARGE0) was deleted unintentionally. […] The backup log of the past that was originally unnecessary due to a problem in the careless modification of the program and its application procedure in the function repair of the backup program by Japan Hewlett Packard, the supplier of the super computer system. The process of deleting files malfunctioned as the process of deleting files under the /LARGE0 directory."
The team has suspended the backup process on the supercomputer. Kyoto University plans to resume backups by the end of January, after fixing the software problem and the script and taking measures to prevent a recurrence.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
-
Alvar "Miles" Udell
USAFRet said: What part of "offline backup" was unclear?
Saddest thing is it sounds like they haven't learned that lesson yet.
HP Supercomputer System Caused 77TB Data Loss At Japan's Kyoto Uni (gizchina.com)
Since it became impossible to restore the files in the area where the backup was executed after the files disappeared, in the future, we will implement not only the backup by mirroring but also an enhancement such as leaving the incremental backup for some time. We will work to improve not only the functionality but also the operation management to prevent a recurrence.
-
InvalidError
USAFRet said: What part of "offline backup" was unclear?
Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
-
-Fran-
Sometimes people at the office whine a lot about following "due process" when moving things into Live/Production environments, especially new people (think grads) and "cowboys" who come from small companies. This is the reason why there are people second-guessing your work (in a good way) and asking questions about what you're doing and whether you're 150% sure you understand what it is you're doing. As sad as it is, this is a good reminder that you always have to question anyone, even vendors, when they say "I have to do something in your system".
To all you people in SysOps and Development who hate filling in forms and going to review meetings, this is why due process exists within companies, especially big ones.
Regards.
-
USAFRet
InvalidError said: Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
True.
Obviously, multiple layers of brokenness.
It just weirds me out...every day, we are admonished to back up your data, make good passwords, good browsing habits...
And then, the major companies you entrust your data and info to....screw it up.
-
derekullo
Even if you delete data your snapshots should still have the data in them.
I can only assume they weren't keeping snapshots.
-
dalauder
derekullo said: Even if you delete data your snapshots should still have the data in them. I can only assume they weren't keeping snapshots.
It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?
-
InvalidError
derekullo said: Even if you delete data your snapshots should still have the data in them. I can only assume they weren't keeping snapshots.
That only works when the backup script or software responsible for creating snapshots or whatever backup strategy they were using is actually doing its job as intended instead of destroying the files it was meant to preserve.
They got screwed over by a buggy backup script. Their data would likely have been fine if they hadn't attempted to back it up with the "updated" backup script that ended up destroying two days' worth of data before they realized something went wrong.
-
Alvar "Miles" Udell
dalauder said: It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?
According to the Gizchina article, only 4 groups were not recoverable. If I understand the article correctly, the 77TB of files include all 14 groups, so the actual loss may be a small fraction of that.
RANGE OF INFLUENCE OF FILE LOSS
Target file system: /LARGE0
File deletion period: December 14, 2021 17:32-December 16, 2021 12:43
Files affected: files that had not been updated since 17:32 on December 3, 2021
Lost file capacity: Approximately 77TB
Number of lost files: Approximately 34 million files
Number of affected groups: 14 groups (of which 4 groups cannot be restored by backup)