
77TB of Research Data Lost Because of HPE Software Update

(Image credit: HPE)

Kyoto University has lost a massive 77TB of critical research data from its supercomputer because Hewlett Packard Enterprise (HPE) issued a software update that caused a script to malfunction and delete backup data. As a result, days of work are gone, and a significant part of the wiped-out data is lost forever.

Kyoto University lost about 34 million files generated between December 14 and December 16 by 14 research groups, according to The Stack. GizChina reported that data from four of those groups could not be restored from backup and is therefore gone for good. Kyoto's specialists initially estimated the loss at up to 100TB, but the final tally came to 77TB of data.

HPE pushed an update that caused a maintenance script, designed to delete log files more than ten days old, to malfunction. Instead of removing only the old log files stored alongside backups in a high-capacity storage system, the script deleted files throughout the backup area, erasing 77TB of critical research data.
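The exact script was not reproduced in the article, but the failure mode it describes is a classic shell-scripting hazard: a cleanup command scoped by a directory variable that, if the variable is ever empty or unset, silently widens to a much larger part of the filesystem. The sketch below is a hypothetical reconstruction (all names invented, running in a throwaway sandbox) showing the guard rails — `set -u` and `${VAR:?}` — that make such a script abort instead of deleting outside its intended scope.

```shell
#!/bin/sh
# Hypothetical reconstruction of the hazard described above (names
# invented): a cleanup script deletes files under a log directory, but
# if LOG_DIR were ever empty, an unguarded "find $LOG_DIR/ ... -delete"
# would become "find / ... -delete" and walk the whole filesystem.
set -u                              # abort on any use of an unset variable

work=$(mktemp -d)                   # sandbox standing in for real storage
mkdir -p "$work/LARGE0/logs" "$work/LARGE0/data"
touch "$work/LARGE0/logs/old.log" "$work/LARGE0/data/results.dat"

LOG_DIR="$work/LARGE0/logs"

# ${LOG_DIR:?...} makes the shell exit with an error if the variable is
# unset or empty, instead of letting find start at the filesystem root.
# (The real script keyed on age with something like -mtime +10; we match
# by name here so the demo works on freshly created files.)
find "${LOG_DIR:?LOG_DIR must be set}/" -type f -name '*.log' -delete

ls "$work/LARGE0/data/results.dat"  # the "research data" is untouched
rm -rf "$work"
```

With the guard in place, an unset variable turns a potential mass deletion into a hard stop with a readable error message.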

HPE admitted that its software update caused the problem and took full responsibility.

"From 17:32 on December 14, 2021, to 12:43 on December 16, 2021, a defect in the program that backs up the storage of the supercomputer system (supplied by Japan Hewlett Packard) caused the system to malfunction," reads a statement by HPE, translated from Japanese. "As a result, an accident occurred in which some data in the high-capacity storage (/LARGE0) was unintentionally deleted. […] During a functional repair of the backup program, a careless modification of the program and of its application procedure by Japan Hewlett Packard, the supplier of the supercomputer system, caused the process that deletes old, no-longer-needed backup logs to malfunction and instead delete files under the /LARGE0 directory."

The team has suspended the backup process on the supercomputer. Kyoto University plans to resume backups by the end of January, after the software problem and the script are fixed and measures to prevent a recurrence are in place.

Anton Shilov

Anton Shilov is a Freelance News Writer at Tom’s Hardware US. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • USAFRet
    What part of "offline backup" was unclear?
    Reply
  • Alvar "Miles" Udell
    USAFRet said:
    What part of "offline backup" was unclear?

    Saddest thing is it sounds like they haven't learned that lesson yet.

    HP Supercomputer System Caused 77TB Data Loss At Japan's Kyoto Uni (gizchina.com)

    Since it was impossible to restore files in the area where the backup was executed after they disappeared, in the future we will not only mirror the backup but also retain incremental backups for some time. We will work to improve not only the functionality but also the operational management to prevent a recurrence.
    Reply
  • InvalidError
    USAFRet said:
    What part of "offline backup" was unclear?
    Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
    Reply
  • -Fran-
    Sometimes people at the office whine a lot about following "due process" when moving things into Live/Production environments; especially new people (think grads) and "cowboys" who come from small companies. This is the reason there are people second-guessing your work (in a good way), asking questions about what you're doing and whether you're 150% sure you understand what it is you're doing. As sad as it is, this is a good reminder that you always have to question anyone, even vendors, when they say "I have to do something in your system".

    To all you people in SysOps and Development who hate filling out forms and going to review meetings: this is why due process exists within companies, especially big ones.

    Regards.
    Reply
  • USAFRet
    InvalidError said:
    Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
    True.
    Obviously, multiple layers of brokenness.

    It just weirds me out... every day we are admonished to back up our data, use strong passwords, keep good browsing habits...
    And then the major companies you entrust your data and info to... screw it up.
    Reply
  • derekullo
    Even if you delete data your snapshots should still have the data in them.

    I can only assume they weren't keeping snapshots.
    Reply
  • hotaru251
    This is why updates shouldn't be automatic.
    Reply
  • dalauder
    derekullo said:
    Even if you delete data your snapshots should still have the data in them.

    I can only assume they weren't keeping snapshots.
    It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?
    Reply
  • InvalidError
    derekullo said:
    Even if you delete data your snapshots should still have the data in them.

    I can only assume they weren't keeping snapshots.
    That only works when the backup script or software responsible for creating snapshots or whatever backup strategy they were using is actually doing its job as intended instead of destroying the files it was meant to preserve.

    They got screwed over by a buggy backup script. Their data would likely have been fine if they hadn't attempted to back it up with the "updated" backup script, which ended up destroying two days' worth of data before anyone realized something had gone wrong.
    Reply
  • Alvar "Miles" Udell
    dalauder said:
    It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?

    According to the Gizchina article, only 4 groups were not recoverable. If I understand the article correctly, the 77TB of files include all 14 groups, so the actual loss may be a small fraction of that.

    SCOPE OF FILE LOSS
    Target file system: /LARGE0
    File deletion period: 17:32, December 14, 2021 – 12:43, December 16, 2021
    Files lost: files not updated since 17:32 on December 3, 2021
    Lost file capacity: approximately 77TB
    Number of lost files: approximately 34 million
    Affected groups: 14 (of which 4 cannot be restored from backup)
    Reply
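The "mirroring plus retained incrementals" strategy quoted in the university's statement above can be sketched in a few lines of shell. This is a minimal illustration under assumed paths and tooling (not Kyoto's actual setup): each backup run writes a new dated snapshot directory, hard-linking files unchanged since the previous snapshot via rsync's `--link-dest`, so older snapshots stay intact and one bad run cannot wipe out every copy at once.

```shell
#!/bin/sh
# Minimal incremental-snapshot sketch (assumed layout, not Kyoto's real
# setup). Each call creates DEST/NAME as a full-looking snapshot of SRC,
# but unchanged files are hard links into the previous snapshot, so
# disk usage grows only by what changed between runs.
snapshot() {                      # usage: snapshot SRC DEST NAME
    src=$1; dest=$2; name=$3
    # most recent snapshot, if any (dated names sort lexicographically)
    prev=$(ls "$dest" 2>/dev/null | tail -n 1)
    if [ -n "$prev" ]; then
        rsync -a --link-dest="$dest/$prev" "$src/" "$dest/$name/"
    else
        rsync -a "$src/" "$dest/$name/"    # first run: plain full copy
    fi
}
```

Because every run lands in a fresh directory, a malfunctioning run can at worst ruin its own snapshot; restoring is just copying files back out of the last good dated directory.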