Meta staff torrented nearly 82TB of pirated books for AI training — court records reveal copyright violations

Facebook parent company Meta is currently fighting a class action lawsuit alleging copyright infringement and unfair competition, among other claims, over how it trained LLaMA. According to an X (formerly Twitter) post by vx-underground, court records reveal that the social media company used torrents to download 81.7TB of pirated data from shadow libraries including Anna’s Archive, Z-Library, and LibGen. It then used this data to train its AI models.

The evidence, in the form of written communications, shows that Meta’s own researchers were concerned about the company’s use of pirated materials. One senior AI researcher said back in October 2022, “I don’t think we should use pirated material. I really need to draw a line here.” Another said, “Using pirated material should be beyond our ethical threshold,” adding, “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”

"Torrenting from a corporate laptop doesn’t feel right" - Meta employee

Then, in January 2023, Mark Zuckerberg himself attended a meeting where he said, “We need to move this stuff forward... we need to find a way to unblock all this.” Some three months later, a Meta employee sent a message to a colleague saying they were concerned about Meta IP addresses being used “to load through pirate content.” They also added, “torrenting from a corporate laptop doesn’t feel right,” followed by a laughing-out-loud emoji.

Aside from those messages, the documents also revealed that the company took steps to keep its own infrastructure out of these downloading and seeding operations so that the activity couldn’t be traced back to Meta. The court documents say that this constitutes evidence of Meta’s unlawful activity, as it suggests the company took deliberate steps to circumvent copyright laws.

However, this isn’t the first time an AI company has been accused of taking information off the internet without permission. Novelists sued OpenAI as far back as June 2023 for using their books to train its large language models, with The New York Times following suit in December of that year. Nvidia has also been on the receiving end of a lawsuit filed by writers for using 196,640 books to train its NeMo model, which has since been taken down. A former Nvidia employee blew the whistle on the company in August of last year, saying that it scraped more than 426 thousand hours of video daily for use in AI training. More recently, OpenAI has been investigating whether DeepSeek illegally obtained data from ChatGPT, which just shows how ironic things can get.

The case against Meta is still ongoing, so we will have to wait until the court releases its decision to know whether the company committed direct infringement. And even if the writers win this case, Meta, with its huge financial war chest, will likely appeal the decision, meaning we will have to wait several months, if not years, to see the final court judgment.

Jowi Morales
Contributing Writer

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, with a focus on tech hardware and consumer electronics.

  • Gururu
    Of course they did. And of course they will get away with it.
  • Shiznizzle
    Did they think they could get away with it?

    Yes, they did. And they will get away with it, as the fine will be the cost of doing business and not even a fraction of 1% of one hour's worth of profits.

    That is if they get fined to begin with. They will laugh at the consequences.

    Not in a million years will they face the full penalty for each individual infraction. That would shut them down.

    My guess is the government will make a deal with them. Share the data and we will slap you on the wrist.

    Will they go after Google's YouTube with the same fervor?
  • COLGeek
    Aside from this particular case, one of the massive complaints regarding LLMs is the siphoning of copyrighted content. One consequence of this has been the pay-walling of many news sources, further diluting the gene pool of actual news vice algorithm enhanced garbage.

    Consumers need to be aware of what they are actually consuming.
  • Pierce2623
    This is the behavior you guys are funding with your constant use of “IG”. Bravo.
  • SomeoneElse23
    These giants make their money largely from advertising.

    Block all ads, and they wither away to nothingness.
  • gg83
    None of this AI crap is remotely profitable. They had to pirate the stuff, probably hoping they'd settle for much less than if they did it ethically.
  • P.Amini
    I am angry and mad. Even if they pay full prices it is unfair and inappropriate use of what they consume, let alone pirating it. I am really angry and mad about this.
  • alrighty_then
    Making a career as an author is looking more dicey each year.
  • atomicWAR
    This is just stupid. Meta does this and at best they'll get a small fine that doesn't begin to cover each copyrighted piece of content. If the average Joe got hit with this, they would end up with a criminal record and a fine (per copyrighted item) that would likely exceed Meta's (assuming they get fined at all).

    I am not a fan of AI and this certainly doesn't help things. Writers/publishers deserve proper payment for their works. Heck, I may use ad-block much of the time for security reasons, but even then I make a point of supporting the sites or authors I read on the regular by using affiliate links to buy things or purchasing swag directly from said sources. This just disgusts me. I hope the employees involved are punished and Meta ends up with a fine big enough to truly deter them from such behavior in the future.
  • ekio
    lol, because you think "Open"AI and Google didn't do the same??
    Like AI knows what happened in any book just by learning its crappy Wikipedia page?