Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

Gavel in front of a ChatGPT screen (Image credit: Shutterstock)

A number of visual artists have filed suit over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed their own class-action suit against OpenAI, accusing the company behind ChatGPT and Bing Chat of copyright infringement because it allegedly used their books as training data. This appears to be the first lawsuit filed over the use of text (as opposed to images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries committed copyright infringement, violated the Digital Millennium Copyright Act (DMCA) and ran afoul of California and common law restrictions on unfair competition. The writers are represented by the Joseph Saveri Law Firm and Matthew Butterick, the same team behind recent lawsuits against Stability AI and GitHub (over GitHub Copilot).

The complaint alleges that Tremblay's novel The Cabin at the End of the World and two of Awad's novels, 13 Ways of Looking at a Fat Girl and Bunny, were used as training data for GPT-3.5 and GPT-4. Though OpenAI has not disclosed what is in its training data (which is kept secret), the plaintiffs conclude that the copyrighted novels must be included because ChatGPT was able to provide detailed plot summaries and answer questions about the books, a feat which they argue would require access to the complete texts.

"Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act," the complaint says.

All three books also carry copyright management information (CMI), such as ISBNs and copyright registration numbers. The DMCA makes it illegal to remove or falsify CMI and, since ChatGPT's output does not contain that information, the plaintiffs allege that OpenAI violated the DMCA on top of regular copyright infringement.

Though the lawsuit has only two plaintiffs right now, the lawyers are seeking class-action status, which would allow other authors whose copyrighted works were used by OpenAI to collect damages as well. They are asking for monetary damages, court costs and an injunction forcing OpenAI to change its software and business practices around copyrighted material.

We reached out to Butterick for comment on the lawsuit and he referred us to his website, LLM Litigation, which has a detailed explanation of the plaintiffs' position and why they are suing.  

"We’ve filed a class-action law­suit against OpenAI chal­leng­ing Chat­GPT and its under­ly­ing large lan­guage mod­els, GPT-3.5 and GPT-4, which remix the copy­righted works of thou­sands of book authors—and many oth­ers—with­out con­sent, com­pen­sa­tion, or credit," the lawyers write.

They also criticize the concept of generative AI, writing that "'Generative artificial intelligence' is just human intelligence, repackaged and divorced from its creators."

Like Saveri and Butterick's lawsuit against Stability AI for using copyrighted images as training data, this one hinges on the belief that grabbing text from the open Internet to power an LLM is not fair use. That's a question that has not yet been answered in court.

In a 2006 case, Field v. Google, an author sued the search engine for caching his work and making the cached versions available via search. However, a U.S. district court ruled in Google's favor, holding that Google's caching of the data was fair use. Judge Robert C. Jones wrote that holding documents in a cache is a transformative use (which weighs in Google's favor under the first of the four fair-use factors) and that it doesn't harm the potential market for the work (another factor). So simply storing copyrighted data on its servers in the form of a cache did not make Google liable.

However, using a copyrighted creative work as training data is quite different from indexing content for search. One could argue that if the LLM can repeat key details from a book, it harms the market for that work and is not truly transformative. On the other hand, if a human writes a plot summary of a book, that generally doesn't run afoul of copyright law. Ultimately, these questions are going to be decided in court, through lawsuits like this one.

OpenAI isn't the only company using copyrighted material for training, or even reproducing it in output. Google SGE, the company's new search experience, often plagiarizes whole sentences and paragraphs word-for-word from copyrighted articles. The outcome of this suit could have a much wider impact on the generative AI industry.

Avram Piltch
Avram Piltch is Tom's Hardware's editor-in-chief. When he's not playing with the latest gadgets at work or putting on VR helmets at trade shows, you'll find him rooting his phone, taking apart his PC or coding plugins. With his technical knowledge and passion for testing, Avram developed many real-world benchmarks, including our laptop battery test.
  • umeng2002_2
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
  • InvalidError
    umeng2002_2 said:
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
  • BX4096
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    Only to "provide plot summaries and answer questions about the books", as the plaintiffs themselves state, not to share the actual texts in violation of copyright law.

    To me, this looks more like a cheap PR move to promote their crappy books (the first one I checked had a 3.05 rating on Goodreads) than a serious argument for copyright infringement.
  • bit_user
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    That's not a violation of the copyright law they're citing, though. They're claiming ChatGPT is a derivative work, but there's a legal standard for that and I think it's one that ChatGPT won't meet.

    Now, if they could trick ChatGPT into regurgitating passages of the books, verbatim, they might have a real copyright case. However, I'm guessing they tried and it didn't, or wouldn't, do so.

    I think the plaintiffs are essentially playing the lottery. Even though they probably know their chance of winning is small, they figure the payout could be big enough to make it worth a shot.