Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

Gavel in front of a ChatGPT screen (Image credit: Shutterstock)

A number of visual artists have filed suit over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed their own class-action suit against OpenAI, accusing the company behind ChatGPT and Bing Chat of copyright infringement for allegedly using their books as training data. This appears to be the first lawsuit over the use of text (as opposed to images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries committed copyright infringement, violated the Digital Millennium Copyright Act, and ran afoul of California and common law restrictions on unfair competition. The writers are represented by the Joseph Saveri Law Firm and Matthew Butterick, the same team behind recent lawsuits against Stability AI and GitHub (over GitHub Copilot).

The complaint alleges that Tremblay's novel The Cabin at the End of the World and two of Awad's novels, 13 Ways of Looking at a Fat Girl and Bunny, were used as training data for GPT-3.5 and GPT-4. Though OpenAI has not disclosed whether the copyrighted novels are in its training data (which is kept secret), the plaintiffs conclude that they must be, because ChatGPT was able to provide detailed plot summaries and answer questions about the books, a feat they argue would require access to the complete texts.

"Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act," the complaint says.

All three books also carry copyright management information (CMI) such as ISBNs and copyright registration numbers. The Digital Millennium Copyright Act (DMCA) makes it illegal to remove or falsify CMI, and since ChatGPT's output does not include that information, the plaintiffs allege that OpenAI violated the DMCA on top of committing ordinary copyright infringement.

Though the lawsuit has only two named plaintiffs right now, the lawyers are seeking class-action status, which would allow other authors whose copyrighted works were used by OpenAI to collect damages as well. They are asking for monetary damages, court costs and an injunction forcing OpenAI to change its software and business practices around copyrighted material.

We reached out to Butterick for comment on the lawsuit and he referred us to his website, LLM Litigation, which has a detailed explanation of the plaintiffs' position and why they are suing.  

"We’ve filed a class-action law­suit against OpenAI chal­leng­ing Chat­GPT and its under­ly­ing large lan­guage mod­els, GPT-3.5 and GPT-4, which remix the copy­righted works of thou­sands of book authors—and many oth­ers—with­out con­sent, com­pen­sa­tion, or credit," the lawyers write.

They also criticize the concept of generative AI, writing that "'Generative artificial intelligence' is just human intelligence, repackaged and divorced from its creators."

Like Saveri and Butterick's lawsuit against Stability AI for using copyrighted images as training data, this one hinges on the argument that scraping text from the open Internet to train an LLM is not fair use. That's a question that has not yet been answered in court.

In a 2006 case, Field v. Google, writer Blake Field sued the search engine for caching his work and making the cached versions available via search. However, a U.S. district court ruled in Google's favor, holding that Google's caching of the data was fair use. Judge Robert C. Jones wrote that serving documents from a cache is transformative (a consideration under the first of the four factors used to determine fair use) and that it doesn't harm the potential market for the work (another factor). So simply storing copyrighted data on its servers in the form of a cache did not make Google liable.

However, using a copyrighted creative work as training data is quite different from indexing content for search. One could argue that if the LLM can repeat key details from a book, it harms the market for that work and is not truly transformative. On the other hand, a human who writes a plot summary of a book generally doesn't run afoul of copyright law. Ultimately, these questions will be decided by courts in lawsuits like this one.

OpenAI isn't the only company using copyrighted material for training, or even for output. Google's SGE (Search Generative Experience), the search giant's new AI-powered search, often plagiarizes whole sentences and paragraphs word-for-word from copyrighted articles. What happens in this suit could have a much wider impact on the generative AI industry.

Avram Piltch
Avram Piltch is Tom's Hardware's editor-in-chief. When he's not playing with the latest gadgets at work or putting on VR helmets at trade shows, you'll find him rooting his phone, taking apart his PC or coding plugins. With his technical knowledge and passion for testing, Avram developed many real-world benchmarks, including our laptop battery test.
  • umeng2002_2
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
    Reply
  • InvalidError
    umeng2002_2 said:
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    Reply
  • BX4096
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    Only to "provide plot summaries and answer questions about the books", as the plaintiffs themselves state, not to share the actual texts in violation of copyright law.

    To me, this looks more like a cheap PR move to promote their crappy books (the first I checked had a 3.05 rating on GoodReads) than a serious argument for copyright infringement.
    Reply
  • bit_user
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    That's not a violation of the copyright law they're citing, though. They're claiming ChatGPT is a derivative work, but there's a legal standard for that and I think it's one that ChatGPT won't meet.

    Now, if they could trick ChatGPT into regurgitating passages of the books, verbatim, they might have a real copyright case. However, I'm guessing they tried and it either couldn't or wouldn't.

    I think the plaintiffs are essentially playing the lottery. Even though they probably know their chance of winning is small, they figure the payout could be big enough to make it worth a shot.
    Reply