Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

Gavel in front of ChatGPT Screen
(Image credit: Shutterstock (2253205945))

A number of visual artists have filed suit over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed their own class-action suit against OpenAI, accusing the company behind ChatGPT and Bing Chat of copyright infringement because it allegedly used their books as training data. This appears to be the first lawsuit filed over the use of text (as opposed to images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries committed copyright infringement, violated the Digital Millennium Copyright Act and also ran afoul of California and common law restrictions on unfair competition. The writers are represented by the Joseph Saveri Law Firm and Matthew Butterick, the same team that is behind recent lawsuits filed against Stability AI and GitHub (over GitHub Copilot).

"Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act," the complaint says.

All three books also carry copyright management information (CMI) such as ISBN and copyright registration numbers. The Digital Millennium Copyright Act (DMCA) makes it illegal to remove or falsify CMI and, since ChatGPT's output does not contain that information, the plaintiffs allege that OpenAI violated the DMCA in addition to committing ordinary copyright infringement.

We reached out to Butterick for comment on the lawsuit and he referred us to his website, LLM Litigation, which has a detailed explanation of the plaintiffs' position and why they are suing.

"We’ve filed a class-action lawsuit against OpenAI challenging ChatGPT and its underlying large language models, GPT-3.5 and GPT-4, which remix the copyrighted works of thousands of book authors—and many others—without consent, compensation, or credit," the lawyers write.

They also criticize the concept of generative AI, writing that "'Generative artificial intelligence' is just human intelligence, repackaged and divorced from its creators."

Like Saveri and Butterick's lawsuit against Stability AI for using copyrighted images as training data, this one hinges on the belief that grabbing text from the open Internet to power an LLM is not fair use. That's a question that has not yet been answered in court.

However, using a copyrighted creative work as training data is quite a bit different from indexing content for search. One could argue that if the LLM is able to repeat key details from a book, it is harming the market for that work and is not truly transformative. On the other hand, if a human writes a plot summary of a book, that generally doesn't run afoul of copyright law. Ultimately, these questions will be decided by lawsuits like this one.


Avram Piltch is Managing Editor: Special Projects. When he's not playing with the latest gadgets at work or putting on VR helmets at trade shows, you'll find him rooting his phone, taking apart his PC, or coding plugins. With his technical knowledge and passion for testing, Avram developed many real-world benchmarks, including our laptop battery test.

  • umeng2002_2
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
  • InvalidError
    umeng2002_2 said:
    Why is it ok for a person to read something and learn, and then make money off what he learned; but it's not ok when an AI does it?
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
  • BX4096
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    Only to "provide plot summaries and answer questions about the books", as the plaintiffs themselves state, not to share the actual texts in violation of copyright law.

    To me, this looks more like a cheap PR move to promote their crappy books (the first I checked had a 3.05 rating on GoodReads) than a serious argument for copyright infringement.
  • bit_user
    InvalidError said:
    I'm guessing the distinction being that you learn stuff for yourself and subsequent application of that knowledge is limited by you being human while an AI can learn stuff once and apply the knowledge worldwide almost instantaneously.
    That's not a violation of the copyright law they're citing, though. They're claiming ChatGPT is a derivative work, but there's a legal standard for that and I think it's one that ChatGPT won't meet.

    Now, if they could trick ChatGPT into regurgitating passages of the books, verbatim, they might have a real copyright case. However, I'm guessing they tried and it didn't or wouldn't.

    I think the plaintiffs are essentially playing the lottery. Even though they probably know their chance of winning is small, they figure the payout could be big enough to make it worth a shot.