Context: Books are among the primary sources of training material for LLMs, and arguably even the most useful media category for that purpose. Class-action lawsuits brought by fiction and non-fiction book authors against OpenAI and Microsoft were consolidated in the Southern District of New York (June 22, 2024 ai fray article).
What’s new: A new copyright infringement lawsuit was filed yesterday in the Northern District of California, with three book authors (of both fiction and non-fiction works) forming the nucleus of a class action against Claude LLM maker Anthropic, a startup valued in the tens of billions of dollars (with revenues approaching a billion dollars a year) that has received backing from Amazon and Google.
Direct impact: For various reasons, among them Anthropic’s pride in its LLM’s ability to put out entire novels as well as lead counsel’s profile, this case promises to be one of the four or five most interesting AI copyright cases (among a few dozen already) in the United States.
Wider ramifications: Copyright enforcement efforts against AI systems draw a lot of attention when OpenAI is the target, but there are already a number of interesting cases targeting other AI providers.
First, here’s the complaint, which is as focused as it is forcefully written:
A few initial observations:
Timing: The consolidated book authors’ case against OpenAI and Microsoft in the Southern District of New York is further along, but no major decision has been made there yet. The California case against Anthropic could still become the first book authors’ case in which certain AI copyright questions are decided. It was a logical choice for book authors (meaning the ones who don’t want their works to be used to train AI LLMs) to bring claims against Anthropic. A recent public admission by Anthropic that a repository of pirated books was used in the training of Anthropic’s Claude AI system probably didn’t trigger this complaint, but may have been a reason not to wait much longer as it was nothing short of an invitation to sue.
Venue: Anthropic is headquartered in San Francisco, making the Northern District of California a venue from which the case won’t be transferred elsewhere. While patent holders generally avoid that forum as best as they can, it has so far not appeared to be hostile to copyright holders in an AI context. In fact, just about a week ago, a judge in that same district allowed copyright infringement claims by visual artists to go forward against Stability AI and other defendants (August 16, 2024 article by legal.io). Even though the Digital Millennium Copyright Act (DMCA) claims were thrown out, Judge William H. Orrick acknowledged that the question of whether there is a strict identicality requirement for the removal of copyright management information was “unsettled,” which supports some class-action lawyers’ argument that this particular legal question should be appealed prior to a final judgment on other claims (July 27, 2024 ai fray article).
Book authorship: While ChatGPT is largely used as a knowledge base (a Q&A machine) and for the generation of shorter documents, Anthropic prides itself on the production of entire novels. That makes it a particularly logical target for the book authors’ copyright infringement lawsuit. The complaint mentions an example of a book (tech journalist Kara Swisher’s memoir Burn Book) that ended up having to compete on Amazon with “AI generated copycats.” The following paragraph also highlights the substitutive impact of Anthropic’s Claude AI:
“Claude in particular has been used to generate cheap book content. For example, in May 2023, it was reported that a man named Tim Boucher had ‘written’ 97 books using Anthropic’s Claude (as well as OpenAI’s ChatGPT) in less than year, and sold them at prices from $1.99 to $5.99. Each book took a mere “six to eight hours” to “write” from beginning to end. Claude could not generate this kind of long-form content if it were not trained on a large quantity of books, books for which Anthropic paid authors nothing”
Counsel: This case was brought by the three firms behind a book authors’ case in the Southern District of New York against OpenAI: Susman Godfrey, Lieff Cabraser Heimann & Bernstein, and Cowan DeBaets Abrahams & Sheppard. First-named as lead counsel is Susman’s Justin Nelson, who achieved an $800M settlement for his client Dominion Voting Systems against Fox News last year (April 18, 2023 NPR report). Mr. Nelson has also had various successes enforcing intellectual property rights.
Focus on reproduction of material for training purposes (input): Like the book authors’ case in New York, this one focuses on input rather than output. The complaint alleges copyright infringement by reproduction of the relevant material, and the text makes it clear that this is about how LLMs are trained.
Simple message: The complaint tries to focus its readers on the simple fact that while humans who learn from books will first (have to) legally buy them, Anthropic “cut corners”:
“Book readers typically purchase books. Anthropic did not even take that basic and insufficient step. Anthropic never sought—let alone paid for—a license to copy and exploit the protected expression contained in the copyrighted works fed into its models. Instead, Anthropic did what any teenager could tell you is illegal.” (emphasis added; it’s highly unusual for legal documents to use the “you” form, but here it was clearly intended to connect with readers, who may one day be jury members)
Anthropomorphisms: Several passages allude to Anthropic’s claims that Claude AI is particularly similar to human beings in how it operates. The contrast between humans learning from books they buy or legally borrow and Anthropic’s use of a repository of pirated books (“The Pile”) to train Claude is just one such example.
Licensing: The complaint notes that OpenAI has already entered into various content license agreements, unlike Anthropic.
Amazon and Google: Both of those Big Tech companies value their strategic alliance with Anthropic (despite developing AI systems internally as well). An Amazon-backed company being sued by book authors is ironic if one remembers that Amazon started as an online bookseller (though the vision had presumably always been to build an online store carrying everything). With Google, the situation is more nuanced. On the one hand, no Big Tech company has advocated the weakening of U.S. copyright law, through its defenses to copyright infringement lawsuits, more than Google. On the other hand, Google itself is none too pleased to see other AI providers scraping YouTube content (which is almost entirely created by parties other than Google, but Google has gatekeeper power over those creators and would like to control the way YouTube content is used to train AI systems).