OpenAI deleted ‘books1’ and ‘books2’ training datasets: water under the copyright bridge, sign of guilt, or spoliation of evidence?

Context: While the New York Times’s copyright lawsuit against OpenAI and Microsoft is the highest-profile one at the moment (June 12, 2024 ai fray article), the Authors Guild and various book authors previously brought copyright claims against the same defendants. The two most important of the book authors’ cases (Southern District of New York, case nos. 1:23-cv-08292-SHS and 1:23-cv-10211-SHS) have since been merged into a single consolidated class-action complaint (February 5, 2024 complaint (PDF)).

What’s new: Compared to the NYT case (on which the NYT itself reports rather frequently), the book authors’ consolidated case gets much less attention. As ai fray has now found in the course of researching related cases, OpenAI made a noteworthy admission in a March 22, 2024 letter that became part of the public record when it was attached to a May 6, 2024 court filing. OpenAI does not deny that it created and used two training datasets named books1 and books2, and admits to having deleted them, but disputes their relevance.

Direct impact: It is too early to predict what this will mean, but there is a wide range of consequences it could have. The purpose of this article is not to predict, not even between the lines, a likely outcome, but to analyze and explain the hypothetical fallout, including spoliation-of-evidence sanctions (such as a mandatory or permissive adverse inference) and the psychological effect of what the book authors’ lawyers could try to portray as a sign of guilt in the further proceedings.

Wider ramifications: As Google has experienced primarily in its antitrust dispute with Epic Games, the deletion of potentially relevant evidence can draw the ire of U.S. judges. There are, however, fundamental differences. What made Google’s deletion of sensitive chats worse was that the company’s leadership told everyone to discuss certain matters only in chatrooms that would not keep logs for more than a day. One key lesson from the Google Chat issue (over which a federal judge in San Francisco even wants to conduct a separate, further investigation) is that judges are disinclined to believe that the deletion of rather limited amounts of data by companies with vast IT resources amounts to routine housekeeping.

Here’s the May 6, 2024 letter motion by the Authors Guild and its co-plaintiffs (book authors) that first described the deletion of the books1 and books2 datasets as a very serious issue:

The relevant passage begins on page 3 of that letter with the section heading “IDENTITIES OF CRITICAL FORMER EMPLOYEES.”

The fact that those books1 and books2 datasets existed and were used to train ChatGPT was not a new revelation, though. The book authors’ complaint already mentioned them and said (among other things):

“Some independent AI researchers suspect that Books2 contains or consists of ebook files downloaded from large pirate book repositories.”

The evidentiary dispute now boils down to the following: OpenAI will deny whatever it can deny, and in this case it does not even want those alleged facts to be admitted.

The easiest way to find out whether pirated books were included in the training data, and generally to find out what was or was not contained in those two files, would be to examine those files. But according to OpenAI’s lawyers, that is not possible. They wrote a letter to the book authors’ counsel on March 22, 2024 that said the files had been deleted:

Oddly, the final paragraph of that letter says that “all other pre-training data for GPT-3 and GPT-3.5 is available for inspection.” But OpenAI’s lawyers “understand that the use of books1 and books2 for model training was discontinued in late 2021, after the training of GPT-3 and GPT-3.5, and those were then deleted in or around mid-2022 due to their non-use.” They said in that letter that they were still trying to locate the files or information on their “composition” (as a potential substitute for the actual datasets).

The book authors’ lawyers now want to talk to the two relevant former OpenAI employees about it:

“Given that OpenAI destroyed the direct evidence of the content of books1 and books2, these former employees are critically important to this case.”

That wording suggests, without making an outright accusation, spoliation of evidence. If litigators tell you that you’ve “destroyed […] evidence,” you know it takes just one more analytical step before you find yourself on the receiving end of a motion for spoliation sanctions (as Google knows particularly well).

Such letters in the discovery context are frequently signed by someone other than, but (at least on important matters) at the direction of, lead counsel. The consolidated class-action complaint lists three law firms, starting with Susman Godfrey. The first signatory is none other than Susman partner Justin Nelson, whose attorney profile on the firm’s website mentions that he “represented Dominion Voting Systems in its litigation against Fox, culminating in a $787.5 million settlement.” That’s the case over voting machines used in the 2020 election. One specialized publication wrote that he has always been a “go to” lawyer and is now a “must have” lawyer (April 24, 2023 LawFuel article). CNN reported that this was “the best lawyering [the judge has] had, ever.” And his profile shows he clerked for Supreme Court Justice Sandra Day O’Connor, a clerkship one gets only by being among the very best law students in the entire United States. ip fray recently reported on his new multi-billion-dollar (and AI-related) patent lawsuit against Microsoft (June 10, 2024 ip fray article).

With that kind of legal firepower, what will OpenAI (whose lawyers are obviously also world-class) have to expect here?

The deletion of those book datasets allegedly occurred in mid-2022, before those lawsuits started. Had it happened after the book authors’ complaint was filed, it would be even worse than the discovery issues Google is dealing with; but according to OpenAI’s letter, it happened well before.

Is it credible that they deleted those datasets only to free up space on some storage media, as routine housekeeping? Or is there more to it?

At this stage, ai fray does not want to speculate on what is more likely to be the case. What can be said is that when companies with vast IT resources delete data that is relevant to the issues in a given litigation, judges are skeptical of the explanation that the deletion had nothing to do with actual or potential legal problems.

Should the court conclude that OpenAI destroyed evidence with a view to potential (and by now, actual) copyright enforcement actions, a wide range of sanctions would be available. The one that would most up the ante for OpenAI would be an adverse-inference instruction. There are two kinds: a mandatory instruction, under which the judge tells the jury it must presume the destroyed evidence was unfavorable to the party that destroyed it, and a permissive one, under which the jury merely may draw such a conclusion from the destruction of evidence. The latter does not hurt as much as the former, but if the jury then loses trust in a party, the outcome can be just as bad.

For now, the docket does not contain a motion for sanctions, which makes all of this even more hypothetical. But there is an evidentiary dispute, and all sorts of things could happen after the book authors’ lawyers have had the chance to quiz those two former OpenAI employees. A can of worms, if not Pandora’s box, has been opened here.

Even in the absence of a motion for spoliation sanctions, the book authors’ lawyers led by the above-mentioned Justin Nelson can try to get mileage out of this in different ways. There is a risk for OpenAI that jurors will perceive the deletion of those files as a sign of guilt: it looks like some people at OpenAI weren’t comfortable keeping them. Now, if the fair use defense holds, then OpenAI was within its rights to use the material the way it did, and then it technically doesn’t matter that those datasets were deleted. But if they strongly believed they had a fair use defense, why delete the files?

The further proceedings will probably indicate in which way(s) the deletion of the books1 and books2 datasets will influence a future trial of the book authors’ case against OpenAI. Microsoft is also a defendant, but obviously has nothing to do with the deletion of those files. Microsoft has probably never had to explain anything even remotely like that.