In-depth reporting and analytical commentary on artificial intelligence regulation. No legal advice.

OpenAI’s lawyers to court: New York Times doesn’t want to come clean on efforts to get ChatGPT to output passages from articles

Context: The (hypothetical) stakes are high in the copyright dispute between the New York Times and defendants OpenAI and Microsoft (May 21, 2024 ai fray article). But the case will be much smaller if the defendants can show that the “regurgitation” of entire paragraphs from NYT articles (or even entire articles as the result of a sequence of prompts) is an extremely rare bug and that the NYT and its lawyers couldn’t easily produce the outputs on which they base their main infringement allegation.

What’s new: A letter from OpenAI’s lawyers to Judge Sidney H. Stein of the United States District Court for the Southern District of New York raises two discovery issues, among them the fact that the NYT’s lawyers have so far declined to provide “[a]ll Documents and Communications relating to the creation of Exhibit J of the Complaint” (where the output of long passages from articles, or entire articles, is shown).

Direct impact: The question for Judge Stein to resolve is whether the attorney-client privilege and the protection of attorney work product that normally apply have been effectively waived “because that data and information was voluntarily disclosed to OpenAI in the course of interacting with ChatGPT LLMs.” In other words, the prompts that the NYT’s lawyers used were entered into a ChatGPT chat, so they were already disclosed; OpenAI just does not have a record of them. A second argument for requiring disclosure, and for piercing the usual protection of privileged material, is that the NYT put the related material at issue in this case by using it in the complaint (at least if Exhibit J is included). As OpenAI’s lawyers argue, it is a very relevant question “how difficult it was for Plaintiff to generate those outputs and whether its methodology accurately approximates realistic use of ChatGPT LLMs.”

Wider ramifications: This is a discovery issue that would also be relevant to some other AI copyright disputes, at least where defendants dispute that certain sample outputs were generated by methods that “accurately approximate[] realistic use” of the accused AI systems.

First, here’s the letter that OpenAI’s lawyers filed with the court yesterday:

The second part, which starts in the lower half of page 1 under the subhead “Plaintiff’s Regurgitation Efforts,” raises the question of whether the NYT’s lawyers can withhold documents related to how they tried to get ChatGPT to output paragraphs from NYT articles, or entire articles, in near-verbatim form.

Several requests for production (RFPs) have given rise to disagreement between the parties. The one that states in the simplest and broadest terms the request for material relating to how the NYT elicited a “regurgitation” from ChatGPT is RFP 2:

The part between the commas relates to “failed attempts” to get ChatGPT to regurgitate NYT articles. That is very important: information on a huge number of failed attempts would greatly diminish the relevance of the evidence the NYT presented.

RFP 22 is more detailed and relates to all the prompts that were entered:

In that context, too, the prompts that failed to result in regurgitation could make a major difference in this litigation.

It will be interesting to see how the court resolves this discovery dispute, and whether the information OpenAI is seeking has a major impact on the case. It could potentially show that the regurgitation the NYT is litigating over is just an outlier scenario that even the NYT’s lawyers, and a consultant they hired for the purpose of producing those outputs, could not easily elicit from ChatGPT.