
Court filing in OpenAI case reflects NYT’s fundamental problem with copyright law: facts are free for the taking

Context: Well over a dozen copyright infringement actions are pending in U.S. courts (March 12, 2024 ai fray article with three-page battlemap). The highest-profile of them is New York Times v. Microsoft & OpenAI, and motions to dismiss, or at least to narrow parts of that complaint, were brought by OpenAI (which is actually the primary defendant) (February 29, 2024 ai fray article) as well as Microsoft (courtlistener document (PDF)).

What’s new: On Monday (March 11, 2024), the NYT filed its opposition to OpenAI’s motion to dismiss, defending all of its claims as originally pleaded and requesting the opportunity to amend the complaint if necessary.

Direct impact: The case may be narrowed to some extent, but unlike some antitrust cases that are primarily resolved at the motion-to-dismiss stage, the present case is a typical summary judgment case, even more so in light of what the Second Circuit (which will also be the appeals court in this dispute) decided in the Google Books case. Based on the parties’ proposed timelines, a summary judgment decision could come down in late 2025.

Wider ramifications: In connection with AI and copyright, the first thought that comes to mind is fair use, an exception to copyright law. But the NYT’s latest filing shows that the NYT’s copyright claims have a problem with an even more fundamental principle of copyright law: the fact/expression (or idea/expression) dichotomy, according to which only the expressive parts and not the underlying facts are protected.

First, here’s the NYT’s opposition brief to OpenAI’s motion to dismiss:

The following passage from the table of contents is striking:

It’s common for litigants to describe themselves as the good guys and the other side as the bad guys. Up to a point, that will always involve oversimplification and hyperbole. But the above is absurd.

There is undoubtedly a lot of “world-class journalism” to be found at the NYT, which is why it enjoys the reputation it has. But some of its content comes from the people it quotes, some is provided by news agencies, and its business model rests on various pillars, some of which are simply about monetizing reach. That is not the key issue, though. The key issue is the way the NYT mischaracterizes OpenAI as essentially a criminal operation.

If copyright owners sue pirates (software pirates) or those enabling piracy, such as platforms where the vast majority of visits and downloads are driven by piracy (illegal copies of music files, say, or illegal rebroadcasts of sports events), that kind of characterization is accurate. There are businesses that either serve no legitimate purpose or whose lawful share of overall usage is tiny. Not so OpenAI, which started with a technological vision and wanted to achieve something positive for humanity.

The following paragraph from the NYT’s brief is telling because it states the nature of the business model concern, but incorrectly attributes to copyright infringement something that falls squarely outside the purview of copyright law:

“The Times funds its journalism through revenues derived from subscriptions, advertising, licensing, and affiliate referrals. […] Generating and maintaining traffic to The Times’s content is critical to its revenue streams. […] To facilitate that traffic, The Times permits search engines like Microsoft Bing to access and index its content, but inherent in this value exchange is the idea that search engines will direct users to The Times’s own websites and mobile applications, rather than exploit The Times’s content to keep users within their own search ecosystem. […] Defendants’ generative AI products threaten to divert that traffic.”

What the NYT correctly explains can be described as follows: Generative AI enables search engines to satisfy the information needs of a higher percentage of users than traditional search engines (though Google over time added more and more features that had the same effect without being categorized as GenAI). As a result, there may very well be fewer click-throughs on balance: GenAI results in some additional click-throughs that wouldn’t happen without it (if people elect to take a look at specific sources), but if many readers are simply satisfied with the information a GenAI chatbot provides, they’re done. That means more revenue for search engines and less for publishers, and ai fray does not deny that there are strong reasons to assume that the number of click-throughs lost (including to this very website) will exceed the number of additional ones that would not happen in the absence of GenAI.

Take the 1969 moon landing (Apollo 11 mission), for example. The average person trying to get the key facts will use a search engine, from where the next step is, with high likelihood, Wikipedia. Or they’ll go directly to Wikipedia. The search engine will list various web pages deemed relevant, and Wikipedia has some links to sources in fine print at the bottom, which presumably get a far lower click-through rate than the sources provided by ChatGPT.

If those users now ask ChatGPT, it will produce a text based on what it has inferred from the analysis of huge numbers of articles, one of which is presumably the July 21, 1969 New York Times article Men Walk On Moon. That article can be found in an online archive. Some of its most important passages are actually quotes, like “Eagle has landed” and “one small step for man.” But there is no question that the NYT produced a well-written article that states the facts, explains their significance and illustrates everything with plenty of quotes.

The likelihood of a Google user finding that article is very low: there are plenty of more recent articles on that historic event. The archived NYT article is not among the first few dozen search results, beyond which there are practically no click-throughs.

The NYT’s filing claims “Defendants have misappropriated almost a century’s worth of copyrighted content, without paying fair compensation.” So that 1969 article would just be in the middle of the period with respect to which the NYT now claims it deserves a lot more compensation than OpenAI offered.

The fundamental issue has to do with the fact/expression dichotomy, often also called idea/expression dichotomy.

Simply put, there’s a part that adversely affects the NYT’s business to the extent it depends on, or at least benefits from, traffic from search engines, and there’s a part that has no impact, or just a negligible one. What hurts the NYT and other publishers is that GenAI chatbots provide another means of obtaining facts (as does Wikipedia, without which there would also be more traffic going to websites like the NYT’s). Those facts are not protected by copyright. What copyright law protects is creative expression, and the NYT apparently can’t show examples of its expressive material being regurgitated unless one provokes it by typing a sequence of words from a particular NYT article (which, as a matter of statistics, is highly unlikely to happen unless one already knows the article and has access to it while entering the prompt).

Quoting a legal scholar, the Supreme Court in a 1991 decision (Feist v. Rural) effectively endorsed the blunt way of putting it: facts are free for the taking.

“[N]o matter how much original authorship the work displays, the facts and ideas it exposes are free for the taking. . . . [T]he very same facts and ideas may be divorced from the context imposed by the author, and restated or reshuffled by second comers, even if the author was the first to discover the facts or to propose the ideas.”

quoting Ginsburg, Creation and Commercial Value: Copyright Protection of Works of Information, 90 Colum. L. Rev. 1865 (1990)

The Supreme Court then continued that “[t]his may seem unfair,” but (quoting a dissent from an earlier case) “this is not ‘some unforeseen byproduct of a statutory scheme’”:

 It is, rather, “the essence of copyright,” […] and a constitutional requirement. The primary objective of copyright is not to reward the labor of authors, but “[t]o promote the Progress of Science and useful Arts.” […] To this end, copyright assures authors the right to their original expression, but encourages others to build freely upon the ideas and information conveyed by a work.

This principle, known as the idea/expression or fact/expression dichotomy, applies to all works of authorship. As applied to a factual compilation, assuming the absence of original written expression, only the compiler’s selection and arrangement may be protected; the raw facts may be copied at will. This result is neither unfair nor unfortunate. It is the means by which copyright advances the progress of science and art.

[The Supreme] Court has long recognized that the fact/expression dichotomy limits severely the scope of protection in fact-based works. 

In past cases requiring a distinction between facts and expression, the facts were typically taken from a single source. OpenAI, however, processes all the material it can get in order to draw inferences. If you ask ChatGPT about the 1969 moon landing, the underlying model has likely analyzed not just dozens or hundreds, but thousands of texts that talk about it.

The capability of ChatGPT that affects the NYT’s business in the form of reduced organic traffic from search engines is something lawful: stating facts. ChatGPT would be allowed to do that even if, as mentioned, the NYT were its exclusive source, which it is far from being.

The NYT tries to avoid the fact/expression dichotomy by focusing on the fact that articles are copied and stored somewhere, and regurgitation is then used as evidence that somewhere on OpenAI’s servers the original material still exists, even if stored in a form that is not human-readable unless and until one makes a query based on a passage from a particular article.

If OpenAI accessed a given article just once to glean some facts from it, and then stored only the facts, the facts would be separated more clearly from the expression than when one trains a language model, for which the full, expressive text is used. But all that is done with that expressive material (apart from the rare phenomenon of regurgitation) is to evaluate it statistically.

A simplified example: a computer program could go over a book and count the occurrences of letters, words and punctuation symbols. It would then tell you that the word “and” occurs X thousand times throughout the original Harry Potter book. Large language models are far more sophisticated, as they identify correlations and can put out complete texts as opposed to mere counts. But what the simplified example and AI training have in common is that even if the AI software went over the same document a million times, that would not be the equivalent of a single human reader getting access to the book for free. It’s just a technical process, not media consumption.
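To make that simplified counting example concrete, here is a minimal sketch in Python. The file name “book.txt” and the tokenization rule are illustrative assumptions, not anything from the case record or from OpenAI’s actual training pipeline:

```python
from collections import Counter
import re

def token_counts(path: str) -> Counter:
    """Return how often each word and punctuation mark occurs in a text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Split the text into words (including apostrophes) and standalone punctuation.
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text)
    return Counter(tokens)

# "book.txt" is a hypothetical input file.
counts = token_counts("book.txt")
print(counts["and"])  # total occurrences of the word "and" in the book
```

No matter how many times such a program runs over the same file, its only output is statistics about the text, never the text as a reading experience, which is the point of the analogy.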

Fair use involves the question of how the derivative (and in this case undoubtedly transformative) work impacts the market for the original one. The passage from the NYT’s opposition brief (quoted further above) about reduced organic traffic from search engines is true with respect to facts, which are free for the taking, and not true with respect to expressive material, which isn’t elicited from ChatGPT unless one already has access to the article anyway and quotes so much from it that ChatGPT will then try to find the continuation of that text (and even that is something OpenAI is apparently working hard to prevent going forward).

That’s one of the reasons for which it’s important that the question of fair use be resolved on summary judgment and not by a jury. With a jury, the NYT’s lawyers could try to create confusion. They could talk about the negative impact of facts being available from ChatGPT, which is normally not an act of infringement, and turn it all into a story of OpenAI’s $90B valuation, Microsoft’s $3T valuation, the entitlement of world-class journalists to fair compensation and other considerations that make something appear unfair, instead of properly distinguishing between harm from lawful acts (again, facts are free for the taking) and an alleged violation of copyright in an author’s expression.