In-depth reporting and analytical commentary on artificial intelligence regulation. No legal advice.

Perplexity AI ignores New York Times’s scraping ban (robots.txt), summarizes copyrighted article on EURO soccer match

Context: The use of copyrighted material by AI providers has already given rise to several infringement complaints, and the New York Times (NYT) is arguably the highest-profile media outlet to sue (June 12, 2024 ai fray article). One would assume that AI providers would be particularly cautious about potentially infringing NYT material. But no:

What’s new: A leading EU/German competition lawyer and regulatory expert has just demonstrated that Perplexity AI, a GenAI-powered search engine seeking to compete with Google (February 1, 2024 NYT article), regurgitates NYT content in contravention of a scraping prohibition on the NYT’s website, a prohibition that Perplexity’s search engine itself confirms it is aware of. A video (embedded further below) shows that Perplexity reproduces the NYT’s report on last night’s opening game (Germany vs. Scotland) of the EURO 2024 tournament of national soccer teams.

Direct impact: While OpenAI describes regurgitation as a rare bug and is embroiled in a discovery dispute with the NYT over failed attempts to elicit the reproduction of long passages from NYT articles (May 30, 2024 ai fray article), it appears shockingly easy to prompt Perplexity to display summaries of paywalled media content. It is unclear to what extent this amounts to regurgitation, but it appears that what Perplexity does is, at best, paraphrase NYT content.

Wider ramifications: Forbes has alleged the reproduction of its own paywalled content by Perplexity (June 7, 2024 Forbes article). So far, AI copyright lawsuits have focused on other targets, but Perplexity couldn’t do much more to provoke enforcement action.

Professor Dr. Thomas Hoeppner (“Höppner” in German), a partner at the well-known Hausfeld firm who has represented the media industry in high-profile cases, such as a case against Google before the European Court of Justice, ran a simple two-step experiment:

  • He asked Perplexity a question about yesterday’s EURO opening match. Perplexity apparently quoted the NYT article verbatim and pointed to the NYT as its source (in the form of links).
  • He then asked Perplexity whether it is actually barred by the NYT’s robots.txt file from scraping NYT content, and Perplexity confirmed.

Professor Hoeppner recorded a video and thankfully authorized ai fray to share it:

Maybe his interest in yesterday’s Germany vs. Scotland soccer match is attributable to the fact that those are the two countries in which he went to law school.

Perplexity’s valuation has skyrocketed in recent months, from $540 million in January to $1 billion in March, with a subsequent effort to raise funds at a valuation of $2.5 billion to $3 billion (April 23, 2024 TechCrunch article and May 29, 2024 The Information article).

There is no question that many media companies would like to see Google’s search engine come under competitive pressure. But they obviously don’t want it to happen at their expense.

This is the content of the NYT’s robots.txt file, which contains an explicit prohibition directed at the PerplexityBot:

# New York Times content is made available for your personal, non-commercial
# use subject to our Terms of Service here:
# Use of any device, tool, or process designed to data mine or scrape the content
# using automated means is prohibited without prior written permission from
# The New York Times Company. Prohibited uses include but are not limited to:
# (1) text and data mining activities under Art. 4 of the EU Directive on Copyright in
# the Digital Single Market;
# (2) the development of any software, machine learning, artificial intelligence (AI),
# and/or large language models (LLMs);
# (3) creating or providing archived or cached data sets containing our content to others; and/or
# (4) any commercial purposes.
# Contact https://nytlicensing.com/contact/ for assistance.

# Disallow Rules
User-agent: anthropic-ai
Disallow: /
[..] 
User-agent: GPTBot
Disallow: /
[..] 
User-agent: PerplexityBot
Disallow: /
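
For readers less familiar with the mechanism: robots.txt is a voluntary convention, and a crawler has to choose to consult it before fetching a page. The following is a minimal sketch, assuming only the three rules quoted in the excerpt above (the elided parts of the file are left out, so results for other user agents say nothing about the real file), of how a compliant crawler could evaluate those rules with Python's standard urllib.robotparser module. The helper name is_allowed, the article URL, and the SomeOtherBot agent are hypothetical and purely illustrative; this is not a description of how Perplexity's systems actually work.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the excerpt above; the elided "[..]" parts of the
# real file are NOT reproduced here (assumption: excerpt-only evaluation).
ROBOTS_EXCERPT = """\
User-agent: anthropic-ai
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_EXCERPT) -> bool:
    """Return True if the given rules permit `user_agent` to fetch `url`.

    Hypothetical helper for illustration: it parses the rules from a string
    instead of downloading them, so it runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL, used only to have a path to test against.
    article = "https://www.nytimes.com/example-article.html"
    for agent in ("PerplexityBot", "GPTBot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(agent, article))
```

Under these excerpted rules, a crawler that identifies itself as PerplexityBot and honors the file would not fetch any NYT path. Nothing in the file technically prevents the request, though, which is why compliance depends entirely on the crawler's operator.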

On LinkedIn, Professor Hoeppner, who has been spending a lot of time lately preparing for speeches on this very topic, notes that “[i]f even the NYT can’t protect itself, SME [small and medium-sized enterprise] local publishers may fare worse.”

Ignoring a robots.txt file is not necessarily copyright infringement. There could still be a defense, particularly fair use. According to Professor Hoeppner, those robots.txt files are not really useful. At least with respect to news content that is less than 24 hours old, he has definitely provided an example of that.