'Would-Be Bank Robbers': Reddit Sues Perplexity, Data Firms Over AI Scraping

Reddit is suing the AI search developer Perplexity and the companies from which it buys AI training data, alleging the data firms are illegally scraping its content, violating its copyright protections.

The lawsuit was filed on Wednesday in the US District Court for the Southern District of New York. In addition to Perplexity, three data firms are named as defendants: Oxylabs UAB, AWMProxy and SerpApi.

In the filing, Reddit said the data firms circumvented Reddit and Google's technological barriers by accessing nearly 3 billion search engine result pages, or SERPs, in a two-week period in July, using techniques to mask their identities and locations. Reddit called them "would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead."


Reddit said it traced that illegally scraped data back to Perplexity, which is why it previously issued a cease-and-desist letter. Perplexity is still listed as a customer on the website of one of the data firms, SerpApi, along with Meta, Samsung and Nvidia.

Reddit is one of the most popular online platforms, with the company reporting over 110 million daily active users and more than 22 billion posts and comments. That has made it a prime source of the kind of human-created data that AI companies seek. Reddit has struck deals with OpenAI and Google to license its data. It has also sued Anthropic, alleging the company misused its data.

Perplexity was also recently sued for copyright infringement by Encyclopedia Britannica, which owns the Merriam-Webster dictionary.

Perplexity said in a statement posted on Reddit that it doesn't need to license content because it doesn't train foundational AI models. Perplexity's answers do draw on Reddit responses, which the company said were "lawfully" accessed.

Copyright is one of the most contentious legal issues for AI companies. They need massive quantities of human-generated content -- like Reddit posts -- to train and refine their AI models. Much of that content is copyrighted, which typically requires the company to negotiate with the rights holder in order to license and use it.

While some AI companies have struck multimillion-dollar licensing deals with publishers like Axel Springer, others have said their use of copyrighted material is fair use and therefore doesn't require them to pay. The specifics are being fought out in a series of lawsuits, with Meta and Anthropic notching fair use victories this summer. (Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)