Is Perplexity Stealing Content? News Outlets Cry Foul Over AI Chatbot’s Web Scraping Practices


The Thin Line: Perplexity AI, Fair Use, and the Future of Content

In the burgeoning age of generative AI, where chatbots synthesize information from the internet to provide detailed answers, a critical question arises: where does the line lie between fair use and plagiarism, between routine web scraping and unethical summarization? The question has become particularly salient in the case of Perplexity AI, a startup that combines a search engine with a large language model to deliver comprehensive answers rather than mere links. Unlike OpenAI and Anthropic, the makers of ChatGPT and Claude, Perplexity does not train its own foundation models, opting instead to use open or commercially available models to turn information gathered from the internet into answers.

However, a series of accusations leveled against Perplexity in June 2024 suggests that the startup’s approach treads a precarious path bordering on ethical transgression. Forbes accused Perplexity of plagiarizing one of its news articles within the startup’s beta Perplexity Pages feature. Wired subsequently accused Perplexity of illicitly scraping its website, along with other reputable publications.

Despite these allegations, Perplexity, which as of April 2024 was reportedly seeking $250 million in funding at a valuation approaching $3 billion, asserts its innocence. Backed by Nvidia and Jeff Bezos, the company maintains that it has honored publishers’ requests to refrain from scraping content and that it operates within the bounds of fair use under copyright law.

This dispute hinges on the interplay of two core concepts: the Robots Exclusion Protocol and fair use in copyright law.

Surreptitiously Scraping Web Content:

Wired’s investigation, published on June 19, 2024, alleges that Perplexity has disregarded the Robots Exclusion Protocol, a standard websites use to tell web crawlers which of their pages may not be accessed or used. Wired reported observing a machine tied to Perplexity engaging in this behavior on its own news site and across other Condé Nast publications.

Independent developer Robb Knight corroborated Wired’s findings with a similar experiment. Both the Wired reporters and Knight tested their suspicions by asking Perplexity to summarize a series of URLs and then watching their server logs as an IP address linked to Perplexity visited those sites. Perplexity then “summarized” the text from those URLs, even reproducing verbatim the text of a dummy website with minimal content that Wired had created for the test.
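
Wired has not published its test harness, but the shape of the experiment is easy to reproduce: serve a page at a URL shared only with the chatbot, then log whoever fetches it. Below is a minimal sketch in Python; the path, port, and log format are illustrative assumptions, not Wired’s or Knight’s actual setup.

```python
# Minimal sketch of a Wired/Knight-style canary test: serve a dummy page
# at a secret URL and log the IP and User-Agent of anything that fetches it.
# The path, port, and log format here are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

SECRET_PATH = "/canary-page"  # shared only with the chatbot, never linked
DUMMY_HTML = b"<html><body>Canary text that appears nowhere else.</body></html>"

class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record every visitor; if this path is disallowed in robots.txt,
        # a protocol-abiding crawler should never show up in the log.
        print(f"visit: ip={self.client_address[0]} "
              f"ua={self.headers.get('User-Agent')} path={self.path}")
        if self.path == SECRET_PATH:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(DUMMY_HTML)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CanaryHandler).serve_forever()
```

Ask the chatbot to summarize the canary URL and watch the log: any fetch from the service’s IP range shows the page was retrieved despite the exclusion rules.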

This is where the intricacies of the Robots Exclusion Protocol come into play. Web scraping, by definition, involves automated software known as crawlers scouring the web to index and collect information from websites. Search engines like Google utilize crawlers to incorporate web pages into their search results. Other companies and researchers also employ crawlers to gather data from the internet for purposes such as market analysis, academic research, and, as has become increasingly apparent, training machine learning models.

Web scrapers that adhere to the protocol first consult the “robots.txt” file served at a website’s root (e.g., example.com/robots.txt) to learn what they may and may not do. Many publishers now use it to forbid scraping their sites to build massive AI training datasets. Search engines and AI companies, including Perplexity, have said they comply with the protocol, but compliance is voluntary: nothing legally obliges them to.
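
For illustration, here is how a protocol-abiding crawler would consult those rules before fetching, using Python’s standard-library parser. The robots.txt content and example.com URL are hypothetical; GPTBot and PerplexityBot are the user-agent names OpenAI and Perplexity have published for their crawlers.

```python
# Sketch of a protocol-abiding crawler consulting robots.txt before fetching.
# The rules and example.com are hypothetical; GPTBot and PerplexityBot are
# the crawler names OpenAI and Perplexity publish for themselves.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

article = "https://example.com/news/some-article"
print(parser.can_fetch("PerplexityBot", article))  # False: AI crawlers barred
print(parser.can_fetch("Googlebot", article))      # True: covered by "User-agent: *"
```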

Perplexity’s head of business, Dmitry Shevelenko, countered Wired’s claims, asserting that summarizing a URL is not synonymous with crawling. "Crawling is when you’re just going around sucking up information and adding it to your index," he explained. Shevelenko further argued that Perplexity’s IP might appear as a visitor to a website "otherwise kind of prohibited from robots.txt" only when a user includes a specific URL in their query, which "doesn’t meet the definition of crawling."

"We’re just responding to a direct and specific user request to go to that URL," Shevelenko insisted.

In essence, Perplexity argues that when a user manually provides a URL to its AI, it is not acting as a web crawler but rather as a tool assisting the user in retrieving and processing the requested information.
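
The disagreement, then, is over which of two behaviors the software exhibits. A crawler fetches pages on its own initiative and stores them in an index; a user-triggered fetcher retrieves a single page because a person asked for exactly that URL. The sketch below illustrates the distinction; both functions are hypothetical illustrations, not Perplexity’s actual code.

```python
# Hypothetical contrast between crawling and a one-off, user-triggered fetch.
import re
import urllib.request

def fetch(url: str) -> str:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(seed_urls: list[str], limit: int = 100) -> dict[str, str]:
    """Crawler behavior: follow links autonomously and keep an index."""
    index, frontier = {}, list(seed_urls)
    while frontier and len(index) < limit:
        url = frontier.pop()
        if url in index:
            continue
        html = index[url] = fetch(url)  # stored for later reuse
        frontier += re.findall(r'href="(https?://[^"]+)"', html)  # keeps going
    return index

def fetch_for_user(url: str) -> str:
    """Claimed behavior: one on-demand fetch, nothing retained."""
    return fetch(url)
```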

However, for Wired and numerous other publishers, this distinction appears inconsequential. Visiting a URL and extracting its information for summarization, especially when done thousands of times a day, bears a striking resemblance to scraping.

Plagiarism or Fair Use?

Wired and Forbes have also accused Perplexity of plagiarism. Ironically, Wired alleges that Perplexity plagiarized the very article that scrutinized the startup for surreptitiously scraping its web content.

Wired reporters observed that the Perplexity chatbot “produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them.” One sentence in the chatbot’s output mirrored a sentence from the original article, which Wired considers plagiarism. By the Poynter Institute’s guidelines, plagiarism may occur when an author (or AI) uses seven consecutive words from the original source work.
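
The Poynter rule of thumb is simple to operationalize: slide a seven-word window over the suspect text and flag any window that also appears verbatim in the source. Here is a minimal sketch; the naive tokenization is a simplification of what real plagiarism checkers do.

```python
# Sketch of the "seven consecutive words" heuristic: flag any 7-word run
# in a summary that also appears verbatim in the source. Tokenization is
# deliberately naive (lowercased, punctuation stripped).
import re

def ngrams(text: str, n: int = 7) -> set[tuple[str, ...]]:
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_sequences(source: str, summary: str, n: int = 7) -> list[str]:
    """Return every n-word run that appears verbatim in both texts."""
    return [" ".join(gram) for gram in sorted(ngrams(source, n) & ngrams(summary, n))]

src = "The quick brown fox jumps over the lazy dog near the river"
out = "Critics noted the quick brown fox jumps over the lazy dog"
print(shared_sequences(src, out))
# ['brown fox jumps over the lazy dog',
#  'quick brown fox jumps over the lazy',
#  'the quick brown fox jumps over the']
```

Any non-empty result is a potential red flag under the heuristic; it is a screening tool, not proof of plagiarism.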

Forbes, for its part, also accused Perplexity of plagiarism. In early June the news site published an investigative report revealing that former Google CEO Eric Schmidt’s new venture was actively recruiting and testing AI-powered drones for military applications. The next day, Forbes editor John Paczkowski took to X (formerly Twitter) to say that Perplexity had republished the scoop as part of its beta feature, Perplexity Pages.

Perplexity Pages, currently accessible to a limited number of subscribers, promises to let users turn research into “visually stunning, comprehensive content,” according to Perplexity. The examples on the site were created by the startup’s employees and include articles like “A beginner’s guide to drumming” and “Steve Jobs: visionary CEO.”

"It rips off most of our reporting," Paczkowski asserted. "It cites us, and a few that reblogged us, as sources in the most easily ignored way possible."

Forbes reported that numerous posts curated by Perplexity’s team bore “strikingly similar wording” to original stories from multiple publications, including Forbes, CNBC, and Bloomberg. The posts garnered tens of thousands of views yet never attributed the publications by name in the article text; attribution instead took the form of “small, easy-to-miss logos that link out to them.”

Furthermore, Forbes said the post about Schmidt contained “nearly identical wording” to its scoop. The aggregation also included an image created by Forbes’ design team that appeared to have been lightly modified by Perplexity.

Perplexity CEO Aravind Srinivas responded to Forbes’ accusations at the time, saying the startup would make source citations more prominent in the future. That fix is not foolproof, because citations themselves can fail: ChatGPT and other models have shown a tendency to fabricate links, and since Perplexity relies on OpenAI models, it is susceptible to such “hallucinations.” In fact, Wired reported observing Perplexity hallucinating entire stories.
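
One partial mitigation, whatever prominence citations are given, is to verify that cited links actually resolve before surfacing them. The sketch below is an illustrative check using only the Python standard library, not Perplexity’s actual pipeline, and it cannot catch a real URL cited for a claim it does not support.

```python
# Illustrative citation sanity check: confirm each cited URL resolves
# before surfacing it. Not Perplexity's actual pipeline; a link that
# resolves can still be cited for the wrong claim.
import urllib.request
from urllib.error import HTTPError, URLError

def link_resolves(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a non-error HTTP status."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "citation-checker/0.1"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (HTTPError, URLError, ValueError):
        return False

citations = ["https://www.wired.com/", "https://example.invalid/made-up"]
verified = [u for u in citations if link_resolves(u)]  # drops the dead link
```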

Aside from acknowledging Perplexity’s "rough edges," Srinivas and the company have largely defended Perplexity’s right to utilize such content for summaries.

This is where the subtleties of fair use come into play. Plagiarism, although frowned upon, is not in itself illegal; the legal question is whether copyright has been infringed, and fair use is a defense to infringement.

According to the U.S. Copyright Office, it is legal to employ limited portions of a work, including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. AI companies like Perplexity argue that providing a summary of an article falls within the bounds of fair use.

"Nobody has a monopoly on facts," Shevelenko asserted. "Once facts are out in the open, they are for everyone to use."

Shevelenko drew a parallel between Perplexity’s summaries and the common practice of journalists drawing upon information from other news sources to strengthen their own reporting.

Mark McKenna, a law professor at the UCLA Institute for Technology, Law & Policy, told TechCrunch that the situation is intricate and lacks straightforward answers. In a fair use case, courts would assess whether the summary heavily relies on the original article’s expression versus merely its ideas. They would also examine whether reading the summary might serve as a substitute for reading the original article.

"There are no bright lines," McKenna said. "So [Perplexity] saying factually what an article says or what it reports would be using non-copyrightable aspects of the work. That would be just facts and ideas. But the more that the summary includes actual expression and text, the more that starts to look like reproduction, rather than just a summary."

Unfortunately for publishers, so long as Perplexity is not reproducing the original’s actual expression (though in some cases it seemingly is), its summaries may well be protected as fair use.

How Perplexity Aims to Protect Itself:

AI companies like OpenAI have forged media deals with various news publishers to gain access to their current and archival content for training their algorithms. In exchange, OpenAI promises to surface news articles from these publishers in response to user queries in ChatGPT. (However, even this arrangement has its share of unresolved issues, as reported by Nieman Lab last week.)

Perplexity has not announced media deals of its own, possibly waiting for the accusations against it to subside. But the company is reportedly “full speed ahead” on a series of advertising revenue-sharing arrangements with publishers.

The idea is for Perplexity to begin placing ads alongside query responses, with publishers whose content is cited in an answer receiving a share of the corresponding ad revenue. Shevelenko said Perplexity is also working to give publishers access to its technology so they can build Q&A experiences and embed features like related questions directly into their sites and products.
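
Perplexity has not disclosed how such a split would be computed, but the basic mechanics are easy to imagine: attribute each answer’s ad revenue to the publishers it cites. The sketch below is purely hypothetical; the proportional-split rule is an assumption, not Perplexity’s disclosed formula.

```python
# Purely hypothetical sketch of citation-based ad revenue sharing: split an
# answer's ad revenue across cited publishers in proportion to how many of
# the answer's citations each supplied. Perplexity has not disclosed its formula.
from collections import Counter

def share_revenue(ad_revenue: float, cited_publishers: list[str]) -> dict[str, float]:
    counts = Counter(cited_publishers)
    total = sum(counts.values())
    return {pub: ad_revenue * n / total for pub, n in counts.items()}

# An answer earning $0.40 that cites Forbes twice and Wired once:
print(share_revenue(0.40, ["forbes.com", "forbes.com", "wired.com"]))
# {'forbes.com': 0.266..., 'wired.com': 0.133...}
```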

But is this merely a façade for systemic IP theft? Perplexity is not the only chatbot posing a threat by summarizing content so thoroughly that readers find it unnecessary to click through to the original source material.

If AI scrapers like this persist in appropriating publishers’ work and repurposing it for their own businesses, publishers will find it increasingly difficult to earn ad revenues. This will eventually lead to a shortage of content to scrape, ultimately pushing generative AI systems to rely on synthetic data for training, potentially resulting in a vicious cycle of biased and inaccurate content.

The Ethics and Future of AI Content:

The ongoing saga of Perplexity AI exposes the perilous terrain navigated by AI developers as they push the boundaries of natural language processing and content summarization. The blurring lines between fair use and plagiarism, between scraping and summarizing, raise crucial ethical and legal questions that demand thoughtful consideration.

Publishers, grappling with the potential for their content to be exploited without proper compensation, are forced to confront the evolving landscape of information dissemination. The challenge lies in finding a balance between safeguarding their intellectual property and embracing the potential benefits of collaboration with AI companies.

Moreover, the question arises whether AI-generated content, trained on scraped and potentially plagiarized content, will ultimately erode the trust and authenticity of the digital information landscape. If AI systems become increasingly reliant on synthetic data, the consequences could be far-reaching, potentially impacting the credibility and reliability of online information in profound ways.

As AI technology continues to advance, the need for transparent and ethical guidelines governing the use of online content is becoming increasingly urgent. The dialogue must extend beyond legal frameworks and encompass a broader consideration of the societal impact of such practices. Striking a balance between innovation, ethical responsibility, and the protection of intellectual property is imperative for shaping a future where AI can serve as a valuable tool for information access, without sacrificing the integrity and reliability of online content.

Emily Johnson
Emily Johnson is a tech enthusiast with over a decade of experience in the industry. She has a knack for identifying the next big thing in startups and has reviewed countless internet products. Emily's deep insights and thorough analysis make her a trusted voice in the tech news arena.