AIs can generate near-verbatim copies of novels from training data

• The world’s top AI models can be prompted to generate near-verbatim copies of bestselling novels, raising fresh questions about the industry’s claim that its systems do not store copyrighted works.
• A series of recent studies has shown that large language models from OpenAI, Google, Meta, Anthropic, and xAI memorize far more of their training data than previously thought.
• AI and legal experts told the FT this “memorization” ability could have serious ramifications for AI groups’ battle against dozens of copyright lawsuits around the world, as it undermines their core defense that LLMs “learn” from copyrighted works but do not store copies.
• “There’s growing evidence that memorization is a bigger thing than previously believed,” said Yves-Alexandre de Montjoye, a professor of applied mathematics and computer science at Imperial College London.
• AI groups have long argued that memorization does not happen.
• In a 2023 letter to the US Copyright Office, Google said “there is no copy of the training data, whether text, images, or other formats, present in the model itself.” The AI industry also claims that training models on copyrighted books is “fair use,” arguing that the technology transforms the original work into something meaningfully new.

Article Summaries:

Recent studies reveal that leading large language models (LLMs) from OpenAI, Google, Meta, Anthropic, and xAI can produce near-verbatim passages from copyrighted novels, challenging the industry’s claim that these models do not store training data. Stanford and Yale researchers prompted the models to complete sentences from 13 bestsellers, generating thousands of words that matched the originals. For example, Gemini 2.5 reproduced 76.8% of Harry Potter and the Philosopher’s Stone, while Grok 3 produced 70.3%. This memorization evidence could undermine AI firms’ defense in ongoing copyright lawsuits and raises questions about the fairness of training on copyrighted works.
Recent studies reveal that leading AI language models from OpenAI, Google, Meta, Anthropic, and xAI can produce near-verbatim excerpts from bestselling novels when prompted, contradicting industry claims that these systems do not store copyrighted text. Researchers at Stanford and Yale demonstrated that models could regurgitate thousands of words from 13 books, including A Game of Thrones and Harry Potter, with high accuracy (e.g., Gemini 2.5 reproduced 76.8% of Harry Potter). This memorization challenges the AI sector’s defense that training on copyrighted works is “fair use” and that models merely “learn” without retaining copies, potentially impacting ongoing copyright lawsuits worldwide.
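
The summaries above report figures such as "76.8%" of a book reproduced, but do not say how that share was computed. As a rough illustration only, one plausible way to score near-verbatim overlap is to count the share of the original's words that reappear in the model output as long unbroken runs. This is a minimal sketch and not the method used in the studies: the function name `verbatim_coverage`, the `min_run` threshold, and the word-level whitespace tokenization are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def verbatim_coverage(reference: str, generated: str, min_run: int = 5) -> float:
    """Fraction of reference words that appear in `generated` inside
    matching runs of at least `min_run` consecutive words.

    Hypothetical metric for illustration; not the studies' scoring method.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0
    # autojunk=False stops difflib from ignoring very frequent words on long inputs.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    # Matching blocks never overlap, so their sizes can be summed directly.
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return covered / len(ref_words)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog near the river bank today"
    output = "he said the quick brown fox jumps over the lazy dog and left"
    print(f"{verbatim_coverage(original, output):.1%} of the original reproduced")
```

Ignoring short runs (here, under five words) keeps common phrases from counting as memorization. The threshold is a judgment call, and a real evaluation would need to define it carefully.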

Sources:

https://arstechnica.com/ai/2026/02/ais-can-generate-near-verbatim-copies-of-novels-from-training-data/ (Latest source article published: 2026-02-23 15:38 UTC)