Following backlash in a Hacker News thread, Microsoft deleted a blog post that critics said encouraged developers to pirate Harry Potter books to train AI models that could then be used to create AI slop. The blog, which is archived here, was written in November 2024 by a senior product manager, Pooja Kamath. According to her LinkedIn, Kamath has been at Microsoft for more than a decade and remains with the company.

In 2024, Microsoft tapped her to promote a new feature that the blog said made it easier to “add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.” What better way to show “engaging and relatable examples” of Microsoft’s new feature that would “resonate with a wide audience” than to “use a well-known dataset” like Harry Potter books, the blog said. The books are “one of the most famous and cherished series in literary history,” the blog noted, and fans could use the LLMs they trained in two fun ways: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”

To help Microsoft customers achieve this vision, the blog linked to a Kaggle dataset that included all seven Harry Potter books, which, Ars verified, has been available online for years and incorrectly marked as “public domain.” Kaggle’s terms say that rights holders can send notices of infringing content, and repeat offenders risk suspensions, but Hacker News commenters speculated that the Harry Potter dataset flew under the radar, with only 10,000 downloads over time, not catching the attention of J.K. Rowling, who famously keeps a strong grip on the Harry Potter copyrights.
Article Summaries:
- Microsoft removed a November 2024 blog post that critics said encouraged developers to use a Kaggle dataset of all seven Harry Potter books to train generative-AI models. The post, written by senior product manager Pooja Kamath, promoted a new feature for adding generative AI to applications with a few lines of code using Azure SQL DB, LangChain, and LLMs, citing the books as a “well-known dataset” for “engaging and relatable examples.” Ars Technica verified that the dataset had been available online for years and was incorrectly marked as public domain; Hacker News commenters speculated that, with only about 10,000 downloads, it had escaped the rights holder's notice. After backlash on Hacker News and a request from Ars, Microsoft deleted the blog and the dataset link.
Sources: