Microsoft removes guide on how to train LLMs on pirated Harry Potter books

• Microsoft removed a blog post that promoted using pirated Harry Potter books to train LLMs.
• The post was written by senior product manager Pooja Kamath, who recommended the dataset for generative AI demos.
• The post linked to a Kaggle dataset, incorrectly marked as public domain, that contained all seven Harry Potter books.
• Backlash on Hacker News led Microsoft to delete the post; the dataset was removed from Kaggle after Ars Technica contacted its uploader.
• The incident highlights the risks of using copyrighted material for AI training without proper licensing.
• The post promoted a new Azure SQL DB and LangChain feature meant to simplify adding generative AI to apps.

Article Summaries:

Microsoft has removed a November 2024 blog post that outlined how developers could use a Kaggle dataset of the Harry Potter books to train large language models (LLMs). The post, written by senior product manager Pooja Kamath, promoted a new Azure feature and linked to a dataset that was incorrectly marked as public domain. Hacker News users criticized the post for encouraging the use of copyrighted material, prompting Ars Technica to contact the dataset’s uploader. After the backlash, Microsoft deleted the blog post, and the dataset was removed from Kaggle.

Sources:

https://arstechnica.com/tech-policy/2026/02/microsoft-removes-guide-on-how-to-train-llms-on-pirated-harry-potter-books/