Microsoft AI Under Fire for Allegedly Consuming 200,000 Pirated Books — Authors Fight Back

As artificial intelligence rapidly evolves, a controversy over Microsoft's AI models has ignited a heated debate about intellectual property rights.

Authors are speaking out, alleging that these powerful AI systems were trained on a staggering 200,000 pirated books.

The Allegations Against Microsoft

Microsoft faces accusations from a collective of authors who claim the tech giant developed an artificial intelligence model using around 200,000 unlawfully copied books. This marks the latest development in the ongoing legal dispute over copyrighted material involving creators and technology firms.

Authors including Kai Bird, Jia Tolentino, and Daniel Okrent claim that Microsoft trained its Megatron AI to generate human-like responses using pirated digital copies of their books.

The legal action, initiated by the authors in a New York federal court on Tuesday, joins several other significant cases. These high-stakes challenges come from authors, news organisations, and other copyright holders, all targeting technology firms such as Meta Platforms, Anthropic, and Microsoft-backed OpenAI, concerning the alleged misuse of their content in AI training.

The authors are seeking a court order to stop Microsoft's alleged infringement and are demanding statutory damages of up to $150,000 (£109,280.46) for every work they claim the company misused.

Understanding Generative AI and Its Training

Generative artificial intelligence (Gen AI) tools, such as Megatron, are designed to create various outputs, including text, music, pictures, and videos, based on user commands. To enable this, software engineers build vast collections of media, which are then used to train the AI to generate similar material.

The writers' complaint alleges that Megatron, a Microsoft AI product designed to provide text answers to user prompts, was trained on a dataset of nearly 200,000 pirated books.

Another AI book training case kicked off yesterday. This time it’s Microsoft and its Megatron LLM, trained on the “notorious collection” of pirated books called “Books3,” in the crosshairs. pic.twitter.com/iuGfaNk7EF
— Rob Freund (@RobertFreundLaw) June 26, 2025

According to The Guardian, the complaint detailed how Microsoft allegedly used the pirated dataset to build a 'computer model that is not only built on the work of thousands of creators and authors, but also built to generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained.'

A Wave of Copyright Lawsuits

Just a day before the complaint against Microsoft, a California federal judge issued a mixed ruling regarding Anthropic. The judge found that training its AI systems with authors' material fell under fair use within US copyright law, yet left open the possibility of liability for the company's alleged pirating of their books.

The ruling was the first in the US to address the legality of training generative AI with copyrighted materials acquired without consent. Coincidentally, on the same day the Microsoft complaint was filed, a California judge ruled in favour of Meta in a similar dispute over the use of copyrighted books for AI model training. That decision, however, reportedly turned more on weaknesses in the plaintiffs' arguments than on the strength of Meta's defence.

Wow, ChatGPT claims it doesn't "have access to specific passages from copyrighted books," but then also offers to summarize one.

How can you summarize something you don't have access to? It does have access to many ©️ books, it's just trained to say this for legal cover. pic.twitter.com/yYrZHZsHzp
— The Short Straw (@short_straw) September 15, 2023

The legal battle over copyright and AI began shortly after the introduction of ChatGPT and extends across various media types. The New York Times has taken legal action against OpenAI, alleging copyright infringement of its article archives; similarly, Dow Jones, the publisher of the Wall Street Journal, and the New York Post have initiated a comparable lawsuit against Perplexity AI.

The legal challenges have broadened to include major record labels, who are now suing companies behind AI-driven music generation tools. In another instance, photography firm Getty Images has filed a claim against Stability AI concerning its text-to-image offering.

Recently, Disney and NBCUniversal have also taken legal action against Midjourney, the creator of a popular AI image generator, for allegedly misusing some of the world's most iconic film and television characters.

The Tech Industry's Defence

Tech companies argue that their use of copyrighted material to create new, transformative content falls under the fair use doctrine. They contend that being compelled to compensate copyright holders for their work could severely hinder the development of the AI industry.

Sam Altman, the CEO of OpenAI, has even stated that the development of ChatGPT would have been 'impossible' without access to copyrighted works.
