Authors are suing Microsoft, alleging its AI model, Megatron, was trained on 200,000 pirated books. This case joins a growing number of lawsuits where creators accuse tech firms of misusing copyrighted material for AI development.

As artificial intelligence rapidly evolves, a controversy involving Microsoft's AI models has ignited a heated debate about intellectual property rights.

Authors are speaking out, alleging that these powerful AI systems were trained on a staggering 200,000 pirated books.

The Allegations Against Microsoft

Microsoft faces accusations from a collective of authors who claim the tech giant developed an artificial intelligence model using around 200,000 unlawfully copied books. This marks the latest development in the ongoing legal dispute over copyrighted material involving creators and technology firms.

Authors, including Kai Bird, Jia Tolentino, and Daniel Okrent, claim that Microsoft trained its Megatron AI to generate human-like responses using pirated digital copies of their books.

The legal action, initiated by the authors in a New York federal court on Tuesday, joins several other significant cases. These high-stakes challenges come from authors, news organisations, and other copyright holders, all targeting technology firms such as Meta Platforms, Anthropic, and Microsoft-backed OpenAI, concerning the alleged misuse of their content in AI training.

The authors are seeking a court order to stop Microsoft's alleged infringement and demanding statutory damages of up to $150,000 (about £109,280) for every work they claim the company misused.

Understanding Generative AI and Its Training

Generative artificial intelligence (Gen AI) tools, such as Megatron, are designed to create various outputs, including text, music, pictures, and videos, based on user commands. To enable this, software engineers assemble vast collections of media, which are then used to train the AI to generate similar material.

The complaint from the writers alleged that Megatron, a Microsoft AI product designed to provide text answers to user prompts, was trained using a database of nearly 200,000 pirated books.

According to The Guardian, the complaint detailed how Microsoft allegedly used the pirated dataset to build a 'computer model that is not only built on the work of thousands of creators and authors, but also built to generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained.'

A Wave of Copyright Lawsuits

Just a day before the complaint against Microsoft was filed, a California federal judge issued a mixed ruling in a case against Anthropic. The judge found that training the company's AI systems on authors' material qualified as fair use under US copyright law, yet left open the possibility of liability for its alleged pirating of their books.

The ruling was the first in the US to address the legality of training generative AI on copyrighted materials acquired without consent. On the same day the Microsoft complaint was filed, a California judge ruled in favour of Meta in a similar dispute over the use of copyrighted books for AI model training. That decision, however, reportedly turned more on weaknesses in the plaintiffs' arguments than on the strength of the tech company's defence.

The legal battle concerning copyright and AI began shortly after the introduction of ChatGPT and extends across various media types. The New York Times has taken legal action against OpenAI, alleging that the company infringed copyright in its archive of articles; similarly, Dow Jones, publisher of the Wall Street Journal, and the New York Post have filed a comparable lawsuit against Perplexity AI.

The legal challenges have broadened to include major record labels, who are now suing companies behind AI-driven music generation tools. In another instance, photography firm Getty Images has filed a claim against Stability AI concerning its text-to-image offering.

Recently, Disney and NBCUniversal have also taken legal action against Midjourney, the creator of a popular AI image generator, for allegedly misusing some of the world's most iconic film and television characters.

The Tech Industry's Defence

Tech companies argue that their use of copyrighted material to create new, transformative content falls under the fair use doctrine. They contend that being compelled to compensate copyright holders for their work could severely hinder the development of the AI industry.

Sam Altman, the CEO of OpenAI, has even stated that the development of ChatGPT would have been 'impossible' without access to copyrighted works.