Why the Training of Generative AI Models Is a Violation of Copyright Law
The case for requiring licensed AI training based on new research.
One of my pet topics over the last year or so has been AI and copyright.
Critical voices claim, and I am afraid they are right, that legal analysis of this area barely matters.
Big Tech companies have proven time and time again that they can outpace the law by avoiding liability for dubious actions. Through technological innovation, the big technology companies, now monopolists, have managed to create their own laws, set their own precedents, and dictate the norms and standards of the digital industry.
Former Google CEO Eric Schmidt was not shy about admitting as much in a controversial talk he gave to students at Stanford University, apparently unaware that it was being livestreamed:
“If TikTok is banned, here's what I propose each and every one of you do. Say to your LLM the following:
Make me a copy of TikTok, steal all the users, steal all the music, put my preferences in it, produce this program in the next 30 seconds, release it and in one hour, if it's not viral, do something different along the same lines.
That's the command.
Boom, boom, boom, boom.
(...)
by the way, I was not arguing that you should illegally steal everybody's music.
What you would do if you're a Silicon Valley entrepreneur, which hopefully all of you will be, is if it took off, then you'd hire a whole bunch of lawyers to go clean the mess up, right?
But if nobody uses your product, it doesn't matter that you stole all the content.
And do not quote me.”
But quoted Schmidt was, and even though the recording was taken down soon after it was uploaded, the internet never forgets.
Schmidt’s advice closely mirrors what OpenAI and other foundation model providers have already done and are still doing. Specifically, the companies are using the works of others without authorization at an unfathomable scale to create new tools that are designed to compete with and replace artists, researchers, and all kinds of knowledge workers.
In the words of the great Ted Gioia, this is a parasite business:
“You might even say we live in a society where parasitical behavior is rewarded more than actual creativity.”
Yet, who will stop it?
The golden goose of American tech innovation, OpenAI, is targeting a $150 billion valuation. Microsoft and BlackRock are planning to launch a $30+ billion fund to invest in AI infrastructure and energy projects. ChatGPT is the default in the new generation of Apple products. The world’s top-valued companies are investing in data centers and expensive GPUs to train AI models like there is no tomorrow. It will take a strong stomach for any judge to rule that these companies are essentially building a new trillion-dollar industry through illegal means. Even though this is likely the case, as we will examine in a moment, there is a social and political reality to account for as well.
However, let’s not completely dismiss the possibility that a US court, for example in a case like “The New York Times v. OpenAI and Microsoft”, could rule in favor of the plaintiff. Here, the New York Times has claimed “billions of dollars in statutory and actual damages” and demanded the destruction “of all GPT or other LLM models and training sets that incorporate Times Works”. To borrow a good crypto term, a decision favoring the New York Times would be a rug pull.
Copyright laws in the US and the EU are harmonized through international treaties, but due to different legal traditions (common law vs. civil law), AI copyright cases could be decided differently on the two continents. I imagine that even if the copyright lawsuits against OpenAI fall flat in the US, European courts would be more willing to take a stance against the practice of training AI models on unlicensed material.
The Court of Justice of the European Union (CJEU), the main authority on the interpretation of EU law, is evidently not afraid to take wild swings with the hammer if doing so protects human rights or democratic principles, regardless of the economic and political consequences its decisions may have. For reference, see the Schrems cases.
I am not in the business of fortune-telling or reading the minds of judges, but we can and should consider what the most objectively fair outcome of cases like “The New York Times v. OpenAI and Microsoft” is.
Are the AI companies at odds with copyright law when they train new models? I think yes. That is also why the companies provide no transparency about the data sets that were used for AI training. As the RIAA writes in its complaint against Suno and Udio:
“After all, to answer that question honestly would be to admit willful copyright infringement on an almost unimaginable scale.”
I recently read a very well-reasoned piece on the subject, a multidisciplinary study titled “Urheberrecht und Training generativer KI-Modelle – technologische und juristische Grundlagen” (“Copyright and the Training of Generative AI Models – Technological and Legal Foundations”), which explores the technical implications of training AI models under European copyright law.
The paper was commissioned by Initiative Urheberrecht (the Copyright Initiative) and written by Tim W. Dornis, Professor of Civil Law and Intellectual Property Law at the University of Hanover, and Sebastian Stober, Professor of Artificial Intelligence at the University of Magdeburg, who holds a PhD in computer science.
In my view, the authors present a convincing case for why AI companies should either compensate or seek permission from copyright holders to use their data for training purposes.