Why the Training of Generative AI Models Is a Violation of Copyright Law
The case for requiring licensed AI training based on new research.

One of my pet topics over the last year or so has been AI and copyright.
Critical voices claim, and I am afraid they are right, that legal analysis of this stuff barely matters.
BigTech companies have proven time and time again that they can outpace the law by avoiding liability for dubious actions. Through technological innovation, the big technology companies – now monopolists – have managed to create their own laws, set their own precedents, and dictate the norms and standards of the digital industry.
Former Google CEO Eric Schmidt was not shy about admitting as much in a controversial talk he gave to students at Stanford University, unaware that it was being livestreamed:
“If TikTok is banned, here's what I propose each and every one of you do. Say to your LLM the following:
Make me a copy of TikTok, steal all the users, steal all the music, put my preferences in it, produce this program in the next 30 seconds, release it and in one hour, if it's not viral, do something different along the same lines.
That's the command.
Boom, boom, boom, boom.
(..)
by the way, I was not arguing that you should illegally steal everybody's music.
What you would do if you're a Silicon Valley entrepreneur, which hopefully all of you will be, is if it took off, then you'd hire a whole bunch of lawyers to go clean the mess up, right?
But if nobody uses your product, it doesn't matter that you stole all the content.
And do not quote me.”
But quoted Schmidt was, and even though the livestream was deleted soon after it was uploaded, the internet never forgets.
Schmidt’s advice neatly fits what OpenAI and other foundation model providers have already done and are currently in the process of doing. Specifically, the companies are using the works of others without authorization at an unfathomable scale to create new tools that are designed to compete with and replace artists, researchers, and all kinds of knowledge workers.
In the words of the great Ted Gioia, this is a parasite business:
“You might even say we live in a society where parasitical behavior is rewarded more than actual creativity.”
Yet, who will stop it?
The golden goose of American tech innovation, OpenAI, is targeting a $150 billion valuation. Microsoft and BlackRock are planning to launch a $30+ billion fund to invest in AI infrastructure and energy projects. ChatGPT comes integrated by default in the new generation of Apple products. The world’s top-valued companies are investing in data centers and expensive GPUs to train AI models like there is no tomorrow. It will take a strong stomach for any judge to rule that these companies are essentially building a new trillion-dollar industry through illegal means. Even though this is likely the case, as we will look into in a moment, there is a social and political reality to account for as well.
However, let’s not completely dismiss the possibility that a US court, for example in a case like The New York Times v. OpenAI and Microsoft, could rule in favor of the plaintiff. Here, the New York Times has claimed “billions of dollars in statutory and actual damages” and demanded the destruction “of all GPT or other LLM models and training sets that incorporate Times Works”. To borrow a good crypto term, a decision favoring the New York Times would be a rug pull.
Copyright laws in the US and the EU are harmonized through international treaties, but due to different legal traditions (common law vs. civil law), AI copyright cases could be decided differently on the two continents. I imagine that even if the copyright lawsuits against OpenAI fall flat in the US, European courts would be more willing to take a stance against the practice of training AI models on unlicensed material.
The Court of Justice of the European Union (CJEU) – the main authority on the interpretation of EU law – is evidently not afraid to take wild swings with the hammer if doing so protects human rights or democratic principles, regardless of the economic and political consequences its decisions may have. For reference, see the Schrems cases.
I am not in the business of fortune telling or reading the minds of judges, but we can and should consider what the most objectively fair outcome of cases like The New York Times v. OpenAI and Microsoft would be.
Are the AI companies at odds with copyright laws when they train new models? I think yes. That is also why the companies provide no transparency about the data sets that were used for AI training. As the RIAA writes in its complaint against Suno and Udio:
“After all, to answer that question honestly would be to admit willful copyright infringement on an almost unimaginable scale.”
I recently read a very well-reasoned piece on the subject, a multidisciplinary study titled “Urheberrecht und Training generativer KI-Modelle – technologische und juristische Grundlagen” (“Copyright and the Training of Generative AI Models – Technological and Legal Basics”) that explores the technical implications of training AI models under European copyright law.
The paper was commissioned by Initiative Urheberrecht (The Copyright Initiative) and written by Tim W. Dornis, Professor of Civil Law and Intellectual Property Law at the University of Hanover, and Sebastian Stober, Professor of Artificial Intelligence at the University of Magdeburg, who also holds a PhD in computer science.
In my view, the authors present a convincing case for why AI companies should either compensate or seek permission from copyright holders to use their data for training purposes.
Copyright and the Training of Generative AI Models – Technological and Legal Basics
Dornis & Stober begin the paper by providing a comprehensive walkthrough of technical concepts such as unsupervised learning, supervised learning, reinforcement learning, ANNs, parameters, hyperparameters, embeddings, latent spaces, and more. It’s clear that the authors have a strong technical grasp of AI, and they lay out the foundation for understanding how the technology works.
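To make a couple of these concepts concrete, here is a toy sketch – my own illustration, not code from the paper – of what “parameters”, “embeddings”, and a “latent space” refer to. The vocabulary and dimensions are made up:

```python
import numpy as np

# Hyperparameters: choices made by the developer before training begins.
vocab = ["the", "cat", "sat", "mat"]
embedding_dim = 8

# Parameters: values the training process adjusts. The embedding matrix
# holds one learned vector per token; each row is a point in the model's
# latent space. Here it is just randomly initialized, i.e. "untrained".
rng = np.random.default_rng(seed=0)
E = rng.normal(size=(len(vocab), embedding_dim))

def embed(token: str) -> np.ndarray:
    """Map a token to its vector in the latent space."""
    return E[vocab.index(token)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two points in the latent space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Untrained, so the similarity is arbitrary; training would move related
# tokens closer together in the latent space.
print(cosine(embed("cat"), embed("mat")))
```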
Training a generative AI model involves several steps in which “reproductions” of the training data occur. Such reproductions typically require permission from the copyright holder or a license. In EU copyright law, the copyright holder’s reproduction right follows from Article 2 of the InfoSoc Directive.
According to Dornis & Stober, the copyright-relevant acts when training a generative AI model are the following (a simplified sketch of the pipeline follows after the list):
1. The collection, preparation, and storage of copyrighted works used for the AI training process. This part includes web scraping, copying the data into the model’s memory, and the preparation, creation, and storage of the model’s training corpus.
2. During the training process (pre-training and fine-tuning), reproductions of copyrighted works are memorized inside the model, even though no explicit storage mechanism is created. In a legal sense, this constitutes a reproduction of the work.
3. During the application of a generative AI model (e.g. ChatGPT via OpenAI’s website), the fully trained model may, in response to a prompt by the end user, copy and replicate works that were processed during the model’s training.
4. The very act of making a generative AI model available, either implemented in a system for users or for download as a whole, constitutes a “making available to the public” of the works that are replicated “inside” the model. The relevant provision here is the copyright holder’s right of communication to the public, which follows from Article 3 of the InfoSoc Directive.
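As promised above, here is a heavily simplified, hypothetical sketch of that pipeline – my own illustration, not any provider’s actual code – annotated with the four acts:

```python
def scrape(urls: list[str]) -> list[str]:
    # Act 1: collection, preparation, and storage -- every downloaded work
    # is copied into a training corpus that persists on disk.
    return [f"<contents of {url}>" for url in urls]  # placeholder download

def train(corpus: list[str]) -> dict:
    # Act 2: during pre-training/fine-tuning, works can end up "memorized"
    # in the parameters even though nothing stores them explicitly.
    return {"weights": [len(doc) for doc in corpus]}  # stand-in for weights

def generate(model: dict, prompt: str) -> str:
    # Act 3: at inference time, the model may replicate works it was
    # trained on in response to a user's prompt.
    return f"<completion of {prompt!r}>"

# Act 4: deploying `model` behind an API, or offering the weights for
# download, is per Dornis & Stober a "making available to the public"
# of the works replicated inside it.
model = train(scrape(["https://example.com/article"]))
print(generate(model, "Once upon a time"))
```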
In the EU, we don't have a "fair use" doctrine like in the US that allows for limited exceptions to copyright on a case-by-case basis. However, we do have a number of exceptions and limitations to copyright that are exhaustively enumerated in Article 5 of the InfoSoc Directive.
The most relevant of these exceptions with regard to the training of generative AI models is the exception for “temporary acts of reproduction” in the InfoSoc Directive Article 5 (1). Additionally, the Copyright Directive (DSM), which was adopted in 2019 and supplements the InfoSoc Directive, contains an important exception for text and data mining (TDM) in Article 4.
Let’s look at each of these exceptions in turn and clarify why they don’t cover the training of generative AI models.
Temporary Acts of Reproduction
Article 5 (1) of the InfoSoc Directive says:
“Temporary acts of reproduction referred to in Article 2, which are transient or incidental [and] an integral and essential part of a technological process and whose sole purpose is to enable:
(a) a transmission in a network between third parties by an intermediary, or
(b) a lawful use
of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.”
The CJEU clarified in a 2009 case, Infopaq International A/S v Danske Dagblades Forening, how the word “transient” should be understood:
“An act can be held to be ‘transient’ within the meaning of the second condition laid down in that provision only if its duration is limited to what is necessary for the proper completion of the technological process in question, it being understood that that process must be automated so that it deletes that act automatically, without human intervention, once its function of enabling the completion of such a process has come to an end.”
Dornis & Stober point out that most of the relevant processes when training generative AI models are not “transient” in nature. This is because the acts are not limited to a short duration of time, and deletion does not occur automatically but is always the result of a decision by the operator of the AI. This is true for web scraping, the creation and storage of training data, pre-training, and fine-tuning. Only the creation of temporary copies in the computer’s main memory during training and during the production of outputs can be classified as “transient” or perhaps even as “incidental”.
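The distinction can be illustrated with a trivial sketch (my own, simplified to the point of caricature): the training corpus persists on disk until an operator deletes it, while the copies made when iterating over it live only inside the loop:

```python
from pathlib import Path

# NOT transient: the corpus file persists until the operator deletes it.
corpus_file = Path("corpus.txt")
corpus_file.write_text("scraped works, stored indefinitely...")

# Plausibly transient: each `line` exists in main memory only for one
# iteration and is discarded automatically, without human intervention.
with corpus_file.open() as f:
    for line in f:
        processed = line.upper()  # stand-in for one training step
```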
Furthermore, AI training does not serve the purpose of “a transmission in a network between third parties by an intermediary”, and the use can only be considered “lawful” if the rights holder has permitted it or no legal restrictions apply.
Article 5 (1)’s requirement that the reproduction “have no independent economic significance” means, according to Dornis & Stober, that the reproduction must not convey or create any advantage that goes beyond the permitted use, and that no new possibility of use may arise. We all know that the purpose of training generative AI models is to profit economically from the underlying works in a way that is outside the copyright holders’ interest and control.
Furthermore, we cannot say that the reproductions occurring when training a generative AI model have no “independent” economic significance, since the individual training stages cannot be separated from each other without impairing the functionality of the model.
Overall, it is abundantly clear that Article 5 (1) cannot encompass the training of generative AI models.
The Text and Data Mining Exception
The Copyright Directive (DSM) provides an exception or limitation for text and data mining (TDM) in Article 4.
According to Article 4 (1), Member States shall provide for an exception or limitation to the exclusive rights of the copyright owner “for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining”.
Article 4 (2) states “Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.”
Article 4 (3) states that: “The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.”
Can we say that the training of generative AI models falls under this TDM exception?
First of all, the “reproductions and extractions” have to be based on “lawfully accessible works”. This must exclude, for example, paywalled newspaper articles and websites that have opted out of web crawling via robots.txt files. Unfortunately, as we know from the New York Times v. OpenAI and Microsoft case and the recent controversy regarding Perplexity, the big foundation model providers tend not to respect paywalls or crawling opt-outs while collecting data for AI training. This is also problematic with regard to Article 4 (3) as quoted above.
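For what it’s worth, honoring a machine-readable opt-out before fetching anything is technically trivial. A minimal sketch using Python’s standard library – the URL and user agent are made up:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's crawling rules

url = "https://example.com/articles/some-article"
if rp.can_fetch("HypotheticalTrainingBot", url):
    print("crawling allowed:", url)
else:
    print("opt-out reserved, skipping:", url)  # respect Article 4 (3)
```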
Article 4 (2) implies that the “reproductions and extractions” should be deleted once they are no longer necessary for the purpose of TDM. This requirement is technically difficult, if not impossible, to comply with after the model has been trained. For instance, if a model “memorizes” text from its training corpus by regurgitating it at length, there is no feasible way to go back and delete the memorized text from the model. As Dornis & Stober note, “It is occasionally argued that the deletion obligation is ineffective if permanent storage is necessary for functionality.” Whether it is technically possible for foundation models to comply with Article 4 (2) remains an open question.
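As a side note, detecting such memorization from the outside is straightforward in principle; removing it from the trained weights is the hard part. A toy check (my own illustration) for verbatim regurgitation:

```python
def find_verbatim_span(output: str, training_doc: str, n: int = 8) -> str | None:
    """Return the first n-word span of `output` that appears verbatim
    in `training_doc`, or None if there is no such span."""
    words = output.split()
    for i in range(len(words) - n + 1):
        span = " ".join(words[i : i + n])
        if span in training_doc:
            return span
    return None

doc = "the quick brown fox jumps over the lazy dog every single morning"
out = "I recall that the quick brown fox jumps over the lazy dog, I think"
print(find_verbatim_span(out, doc))  # -> "the quick brown fox jumps over the lazy"
```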
Dornis & Stober discuss at length the difference between “syntax” and “semantics”.
“Syntax” deals with the structure and order of words, whereas “semantics” focuses on the meaning of individual words and how they relate to each other.
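A toy contrast (my own, not from the paper): these two sentences carry roughly the same semantic content, yet share no three-word-order patterns, which is what a purely syntax-level comparison picks up:

```python
a = "The cat chased the mouse."
b = "The mouse was chased by the cat."

# Semantic level: a TDM-style fact extractor would distill the same
# relation from both sentences.
fact = ("cat", "chased", "mouse")

# Syntactic level: compare word-order patterns (trigrams). The two
# sentences share none, even though their meaning is nearly identical.
def trigrams(s: str) -> set[tuple[str, ...]]:
    w = s.lower().replace(".", "").split()
    return set(zip(w, w[1:], w[2:]))

print(trigrams(a) & trigrams(b))  # -> set(): no shared word-order patterns
```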
Recital 8 to DSM states:
“Text and data mining makes the processing of large amounts of information with a view to gaining new knowledge and discovering new trends possible.”
Dornis & Stober argue that the wording of the law refers exclusively to the evaluation of semantic information, not the syntax.
Traditionally, TDM is limited to searching and analyzing information and data sets, which does not affect the copyright holder’s reproduction rights. Only semantic information is involved in this kind of TDM. But unlike traditional TDM, generative AI models evaluate both the semantics and the syntax of their training data to generate new creative outputs. To quote Dornis & Stober (translated):
“Above all, the syntax of the training data, which is vectorially replicated in the model, is recombined when the output is generated. The resulting products are intended and suitable to compete with the works used for training and to displace them from the market.”
To further clarify why the TDM exception does not cover the training of AI models, we can look at it from a historical perspective.
The DSM was drafted by the European legislators in 2016, long before generative AI became a known phenomenon. For the same reason, the EU legislators do not mention the terms “artificial intelligence” or “AI” even once in the Directive. Dornis & Stober write (translated):
“On this basis, it can hardly be assumed that there is a legislative will to declare the provisions of the DSM Directive applicable to all future developments in the field of AI technology with virtually no restrictions.”
At the time the DSM was drafted, the expectation in the legal literature was that the information and findings extracted via TDM would not compete with or substitute the original works. As we know, generative AI tools often directly compete with and aim to substitute the works they were trained on.
We should also note that the AI Act does not expand the scope of TDM. On the contrary, Article 53 (1) (c) of the AI Act requires “General purpose AI model providers” to “adopt a strategy to ensure compliance with Union copyright and related rights and, in particular, to identify and comply with a reservation of rights invoked in accordance with Article 4(3) of Directive (EU) 2019/790, including through state-of-the-art technologies (..)”
Dornis & Stober interpret Article 53 (1) (c) to mean that the consent of rights holders is required when training generative AI models.
Finally, there is the classic three-step test, which is anchored in numerous international copyright treaties such as the TRIPS Agreement and the WIPO Copyright Treaty (WCT), and, as here, in Article 9(2) of the Revised Berne Convention:
“The legislation of the member states reserves the right to permit reproduction in certain special cases, provided that such reproduction neither impairs the normal exploitation of the work nor unreasonably infringes the legitimate interests of the author.”
The three-step test can be found in European copyright law in Article 5 (5) of the InfoSoc Directive.
According to this provision, exceptions or restrictions to the rights of authors must be measured in three steps:
(1) the restriction only applies “in certain special cases”;
(2) the “normal exploitation of the work or other protected subject matter” is not impaired;
(3) the “legitimate interests of the right holder” are not unduly violated.
After a somewhat lengthy discussion that is too extensive to fully summarize here, Dornis & Stober conclude that the TDM exception, applied to the training of generative AI models, fails to fulfill the criteria of the three-step test, namely because generative AI models can impair the normal exploitation of the works and violate the legitimate interests of the right holders.
The Big Picture
Based on my summary of the legal analysis by Dornis & Stober, it looks like training generative AI models involves several acts of reproduction that are not covered by any exception in EU law – in other words, copyright infringement. We are now back to the point I raised at the beginning of this post: why does it matter?
Dornis & Stober make a couple of astute observations about the role of legislators that I fully agree with. In summary:
We can expect that human creativity will increasingly be suppressed by AI. For this reason, legislators cannot sit idly by while AI technology is further developed and distributed to the public.
Contrary to current forecasts, AI will likely not cause an increase in creative production by humans. Rather, we can expect that the results of genuinely human creativity in many professions and industries will be replaced to a considerable extent by generative AI output.
The EU legislators should consider that uncompromising safeguarding of regulatory minimum standards is not about preventing AI innovation but rather about ensuring fair competitive conditions and appropriate compensation for the resources used.
I will leave it here. If you have any comments or counterarguments, or want some of the points further elaborated, please reach out.