A Sensible Approach to Regulating AI in Copyright Law – The Input Phase
A new class-action lawsuit against Microsoft and OpenAI, Gary Marcus and Reid Southen in IEEE Spectrum, Public vs. Private datasets + Podcast with Sairam from Gradient Ascent
A quick note before we get into today’s post.
As I wrote in the beginning of the year, I will experiment with the video format in 2024.
I recently had the pleasure of talking with Sairam, who writes Gradient Ascent, an excellent resource to learn about the nuts and bolts of machine learning and AI. His publication has been featured in a Wired list of the most useful resources for learning about AI. Sairam is also the author of a book called AI for the Rest of Us, an illustrated, beginner-friendly guide on how to get started and navigate the complex world of AI.
Below is our 20-minute talk. Topics we discuss include:
How to get started with AI as a technical beginner
How to get started with AI as a lawyer
Google Gemini and how it will fare against GPT-4
What to do with hallucinations and AI-generated media
What’s ahead for 2024 in AI
Check it out.
A Sensible Approach to Regulating AI in Copyright Law – The Input Phase
Introduction
After a delayed Christmas present from The New York Times, OpenAI and Microsoft are facing yet another lawsuit, filed on January 5, this time by two independent journalists, Nicholas A. Basbanes and Nicholas Ngagoyeanes. The proposed class-action lawsuit takes aim at how OpenAI, with help from Microsoft, has trained GPT-3, GPT-3.5, and GPT-4 by using the work of millions of creators, including the work of the plaintiffs, without permission or compensation. This paragraph from the complaint sums up the core claim well:
“Professional writers like Mr. Basbanes and Mr. Gage have limited capital to fund their research. They typically self-fund their projects, which are often supplemented by advances on royalties from prospective publishers, and those advances are then deducted from any future earnings their books may generate. Working thusly on their own resources, they travel wherever they must to locate and secure access to primary research materials that have been previously unexamined and to conduct essential interviews. If they find it necessary to use certain copyrighted materials in their published works, it is the authors—not their publishers—who pay for the privilege of doing so. Yet, in stark contrast, Defendants, with ready access to billions in capital, simply stole Plaintiffs’ copyrighted works to build another billion+ dollar commercial industry. It is outrageous.”
It can be tricky to make predictions for the year ahead, but one fail-safe prediction I am willing to make is that lawsuits against OpenAI, Microsoft, and other generative AI providers will continue to pile up in 2024. Likely, we will see several more, both from struggling creatives and established enterprises. According to the many plaintiffs, large language models (LLMs) are “copyright theft machines”. In this post, we will look a bit closer at whether that characterization is fair.
The Input Phase Question
Before the New Year, I wrote a post about the regulation of AI model outputs. Considering the traditions of copyright law and the development of foundation models, it seems clear to me that liability for copyright infringement should, as a starting point, lie with the user of a model, not the provider. Indeed, that is also the legal framework big AI model providers have aggressively lobbied for.
I was a bit surprised to learn via Gary Marcus on Substack how blatantly Midjourney and DALL-E 3 can recreate trademarks and copyrighted works from their training data. The first image below shows an image created by Midjourney, alarmingly similar to the iconic still image from the Joker movie. The second image shows how DALL-E 3 can regenerate SpongeBob SquarePants, even when not explicitly prompted to do so.
Why and how these infringements occur is difficult to answer. To quote Gary Marcus and Reid Southen in IEEE Spectrum:
“Such questions are hard to answer with precision, in part because LLMs are “black boxes”—systems in which we do not fully understand the relation between input (training data) and outputs. What’s more, outputs can vary unpredictably from one moment to the next. The prevalence of plagiaristic responses likely depends heavily on factors such as the size of the model and the exact nature of the training set. Since LLMs are fundamentally black boxes (even to their own makers, whether open-sourced or not), questions about plagiaristic prevalence can probably only be answered experimentally, and perhaps even then only tentatively.”
However, even though we don’t fully understand how generative AI models work under the hood, there are a few things we do understand. We know that a reproduction of the model’s training data takes place during the training process. Here is how a 2020 report from the U.S. Patent & Trademark Office, Public Views on Artificial Intelligence and Intellectual Property Policy, puts it:
“The ingestion of copyrighted works for purposes of machine learning will almost by definition involve the reproduction of entire works or substantial portions thereof.”
However, that doesn’t mean that AI training is copyright-infringing by default. Under US law, this will be determined under the fair use doctrine in 17 U.S.C. § 107. The following four factors are weighty considerations in this assessment:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.
Under EU law, I believe the corresponding test would be whether the AI training is considered a “temporary act of reproduction” under Article 5(1) of the Infosoc Directive. Temporary acts of reproduction are exempted from the copyright holder’s exclusive reproduction right in Article 2 under the following circumstances:
the temporary reproduction is “an integral and essential part of a technological process”;
whose sole purpose is to enable either (a) “a transmission in a network between third parties by an intermediary”, or (b) “a lawful use of a work”; and
which has no “independent economic significance”.
Whether AI training falls under the fair use exemption in US law, or is exempted from the right holder’s exclusive right to reproduce his or her work under EU law, can be debated both ways. Another important point, one I think was often overlooked in the coverage of The New York Times lawsuit against OpenAI and Microsoft, concerns whether the copyright-protected material was lawfully acquired in the first place.
Public vs. Private Training Data