The Authors Guild Class Action Lawsuit Against OpenAI
A quick look at an important copyright case against OpenAI
The Authors Guild – the oldest and largest professional organization for writers in the US – is behind a recent class action lawsuit against OpenAI, filed in the Southern District of New York on September 19. The lawsuit is joined by some of the world's most renowned novelists, including Game of Thrones author George R.R. Martin, the great American novelist Jonathan Franzen, best-selling legal thriller author John Grisham, and the masterful short-story writer George Saunders. Among the 17 authors who have joined the lawsuit are:
Christina Baker Kline
Maya Shanbhag Lang
George R.R. Martin
It’s no secret that all the well-known generative AI models are built on unfathomable amounts of copyrighted material used as training data – material that no one is paying for, except unsuspecting end-users who pay handsome fees to the AI model providers without any way of knowing “how the cake is made”. The creative and artistic industries are now beginning to push back hard against this unfortunate development.
So far, there have been three other lawsuits this year against OpenAI concerning their unlicensed use of copyrighted text in GPT-3/GPT-3.5/GPT-4.
Earlier this month, Pulitzer Prize winner Michael Chabon, together with writers David Henry Hwang, Matthew Klam, Rachel Louise Snyder, and Ayelet Waldman, filed a case against OpenAI in San Francisco on similar grounds as the Authors Guild case.
In June, a class action lawsuit was filed by Paul Tremblay and Mona Awad in San Francisco, followed by another one in July, by Sarah Silverman, Chris Golden, and Richard Kadrey. Both of these cases are handled by the copyright lawyers Matthew Butterick and Joseph Saveri.
I have previously written an in-depth post about Butterick and Saveri's class action lawsuit against Stable Diffusion, Midjourney, and DeviantArt. I concluded with a high degree of confidence that the complaint would not set a precedent of any kind. The underlying technology was grossly misrepresented by the plaintiffs, and the legal argumentation was foggy and weak. The plaintiffs had not registered their copyrighted works, which would normally be a prerequisite for taking legal action in a case such as the one in question. Furthermore, the plaintiffs did not prove, or even try to prove, how any of their allegedly copyrighted art had been infringed by the AI image model.
In contrast, the class action lawsuit brought by the Authors Guild seems much more rooted in reality, and more promising. For instance, “generative AI” is defined in the complaint as “(…) systems that are capable of generating ‘new’ content in response to user inputs called ‘prompts’”. The main argument in the Stable Diffusion litigation was that Stable Diffusion “is a 21st-century collage tool that remixes the copyrighted works of millions of artists whose work was used as training data”. As all three defendants – Stability AI, Midjourney, and DeviantArt – pointed out in their responses, the outputs of Stable Diffusion are, objectively speaking, unique, with very little to zero resemblance to the images used in the training process. The Authors Guild does not spend any time on this issue, as it recognizes that generative AI systems generate "new" content.
Instead, the Authors Guild focuses on copyright infringement in the AI models' training process. They claim that the word “training” is “a technical-sounding euphemism for ‘copying and ingesting’”, as the process “by definition involve[s] the reproduction of entire works or substantial portions thereof”, with reference to a policy report by the U.S. Patent and Trademark Office.
Although OpenAI – notoriously, despite its name – has kept information about the training data for GPT-3, GPT-3.5, and GPT-4 a proprietary secret, the company has, according to the complaint, admitted that training data for its models was derived from copyrighted works. Furthermore, OpenAI has admitted that large language models (LLMs) “require[s] large amounts of data,” and that “analyzing large corpora” of data “necessarily involves first making copies of the data to be analyzed.” Finally, OpenAI has admitted that if it refrained from using copyrighted works in its LLMs’ “training,” it would “lead to significant reductions in model quality.” According to the plaintiffs, these statements from OpenAI amount to an open admission that reproducing copyrighted work is central to the quality of its products. In other words, an obvious infringement of copyright.
Up until very recently, ChatGPT could return verbatim quotations from books, but it has now been restrained from doing so by its programmers (likely because of the other lawsuits). However, it can still produce detailed summaries of books, which suggests that entire books were “ingested” during its training.
The Authors Guild speculates about whether "Books2" – an undisclosed data source OpenAI said it used in training GPT-3 – consists of ebook files downloaded from large pirate book repositories. It’s going to be very interesting to see how OpenAI responds to these claims. In all likelihood, OpenAI has indeed trained GPT-3, GPT-3.5, and GPT-4 on large amounts of e-books, unlicensed and freely downloaded from torrent sites or by similar methods. Otherwise, it’s hard to imagine how OpenAI could have collected such a huge amount of book training data. The disclosed size of the Books2 dataset is 55 billion tokens, which indicates that it comprises well over 100,000 books.
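As a rough sanity check on that figure, here is a back-of-envelope calculation. The words-per-book and tokens-per-word numbers are my own assumptions (a typical novel runs roughly 80,000–100,000 words, and common tokenizers produce on the order of 1.3 tokens per English word), not figures from the complaint:

```python
# Back-of-envelope estimate of how many books a 55-billion-token
# dataset could contain. Assumed figures: ~90,000 words per book,
# ~1.3 tokens per word (both are rough assumptions, not disclosed data).

def estimate_book_count(total_tokens: int,
                        words_per_book: int = 90_000,
                        tokens_per_word: float = 1.3) -> int:
    """Estimate the number of books in a text dataset of a given token count."""
    tokens_per_book = int(words_per_book * tokens_per_word)  # ~117,000 tokens
    return total_tokens // tokens_per_book

books2_tokens = 55_000_000_000  # disclosed size of the Books2 dataset
print(estimate_book_count(books2_tokens))
```

Under these assumptions the estimate lands in the hundreds of thousands of books, comfortably above the 100,000 mark cited above.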
In July, the Danish anti-piracy group Rights Alliance succeeded in taking down another dataset called Books3, which consisted of nearly 200,000 illegally copied books. The dataset was compiled by an independent AI researcher and open-source advocate, Shawn Presser. In response to the takedown, he said: “Without Books3, we live in a world where nobody except OpenAI and other billion-dollar companies have access to those books—meaning you can't make your own ChatGPT” (Gizmodo). Books3 is similar in size to Books2 and was used to train other AI models such as Meta’s LLaMA, Bloomberg’s BloombergGPT, and EleutherAI’s GPT-J (The Atlantic).
I believe that the Authors Guild and the 17 renowned novelists have a strong and important case against OpenAI. I am looking forward to following the future developments, both in this case and in the broader reckoning of generative AI models and artistic rights.
Reads of the Week
The AI Heretic - Aki Ito, September 18, 2023 (Business Insider)
Your NFTs Are Actually — Finally — Totally Worthless - Miles Klee, September 20, 2023 (Rolling Stone)
ChatGPT can now see, hear, and speak - OpenAI, September 25, 2023 (OpenAI)