For most of society, the Internet was nothing short of revolutionary. Knowledge previously trapped in books or available only in particular localities could suddenly be placed on a computer, converted to strings of 1s and 0s, and distributed around the world at the speed of light. This reduction in information costs unlocked tremendous value worldwide, making education cheaper and more accessible, connecting firms with customers and resources, and easing communication between groups of people around the globe. The promise of this revolution was memorably captured by the startup-era battle cry of “Information Wants to be Free.”
But for the copyright community, the Internet was a much more ominous development. Professor Dan Burk accurately captured these concerns when he labeled the Internet the World’s Biggest Copy Machine. Napster and other cutting-edge file-sharing technologies showed that copyright law was not ready for a world in which a protected work could be divorced from a physical medium. It took nearly a decade of litigation, negotiation, and evolution of social norms before the law—and the creative community—found a new equilibrium, clarifying that “free” meant libre but not necessarily gratis.
Generative AI is having a Napster moment. What the Internet did for information costs, the AI revolution could do for higher-level processes: analyzing and synthesizing vast amounts of data in ways that mimic human thought but at far greater speed, augmenting human capabilities to improve productivity, efficiency, creativity, and discovery. But as with Napster, the initial euphoria over the promise of these amazing new tools is beginning to wane as developers come to terms with the legal implications of generative AI use and misuse. Large Language Models (LLMs) in particular rely on enormous quantities of information as inputs, much of which is protected by intellectual property law. As AEI Nonresident Senior Fellow Michael Rosen has extensively chronicled, the New York Times and others have sued OpenAI, alleging that use of Times articles to “train” ChatGPT constitutes copyright infringement.
OpenAI argues that its use of such material is fair use. At a high level, LLMs use publicly available material (including copyrighted works) to learn how language works. Massive quantities of text are broken down to analyze the relationships between words. These relationships are encoded in neural networks (loosely inspired by the human brain) that predict the next word in a sequence, which can then be used to generate text in response to a prompt. It’s not copyright infringement for an aspiring author to read a body of science fiction, even if the reader then writes a novel in the style of, say, Frank Herbert. (Of course, the resulting novel could be infringing if, for example, it involves a messianic son of an assassinated patriarch who battles the Emperor for control of unique resources on the planet Rune.) OpenAI relies in part on the Google Books case, in which the court found it was fair use for Google to scan millions of copyrighted works to create a search tool that displays snippets of those works in response to user queries.
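To make the training intuition concrete, here is a deliberately simplified sketch of the next-word-prediction idea (in Python, and purely illustrative; the tiny corpus, function names, and counting approach are invented for this example): it tallies which words follow which other words in a small text sample and then uses those tallies to generate a new sequence. Production LLMs replace the raw counts with neural networks trained on billions of tokens, but the underlying task of predicting the next word is the same.

```python
import random
from collections import defaultdict

# Toy next-word predictor: count which words follow which other words,
# then sample from those counts to extend a sequence. Real LLMs learn
# these relationships with neural networks over vastly larger corpora.
corpus = (
    "information wants to be free . "
    "copyright law protects creative works . "
    "language models learn how language works from large bodies of text ."
)

transitions = defaultdict(lambda: defaultdict(int))
words = corpus.split()
for current, following in zip(words, words[1:]):
    transitions[current][following] += 1  # tally each observed word pair

def next_word(word: str) -> str:
    """Pick a plausible next word, weighted by how often it followed `word`."""
    candidates = transitions.get(word)
    if not candidates:
        return "."
    options, weights = zip(*candidates.items())
    return random.choices(options, weights=weights)[0]

def generate(prompt: str, length: int = 8) -> str:
    """Generate a short word sequence starting from a one-word prompt."""
    output = [prompt]
    for _ in range(length):
        output.append(next_word(output[-1]))
    return " ".join(output)

print(generate("language"))
```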
It’s far too early to predict the outcome of the Times case and other litigation. Just as Napster showed that the law was unprepared for copyrighted material divorced from a physical medium, generative AI shows that the legal system has not yet grappled with machines designed to mimic human learning and brain structure. AEI Nonresident Senior Fellow Clay Calvert is likely correct that in journalism and related spaces, litigation may result in payment flows to owners of copyrighted material used as part of an engine’s training materials. That’s ultimately how streaming music replaced illegal filesharing.
If so, it’s important to consider the consequences of such a settlement, some of which played out in the Napster saga. First, this licensing could become a barrier to entry for new players in the AI space. A prospective AI startup already faces significant costs in terms of GPUs and other hardware required to build an LLM. Payments to copyright holders when assembling a training set could create another bottleneck that limits the number of players. Second, as Ironclad CEO Jason Boehmig predicts, the need to pay for third-party materials could advantage companies with access to large proprietary data sets, which would reduce those training costs. Finally, it’s likely that existing AI tools will get worse before they get better, as models adjust to legal obligations.
These are not necessarily bad consequences. They are merely possible results as an emerging industry matures and the law catches up to technological advancement and produces a new post-disruption equilibrium. That equilibrium may look different for LLMs than for other AI engines such as image generators or generative music tools, just as music streaming looks different from image search. Generative AI is an exciting new frontier, but its story is just beginning.