The lawsuit filed by the New York Times (NYT) against OpenAI and Microsoft for copyright infringement pits one of the great media institutions against the purveyor of a transformative new technology. Symbolically, the case promises a clash of the titans: labour-intensive human newsgathering against push-button information produced by artificial intelligence. But legally, the case represents something different: a classic instance of the lag between established law and emerging technology.
Copyright law, a set of rules that dates back to the printing press, was not designed to cover large language models (LLMs) like ChatGPT. It will have to be consciously evolved by the courts to fit our current circumstances.
The key legal issue in the case will be the doctrine known as fair use. Codified in the Copyright Act of 1976, fair use tells you when it’s acceptable to use material copyrighted by someone else. The fair use test has four factors. Educational and non-profit uses are more likely to be found to be fair use. Creative work gets more copyright protection than technical writing or news. The amount of the work that has been copied matters, as does how central the copied material is to the original work. And perhaps most important for the NYT lawsuit, courts also consider whether the copying will harm the present or future market for the work copied.
Once you know the law, you can guess roughly how the legal arguments in the case are going to go. NYT will point to examples where a user asks a question of ChatGPT or Bing and it replies with something substantially like an NYT article. The newspaper will observe that ChatGPT is part of a business and charges fees for access to its latest versions, and that Bing is a core part of Microsoft’s business. NYT will emphasise the creative aspects of journalism. Above all, it will argue that if you can ask an LLM-powered search engine for the day’s news, and get content drawn directly from the NYT, that will substantially harm and maybe even kill the newspaper’s business model.
But OpenAI and Microsoft will be prepared for this. They’ll likely respond by saying that their LLMs don’t copy; rather, they learn and make statistical predictions to produce new answers. If I read an article in NYT and then write a Bloomberg opinion column on the same topic, that isn’t copyright infringement, even though I may have learned a great deal from the NYT piece and relied on that information to form my own opinion. For this reason, many copyright experts have been theorising that it cannot be a copyright violation for an LLM to learn from existing online material, even if it’s under copyright. The defendants can also be expected to argue that news consists of facts and should therefore be treated more permissively than creative material.
But Microsoft and OpenAI will have a hard time refuting the final point — that their product, which relies on newsgathering businesses like the NYT, will harm those businesses. ChatGPT and other LLMs cannot go out into the world to gather and vet new facts. They are restricted, for the foreseeable future, to “learning” from information that has already been published.
It follows that for LLMs to provide useful information, someone else — that is, a human — must first gather the information, ascertain that it is accurate, and publish it. This is the essence of newsgathering. It’s costly to get it right.
What’s more, to know that we can rely on news, we need it to come from an institution that we can trust — one with a track record and a reputation it has a business interest in upholding. Otherwise, we would not have news. We would have an iterative echo chamber untethered from reality.
Here is where the fundamental public interest in the maintenance of the free press becomes relevant to the fair use question. If you can get information more cheaply from an LLM than from NYT, you might drop your subscription. But if everyone did that, there would be no New York Times at all. Put another way, OpenAI and Microsoft need NYT and other news organisations to exist if they are to provide reliable news as part of their service. Rationally and economically, therefore, they ought to be obligated to pay for the information they are using.
Fitting this powerful public interest into copyright law won’t be simple for the courts. Literal copying is the easiest form of infringement to punish. In ordinary legal circumstances, if LLMs change words sufficiently to be summarising rather than copying, that weakens NYT’s case. Yet summaries in different words would still be sufficient to kill NYT and similar organisations — and leave us newsless.
The courts will need to be attuned to all this. The news infrastructure is already tottering. If we destroy it altogether, democracy will be the loser.