The New York Times vs OpenAI: the next frontier for fair value
In our last article on AI, Next on your watchlist: LLM-driven content discovery, we looked at how Large Language Models (LLMs) and Generative AI (Gen AI) can improve the customer experience of streaming services and allow smaller players to rival industry leaders economically. Whilst Gen AI will undoubtedly benefit the creative industries, the landscape is evolving rapidly, and parts of the media sector have been raising concerns. One is the use of copyrighted material by Gen AI providers to train LLMs, an issue flagged by many in the news industry. A recent high-profile development is the New York Times (NYT) announcing that it is suing OpenAI and Microsoft for alleged copyright infringement, seeking ‘billions of dollars’ in damages.
Understanding the controversy: NYT vs OpenAI
One of the key stages in building an LLM is training it on a vast dataset of text from books, websites, and articles. As a basic rule of thumb, the more data an LLM is trained on, the more powerful it becomes. This has led to a training-data arms race, with Gen AI companies looking to train their LLMs on ever larger and more sophisticated datasets. ChatGPT, whose GPT-4 model is widely regarded as the most powerful LLM, was reportedly trained on text databases from the internet totalling 300 billion words, or around 570 GB of data.
The NYT's lawsuit hinges on the allegation that the Gen AI companies trained their LLMs on ‘millions’ of copyrighted articles, including NYT articles that sit behind a paywall and cannot be accessed for free. OpenAI argues that this is ‘fair use’ because it serves a ‘transformative’ purpose. It also claims that multiple groups, including academics, have told the US Copyright Office that the law should permit training models on copyrighted content in order to enable AI innovation and investment (approaches to copyright exceptions vary in other jurisdictions, including the UK and EU).
In addition, OpenAI allows online content providers to ‘opt out’ of having their data used to train ChatGPT, and notes that the NYT has opted out since August 2023 (although the core model underpinning ChatGPT appears to have been trained before the NYT was aware of the issue). The NYT argues that the outputs of Gen AI models closely mimic, and compete with, the inputs used to train them. On this view the outputs are substitutes for the original content, so ‘fair use’ does not apply.
The legal and policy landscape: a race against time
Policymakers worldwide face a challenge in keeping pace with the rapid evolution of Gen AI technologies. The focus to date has been on AI safety and responding to risk, through initiatives including the United States' October 2023 Executive Order, the UK's ‘pro-innovation’ AI white paper, and the EU AI Act. The UK Intellectual Property Office has been examining AI copyright questions, initially proposing a broad copyright exception for text and data mining to facilitate LLM training. But the UK government rowed back after the creative industries raised concerns, and it is now facilitating industry roundtables with the ambition of reaching a code of practice on copyright and AI, ‘to enable the AI and creative sectors to grow in partnership’.
These governmental efforts reflect a delicate balancing act between three aims: crafting a robust, equitable approach to regulation that safeguards users from potential AI-related harms; incentivising creation and innovation in the creative industries; and fostering AI innovation so as to position their nations as global AI powerhouses. The strategies of the US and UK, in particular, reveal a keen awareness of this balance, as both vie to maintain their leadership in AI innovation amid significant investment by China, which is seeking to challenge their dominant roles in the global AI landscape.
For the news sector, Gen AI is the latest digital disruption
Many argue that news is the media sector that has found the transition from traditional to online media the most challenging, citing the dramatic decline in print advertising revenues as digital advertising has grown (there’s a lot of complexity and nuance here, but overall, it’s unarguable that newspaper business models have struggled since the explosion of the internet). Nonetheless, some premium news outlets have begun to see revenue growth after investing substantially in digital-first strategies, primarily through digital subscription models. The NYT exemplifies this trend (as shown in the figure below). However, the advent of Gen AI presents a new challenge to this business model. Separately, of course, AI also gives news providers the chance to use new tools in their products and in the newsroom.
Figure: Selected news provider revenues, 2008-2022, indexed on 2008=100
Although the disruption has been significant since the mid-2000s, legal frameworks governing negotiations between news providers and digital platforms are much more recent. These frameworks, introduced in Australia, the EU, Canada and now the UK, apply to commercial negotiations between news providers and online platforms with significant bargaining power, often with backstop arbitration if commercial agreements cannot be reached. We anticipate lawmakers will adopt a similar ‘commercial first’ approach for LLM content licensing. However, the timing is important – given the rapid development of Gen AI, both news providers and AI companies will be vigilant about any precedents set.
How we can help
This rapidly evolving landscape requires expert navigation. Oliver & Ohlbaum Associates are the media industry’s leading advisers on fair value issues. Our methodology is grounded in over 20 years’ experience designing regulatory frameworks and supporting commercial deals and arbitration. From advising on the sustainability of the news sector to the contribution of digital platforms to digitalisation in the creative industries, to valuing carriage deals for TV channels and VOD, sports rights valuations, music rights payments and setting CMO tariffs, our fair value expertise across the sector is unique.