Martin Paul Eve

Professor of Literature, Technology and Publishing at Birkbeck, University of London and Technical Lead of Knowledge Commons at MESH Research, Michigan State University

Some remarks that will be presented at the SHARP plenary roundtable: AI in the Communications Circuit.

Last week, the CEO of Microsoft’s AI division said, in an interview with CNBC, that “I think that with respect to content that’s already on the open web, the social contract of that content since the ‘90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been ‘freeware,’ if you like, that’s been the understanding”. This reveals a deeply problematic understanding of copyright with respect to training data in large language models and poses a substantial risk to participation in open cultural systems, such as Wikipedia.

As a first note: it is possible, even probable, that US courts will rule differently from EU courts on the use of in-copyright training data for LLMs. The US recognises "transformative use" as a potential ground for a "fair use" defence when the proposed use is very different from the original. The EU and other jurisdictions have no such criterion. It is therefore possible that US courts will rule the use fair on transformative grounds while EU courts will rule that there is no such defence. How single jurisdictions might block LLMs that have been trained in a way they deem illegal remains to be seen.

For many years, though, open culture advocates have been arguing for open licensing on cultural works, precisely so that others can take the work and do unexpected things with it. Examples in my own experience include format-shifting works (two of my books have had fan-made conversions to other formats) and amateur translations. But there has always been a subset of people for whom these unexpected re-uses are not a bonus, but a concern.

With the growth of machine learning and AI systems trained on in-copyright material, we are likely to see a backlash against open cultural licenses and platforms, because such platforms will be seen as open fodder for AIs, which many humanities scholars, say, have only encountered as aids to plagiarism and cheating. Wikipedia's content, for instance, is released under the Creative Commons Attribution-ShareAlike 4.0 license, which means that you are free to re-use it so long as your downstream re-use is licensed under the same terms. This does not, of course, preclude fair or transformative uses, but it is a strong indication that, if AIs are trained on Wikipedia, they should themselves be releasing material under the CC BY-SA license.

And so the irony goes: large language models were supposed to liberate content from its original silos, even though their application as discovery and search tools may be a mistake. They are, after all, statistical language models, not fact-retrieval systems. But I predict that, instead of such tools encouraging open culture and heralding a new method of data retrieval, we will see a mass backlash against open cultural licensing and a retreat into strong notions of intellectual property that do not necessarily benefit the world, particularly in academic and scientific contexts.

In short: the actions of the large corporations that are training AI models on in-copyright material are provoking a backlash against the idea of open dissemination on the internet and web. This, in turn, will mean that such systems are trained on data that contains falsehoods and is of potentially lower quality, rather than on the highest calibre of material. As a result, such systems might never live up to their promise. For some people, this is a desired outcome, but it also means that the general population will be unable to read the highest quality material without paying often unaffordable rates. Basically, we end up penalising society and culture just so that we can thwart the rise of the machines.