Books, AI, and the Public Good: A New Grant
A Mellon-funded project to develop an ethical, public-interest way to incorporate books into artificial intelligence
by Dan Cohen
I’m delighted to report that the Mellon Foundation has awarded $350,000 to Authors Alliance and my library, the Northeastern University Library, to convene writers, publishers, librarians, technologists, and other stakeholders to explore how books can be incorporated into artificial intelligence ethically and productively, in a way that serves the public. In the coming year, we will consider a range of issues related to AI and books and develop a plan for a public-interest training commons.
Subscribers to this newsletter will recognize that this project is a fulfillment of the idea that I sketched out with Dave Hansen, the executive director of Authors Alliance, in “Books are Big AI’s Achilles’ Heel”:
The major AI vendors have sought to tap into this wellspring of human intelligence to power the artificial, although often through questionable methods…As the bedrock of our shared culture, and as the possible foundation for better artificial intelligence, books are too important to flow through these compromised or expensive channels. What if there were a library-managed collection made available to a wide array of AI researchers, including at colleges and universities, nonprofit research institutions, and small companies as well as large ones?…
A library-led training data set of books would diversify and strengthen the development of AI. Digitized research libraries are more than large enough, and of substantially higher quality, to offer a compelling alternative to existing scattershot data sets. These institutions and initiatives have already worked through many of the most challenging copyright issues, at least for how fair use applies to nonprofit research uses such as computational analysis.
Soon after the publication of that piece, we heard from the Mellon Foundation that they were interested in pursuing the idea, and we are thankful to the foundation for this generous funding.
Thinking carefully about the relationship between books and AI is as important as ever. As Bloomberg recently reported, even the big AI vendors are realizing that the lack of research-library-scale book collections is a problem:
The technology that underpins ChatGPT and a wave of rival AI chatbots was built on a trove of social media posts, online comments, books and other data freely scraped from around the web. That was enough to create products that can spit out clever essays and poems, but building AI systems that are smarter than a Nobel laureate — as some companies hope to do — may require data sources other than Wikipedia posts and YouTube captions…
The companies are facing several challenges. It’s become increasingly difficult to find new, untapped sources of high-quality, human-made training data that can be used to build more advanced AI systems.
As Dave and I noted, the books that have been used in AI training so far are scattershot or have come from suspect sources, and are likely counted in the tens of thousands rather than millions.
But this new project is about much more than the issues faced by these companies. Indeed, the project should give us a chance to redirect some attention away from ultra-large language models and generative AI to small language models and non-generative use cases for AI, to AI that isn’t about producing any kind of text on the fly but about the many other aspects of research and learning. To smaller-scale, community-based AI tool building.
Readers of this newsletter know that I’ve been skeptical of AI as a substitute for human creativity, especially writing. But I’m bullish on AI for other purposes that I see within libraries and the educational enterprise more generally. For instance, what can book-informed AI do for the creation of library metadata and comprehensive search? How can AI help locate works for careful human reading rather than summarization? Can AI help with the use of special collections that are not yet indexed?
More productive concepts will emerge as we step back from the overriding focus on the large AI models. Other important issues await. From the full grant announcement on the Authors Alliance website:
We seek to answer several key questions, such as:
- What are the right goals and mission for such an effort, taking into account both the long and short-term;
- What are the technical and logistical challenges that might differ from existing library-led efforts to provide access to collections as data;
- How to develop a sufficiently large and diverse corpus to offer a reasonable alternative to existing sources;
- What a public-interest governance structure should look like that takes into account the particular challenges of AI development;
- How do we, as a collective of stakeholders from authors and publishers to students, scholars, and libraries, sustainably fund such a commons, including a model for long-term sustainability for maintenance, transformation, and growth of the corpus over time;
- Which combination of legal pathways is acceptable to ensure books are lawfully acquired in a way that minimizes legal challenges;
- How to respect the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation; and
- How to distinguish between the different needs and responsibilities of nonprofit researchers, small market entrants, and large commercial actors.
Dave and I will be discussing the project this Monday, December 9, 2024, at the Coalition for Networked Information meeting in Washington, D.C., on a panel on library collections and AI that includes Mike Furlough, the Executive Director of HathiTrust; Claire Stewart, the Dean of Libraries at the University of Illinois at Urbana-Champaign; Günter Waibel, the Executive Director of the California Digital Library; and Suzanne Wones, the University Librarian of the University of California, Berkeley. Although the meeting is not open to the public, CNI will make a video recording available afterwards.
I hope that many readers of Humane Ingenuity will be interested in the progress of this AI + books project, and I’ll continue to write about it here throughout 2025.