A Framework for Books and AI in the Public Interest
How libraries could create a thoughtful and ethical interface between their shelves and artificial intelligence
by Dan Cohen

Two years ago, Dave Hansen, the Executive Director of Authors Alliance, and I wrote “Books Are Big AI's Achilles' Heel,” a piece on how the leading AI companies may have unimaginable sums of money and vast data centers, but are badly in need of what humble libraries have in abundance: books. Those companies, of course, understood this weakness and were trying to fill in the gap in any way they could. There are now dozens of lawsuits by authors and publishers against these tech firms for downloading and storing digitized books from the sketchier corners of the internet.
Dave and I proposed an alternative pathway, spearheaded by libraries and oriented not toward commercial uses but toward the public good:
A library-led training data set of books would diversify and strengthen the development of AI. Digitized research libraries are more than large enough, and of substantially higher quality, to offer a compelling alternative to existing scattershot data sets. These institutions and initiatives have already worked through many of the most challenging copyright issues, at least for how fair use applies to nonprofit research uses such as computational analysis. Whether fair use also applies to commercial AI, or models built from iffy sources like Books3, remains to be seen.
Library-held digital texts come from lawfully acquired books — an investment of billions of dollars, it should be noted, just like those big data centers — and libraries are innately respectful of the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation. Furthermore, they have a public-interest disposition that can take into account the particular social and ethical challenges of AI development.
The Mellon Foundation funded a planning project to explore this idea, and we held workshops across the United States with librarians, scholars, technologists, authors, and publishers to imagine what such an initiative might look like, how it might function, and what it would take to bring it into existence. We’re delighted to release the final report from that yearlong study, The Public Interest Corpus: A Framework for Implementation, co-authored by Dave, Thomas Padilla, Giulia Taurino, and me.
From the introduction:
The rapid advancement of artificial intelligence represents one of the most significant technological transformations of the twenty-first century, with profound implications for research, education, creativity, and civic life. Yet the development and deployment of AI systems is increasingly concentrated among a small number of well-resourced technology companies. This concentration stems not merely from access to capital and resulting computational infrastructure advantages, but also from asymmetric and unregulated access to training data.
While access to large-scale datasets is the main prerequisite of state-of-the-art language models, scholars and researchers have drawn attention to the importance of data quality in textual corpora used for AI training. Many have pointed to the need for curated, high-quality datasets, especially from library collections, which contain humanity’s most comprehensive and editorially refined record of knowledge, culture, and expression.
Currently, many academic researchers are denied access to this data for their own AI research due to a variety of legal, technical, and financial constraints. Our work on this project demonstrated a need for publicly accessible, research-oriented, computation-ready textual corpora to support academic work and non-profit AI development. The Public Interest Corpus initiative responds to this existing imbalance and pressing need by leveraging the unique position of research libraries to expand access to books data for academic and nonprofit AI training and computational research, thus ensuring that less-resourced institutions and individuals can gain equitable access to valuable data sources.
The report outlines our sense of how to move forward with books and AI, and seeks to address some hard issues that emerged from in-depth conversations we held, such as copyright questions and the needs of different users.
Given more recent technical developments, such as the Model Context Protocol, we also believe that the Public Interest Corpus will be able to serve not just noncommercial AI researchers, but also a broader audience among the public, students, and scholars. For instance, as I have noted in this space over the last year, AI shows great potential for creating a new digital entryway to the library, improving access and discovery by locating relevant books better than current library systems. For many, their interaction with AI will end after this phase of discovery and access; these library patrons will go on to read the books they have found rather than train new AI models with them. We should be enabling and encouraging these lighter uses of vectorized books as well as the heavier, more complex applications. Additional use cases emerged over the last year involving not one book or a million books, but collections at intermediate scales — what one might do with ten, a hundred, or a thousand books as part of a course, thesis, or research topic.
In the report, we also map out how the Public Interest Corpus should:
provide a secure technical environment for accessing data, along with the means to authenticate users and mitigate potentially infringing behavior
continually refine its data to improve the quality of the information we hold about our books
encourage users to attribute books in their research through social and technical means
seek an environmentally sustainable infrastructure and mode of operations for its services
The team hopes that our report is a starting point rather than an endpoint, and we are currently working to make further progress toward implementation. My thanks again to the Mellon Foundation for generously supporting our work; to Dave, Thomas, and Giulia; to our helpful advisory board; and to the many people we spoke with in 2025, for collaborating on and advancing the Public Interest Corpus idea.