AI Is Coming for Scholarship Next
AI models are now ingesting scholarly content in an attempt to dispel their hallucinations. But another possibility looms: that AI will instead drag down scholarship into its muddy realm.
by Dan Cohen
[A still from the video artwork "Signs Facing the Sky" by Allora & Calzadilla, from the DVD-R held in UC San Diego's Special Collections & Archives.]
In his most recent and perhaps strangest change to Twitter, Elon Musk curtailed access to the service itself, reducing the number of tweets users can view in a day and requiring everyone to log in to see tweets that were once publicly available. He justified these new limits, which seem at odds with an advertising-based business that seeks the largest possible audience, by pointing to the tech obsession of our time, artificial intelligence — specifically, its insatiable need for large amounts of text and images on which to train its powerful automated systems. Bots are out there copying collections of writing, art, and photographs, Musk fumed, and Twitter — like Reddit before it, with its forums providing millions of posts for these new machines to chew on — shouldn’t be a free meal for the hungry maw of AI.
There likely were more mundane reasons for Musk’s sudden restrictions, including a degradation of Twitter’s infrastructure after relentless layoffs and a failure to pay its bills, but Musk is not wrong about the growing need for human-created outputs to input into AI models. The more data these models acquire, the stronger they become.
However, in the quest for these training sources, many AI purveyors now want to look in a very different direction than low-quality tweets: the high-quality, peer-reviewed, deeply researched works of academia. Forget the dross of social media and web forums; AI is coming for the gold standard of scholarship next — the articles of academic journals, the books on the shelves of research libraries, the data recorded in university labs.
At this spring’s meeting of the Coalition for Networked Information, which brings together librarians, IT professionals, researchers, and publishers to discuss cutting-edge technology and its impact on academia, a plenary session focused on the proliferation of AI tools now operating on scholarly works. Apps that have ingested millions of academic articles are starting to provide much more advanced and tailored capabilities than older search engines. Soon these tools will diminish the need to search the vast sea of publications at all: they will simply summarize the research on a topic for you in an instant. Why trudge through hundreds of papers, PDF by PDF, when AI can pre-digest it all for you?
AI gadgets may replace other key elements of academic research. Systematic review, in which a scholar aggregates and blends the results from multiple publications, such as every study that examines air quality and the prevalence of asthma, can be a very time-consuming process. In the near future, these potent syntheses might be possible at the click of a mouse. Feed Iris.ai just 10 to 20 articles from your target subject area, and it will offer you an entire suite of AI tools that use natural language processing to automate all forms of research drudgery, from extracting relevant data to alerting you to new work in your field.
These emerging AI tools are tightly focused on academia, but the specialist apps will be closely followed by the generalists. Tireless responders to any request, such as OpenAI’s ChatGPT and Google’s Bard, are coming for scholarship too. Indeed, by their own admission, they have already consumed a range of academic articles.
But that is undoubtedly just the start. Big AI will be getting additional, and perhaps unintentional, help from a decades-long trend in academia toward “open access,” which has pushed articles out from behind pricey paywalls. This movement, cutting in the opposite direction from Musk and Twitter, is coming to fruition just as Silicon Valley is finding great value in gigantic collections of language, data, and images. Governments in many countries increasingly mandate that publicly funded research be made open to the public. A year ago, the White House Office of Science and Technology Policy, led by Alondra Nelson, issued a memorandum formalizing this transition from closed to open in the U.S.: “Ensuring Free, Immediate, and Equitable Access to Federally Funded Research.” The same ethic that is opening up scientific articles is also pressing scientists to expose their underlying data sets.
This is an important and virtuous moral stance. Expanding global access to academic research means that colleges with few resources, as well as independent scholars and the merely curious, have the same access to the latest ideas and studies as the largest, richest universities. But a side effect of this push toward openness is now clear: whatever the public has access to can be devoured by big tech companies too, for free.
Although a smaller share of academic books than of articles is openly available, that number is also growing. AI vendors may not be waiting for a White House memo to include these volumes in their models. OpenAI admits to training its GPT models on books, although the company is evasive about whether this collection includes in-copyright books as well as those in the public domain. (Last week it was sued by two authors who believe their books have been illegally ingested.) Google’s herculean project to digitize entire libraries, Google Books, once looked like an expensive passion project of Google co-founder Larry Page; in the age of AI, it now looks like a relatively cheap source of rich, accurate text. (One assumes Google will also be sued for this use, although a failed lawsuit against Google Books may work against the plaintiffs.)
AI enterprises will surely have their vacuums at the ready as more of academia is exposed to the open internet. What could be a better way of improving the results achievable with current AI models than to attach a large language model to the massive output of research universities? Last year, people complained that ChatGPT hallucinated references to entirely imaginary academic articles. Maybe next year it will be able to point directly to real scientific papers to back up its conclusions, as the new AI tool Scite.ai already does. By using the research products of academia as inputs, AI can swiftly raise its standards.
But a very different and unsettling possibility looms on the horizon: that AI will instead drag down the quality of scholarship, pulling it into the muddy realm of tweets and Reddit posts.
The AI toolmakers that are operating on large masses of articles, books, and data to generate better and faster academic research have noticed that, like ChatGPT, they can create scholarly writing in an automated or semi-automated way. Writefull, which has a motto that makes professors cringe, “Academic writing is hard,” will make it rather easy for you to crank those papers out, extruding erudite-sounding prose on the subject of your choosing. Trinka will seamlessly correct grammatical errors in the complex technical writing prevalent in academia. Both are trained on millions of scholarly papers.
In hallway conversations at the CNI meeting, attendees had no trouble connecting the AI dots: point these tools at your digital lab notebook or other research sources, have AI summarize the existing literature in the field, and then auto-generate drafts of the standard sections of an academic paper. Maybe a little light editing and you’re done! Could the traditional academic struggle of “publish or perish” become a painless series of clicks?
Such chatter went from possibility to probability, and then to profound concern about the future of academia, a month later at the annual meeting of the Society for Scholarly Publishing. The dark title of SSP’s plenary session: “Resolved: Artificial Intelligence Will Fatally Undermine the Integrity of Scholarly Publishing.”
Tim Vines, the founder of DataSeer and a former editor of the journal Molecular Ecology, envisioned a doomsday scenario. “It will soon become almost impossible to distinguish the products of Artificial Intelligence from products made by humans. Unscrupulous researchers will be able to conjure up convincing research articles without the trouble of picking up a pipette,” he lamented. Worse, because of the move toward open access and the need to replace the revenue from paywalls, many scholarly journals now charge researchers to publish their articles, which strongly inclines them to accept papers rather than reject them. “Scholarly publishing has a dirty secret,” Vines continued. “A substantial fraction of the industry prefers not to ask awkward questions of their authors. These publishers are instead happy to receive the author fee in return for publishing the article and have no incentive to weed out plausible fake articles. Why would they?”
Maybe AI tools can help to combat their unethical counterparts? SciScore seeks to improve the reliability of scientific papers by analyzing their methods and sources, producing a set of reports for editors, peer reviewers, and other scientists who want to reproduce an experiment. Ripeta uses AI trained on over 30 million articles to identify “trust markers” within a paper’s dense text. Using new AI computer vision tools, Proofig takes aim at falsified images within academic work.
But fighting AI with AI assumes care and attention, resources that are increasingly scarce in academia. As scholarly publishers will admit, peer reviewers are harder and harder to come by as journals proliferate and the pressures on every professor’s time grow. It’s more productive to crank out your own work than to correct the work of others. Professors who are concerned about their students using ChatGPT to create plausible-sounding essays might not look over their shoulders at their own colleagues, who are using more sophisticated tools to do the same thing.
If they — and we — fail to stem the tide of AI-generated academic work, that very work will come into question, and one of the last wells of careful writing, of deep thought, of debate supported by evidence, might be fatally poisoned.