Asking Good Questions Is Harder Than Giving Great Answers
The tests we are using to assess the intelligence of AI are missing an essential aspect of human inquiry — the query itself
by Dan Cohen

Recently, I sharpened a #2 pencil and took the history section of “Humanity’s Last Exam.” Consisting of 3,000 extremely difficult questions, the test is intended for AI, not me. According to its creators and contributors, Humanity’s Last Exam will tell us when artificial general intelligence has arrived to supersede human beings, once a brilliant bot scores an A.
I got an F. Actually, worse than that: Only one of my answers was correct, and I must admit it helped that the question was multiple choice. This is fairly embarrassing for someone with a PhD in history.
What happened? Let me indulge in a standard academic humiliation-avoidance technique: examining the examiners. A much easier exercise. Of the thousands of questions on the test, a mere 16 are on history. By comparison, over 1,200 are on mathematics. This is a rather rude ratio for a purported Test of All Human Knowledge, and a major demerit in this human’s assessment of the exam.
The offense extends further to the historical topics covered. Of the 16 history questions, four of them — 25% of historical understanding! — are about naval battles. My knowledge of the displacement of various warships is admittedly weak. Other questions are byzantine, alas not literally but figuratively: long narrative journeys with twists and turns that are clearly trying to confuse any AI by flooding its memory with countless opaque terms. Those questions certainly succeeded in confusing me.
I will not be reproducing the history questions here since the creators of Humanity’s Last Exam don’t want AI to have a sneak peek at the questions ahead of taking the test. Of course, this raises another question: Would a true superintelligence cheat? I feel like it would? If you, presumably a human reader, want to take the test yourself, you can find a database of the questions on Hugging Face and GitHub. I should also note that I did not take the “classics” section of the exam, as I am a historian of the modern era and do not know Latin, Greek, etc., but much of that section is history too, perhaps because there were also naval battles in the ancient world.
* * *
Although I failed Humanity’s Last Exam, I did learn something about the current state of our assessment of AI, and what we expect from it. HLE’s implicit definition of “intelligence” is the ability to provide correct answers to complicated questions, and it is just one of many similar exams. Another, less naval-gazing test of historical knowledge is based on a comprehensive global history database, but still relies on question-answer pairs so it can provide numerical scores for each LLM’s ability. Upon the release of their latest models, AI companies tout improvements on these assessment tools, which allows them to proclaim definitive AI progress: “This LLM got a 92% on a PhD-level history exam, up from 56% last year!”
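For the curious, the mechanics behind such a score are simple enough to sketch. What follows is only a generic illustration of the question-answer benchmarking loop, with an invented item and a stand-in model function, not the actual harness used by HLE or the history database:

```python
from typing import Callable

# A benchmark item is just a question paired with its expected answer.
QAItem = dict[str, str]

def score(dataset: list[QAItem], ask_model: Callable[[str], str]) -> float:
    """Return the percentage of exact-match answers over question-answer pairs."""
    correct = sum(
        1
        for item in dataset
        if ask_model(item["question"]).strip().lower()
        == item["answer"].strip().lower()
    )
    return 100 * correct / len(dataset)

# A toy run with one invented item and a deliberately clueless "model":
toy_items = [
    {"question": "What was the displacement of warship X?", "answer": "10,000 tons"},
]
print(score(toy_items, lambda q: "I do not know"))  # prints 0.0
```

The headline percentage is just that ratio of matches; nothing in the loop asks whether the questions themselves were worth posing.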
And the companies are not wrong about genuinely impressive improvements. Six years ago in this newsletter, I wrote about some initial testing I had been doing with computer vision APIs from Google and Microsoft, a first attempt to analyze the photo morgue my library had recently acquired from the Boston Globe. There were glimmers of hope that these pre-GPT tools could help us identify topics in millions of photographs that lacked rigorous metadata, and I found even 80% accuracy to be promising. Now our library’s digital team, much more capable than I am, has created an abstracted interface to all of the main multimodal AI services and is testing the ability of these services to provide subject headings and descriptions, with much better results (although all of the services are still imperfect).
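I have not seen the team’s code, so the following is only a guess at the general shape of such an abstraction layer, with every name invented for illustration: a common wrapper that each provider implements, so the same photograph can be run through every service and the proposed subject headings compared side by side.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class PhotoMetadata:
    subjects: list[str]   # proposed subject headings for the photograph
    description: str      # a short natural-language description

class VisionService(ABC):
    """Common interface that each multimodal provider wrapper implements."""

    @abstractmethod
    def describe(self, image_bytes: bytes) -> PhotoMetadata:
        """Send the image to the underlying service and parse its response."""

def compare_services(
    image_bytes: bytes, services: dict[str, VisionService]
) -> dict[str, PhotoMetadata]:
    """Run one photograph through every wrapped service for side-by-side review."""
    return {name: service.describe(image_bytes) for name, service in services.items()}
```

Keeping the provider-specific calls behind one shared interface is what makes it practical to judge which service is least imperfect for a given collection.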
Fellow historian Benjamin Breen has documented similar advances in his testing of AI. The latest models are scarily on par with a first-year doctoral student in history in some areas, able to provide solid context and advanced interpretations of documents and images, even complex ones that require substantial background in a field. The frontier models are much better than most doctoral students at other tasks, such as translation and transcription. Handwriting recognition for historical documents, in particular, has been among the hardest problems for computer scientists to solve, and cracking it will have a significant impact on historical research. Historian Cameron Blevins has shown that custom GPTs are now on a path to a solution that could make archives and special collections much more searchable and readable in ways that might transform our ability to do history. What these other tests of artificial intelligence show is that significant AI progress may lie not in some kind of examination endgame of perfect answers to tough questions, but in the important but often hidden middle stages of a research project, when evidence is being assembled and interpreted.
* * *
Even more obscured right now in the conversation about AI and intelligence is that PhD-level work is not just about correct answers. It is more about asking distinctive, uncommon questions. Ultimately, we may want answers, but we must begin with new queries, new areas of interest. Along the way to a better understanding of the past and present, good questions in history may eventually require accurate translations of inscriptions or the location of sea skirmishes. But first, we must imagine why someone, today, should care about such documents and events in the first place, envision how they may have shaped our world. This is a much bigger challenge.
The most vibrant historical studies begin with questions that are unexpected and therefore have revelatory power. Recently in this newsletter, for instance, I covered a book that originated with the seemingly simple query, “Why did audiences at orchestral performances become silent when previously they were rowdy?” Before I read Listening in Paris, I had naively assumed that respectful quiet had always been the proper behavior at a concert. By asking this curious question, James Johnson was able to unveil a major change in the nature and relationship of music, composers, and audiences that still resonates today, even if our musical tastes have largely changed.
Other books that have influenced me originated with equally novel questions. Why, over a relatively short period of time, did the British radically change their view of some animals, like dogs, from unkempt wild beasts to delightful members of the household, proudly coiffed and paraded at dog shows? Why did Isaac Newton, the paragon of modern science, write more on alchemy than he did on physics or math? How does the experience of war — not the abstract tactics of naval battles but the actual first-person experience — profoundly change individual soldiers and then, in aggregate, an entire culture?
Can AI ever produce good questions in history rather than great answers? I’ll tackle that important question in another newsletter.