by Dan Cohen
("The Library of the Distant Future," as envisioned by Midjourney, when I was let into the beta in March 2022.)
Before one can become a Cassandra or Pollyanna about the uses or abuses of impressive text-to-image AI tools like DALL·E and Midjourney, it is worth stepping back and reflecting on the fundamental nature of this new technology. What is it actually designed to do?
Just as text generators like GPT-3 are engineered to provide highly plausible sequential arrangements of words, these AI image generators are designed to meet our expectations, visually. This agreeableness is right there in the math, in the way these tools distill millions of images into a multidimensional array of the proximities of various styles and shapes. They angle to be familiar, and from what we have seen so far, they are succeeding.
Note that this familiarity and agreeableness doesn't mean they won't surprise us from time to time, or even delight us. Clever questioners have coaxed unusual outcomes out of these tools with creative incantations. But even those images in some way meet our expectations, as they must; they are internally structured, like a golden retriever, to be pleasing to the incantor. I will gladly admit that Midjourney's rendering of my request to conjure a "Library of the Distant Future" elicited an audible "wow" when it appeared. But then again, it also competently echoed the science fiction book covers of my childhood.
One can't be churlish about this. I applaud the creativity that has gone into the design of these new tools, and they can be great fun. But they also helpfully highlight, by contrast, the nature of truly creative art.
The best art isn’t about pleasing or meeting expectations. Instead, it often confronts us with nuance, contradictions, and complexity. It has layers that reveal themselves over time. True art is resistant to easy consumption, and rewards repeated encounters. Accomplished paintings challenge easy or unitary interpretations, like Mona Lisa's smile. The best books are worth reading multiple times, as we discover new elements and are affected differently each time we flip their pages.
As this new field of "AI art" develops, we should push for a higher-order Turing test: Are we inclined to view or read the outputs of these tools more than once, to ponder their deeper significance? Or, no matter how remarkable they may be, despite immediate, uncanny evocations of delight or humor or dread, do these images still exhaust their artistic reserves rapidly? If so, what does that tell us?
Complexity can be added to the machine; technologists are surely working on it. But the fundamental urge to meet expectations forms a major developmental barrier.
An AI text generator very well might spin a decent tale about a monomaniacal hunt for a white whale, perhaps even with copious Biblical references, given the right additional nudges, but would that work ever have the strange richness produced by a human writer who is familiar with the actual manual labor of whaling and able to find layers of meaning in those seemingly mundane processes? An AI music generator very well might create the chord progression and melody of a decent song, with lyrics about envy and marriage, but can it record the heartbreaking plaintiveness of "Jolene" without the human experience of Dolly Parton?
The desire of AI tools to meet expectations, to align with genres and familiar usage as their machine-learning array informs pixels and characters, is in tension with the human ability to coax new perspectives and meaning from the unusual, unique lives we each live. Dolly Parton and Herman Melville worked within genres, and had their own arrays of common references, but they also exploded them in ways that could not be anticipated. That is a different sort of delight, and art.