A recent article in Computerworld argued that the output from generative AI systems like GPT and Gemini isn’t as good as it used to be. It isn’t the first time I’ve heard this complaint, though I don’t know how widely held the opinion is. But I wonder: Is it correct? And if so, why?
I think a few things are going on in the AI world. First, developers of AI systems are trying to improve the output of their systems. They’re more concerned (I’m guessing) with satisfying enterprise customers who can sign large contracts than with catering to individuals paying $20 a month. If I were doing that, I would tune my model toward producing more formal business prose. (That’s not good prose, but it is what it is.) We can say “don’t paste AI output into your report” as often as we want, but that doesn’t mean people won’t do it, and it does mean that AI developers will try to give them what they want.
AI developers are certainly trying to create models that are more accurate. The error rate has dropped noticeably, though it is far from zero. But tuning a model for a low error rate probably means limiting its ability to come up with the out-of-the-ordinary answers we find brilliant, insightful, or surprising. Accuracy is useful, but it comes at a cost: when you reduce the standard deviation, you cut off the tails. The price you pay for minimizing hallucinations and other errors is minimizing the correct, “good” outliers. I’m not arguing that developers shouldn’t minimize hallucinations, but there is a price to pay.
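The “cut off the tails” point is a statistical metaphor, but a toy calculation makes it concrete. The sketch below assumes a zero-mean normal distribution (my simplification, not anything model developers have published) and shows how quickly the probability of an extreme result collapses as the spread shrinks:

```python
from math import erfc, sqrt

def upper_tail(threshold: float, sigma: float) -> float:
    """P(X > threshold) for a zero-mean normal with standard deviation sigma."""
    return 0.5 * erfc(threshold / (sigma * sqrt(2)))

# Halving the spread makes extreme outputs rarer by orders of magnitude,
# not merely twice as rare.
print(f"sigma = 1.0: {upper_tail(2.0, 1.0):.4%}")   # roughly 2.28%
print(f"sigma = 0.5: {upper_tail(2.0, 0.5):.4%}")   # roughly 0.003%
```

The numbers are only illustrative, but the shape of the tradeoff is the point: squeeze the distribution toward “safe” answers and the rare, exceptional ones all but disappear.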
The “AI blues” has also been attributed to model collapse. I think model collapse will be a real phenomenon (I’ve even run my own very unscientific experiment), but it’s far too early to see it in the large language models we use. They aren’t retrained frequently enough, and the amount of AI-generated content in their training data is still relatively small, especially if their creators are engaged in copyright violation at scale.
But there is another possibility that is very human and has nothing to do with the language models themselves. ChatGPT has been around for almost two years. When it came out, we were all amazed at how good it was. One or two people pointed to Samuel Johnson’s prophetic statement from the 18th century: “Sir, ChatGPT’s output is like a dog’s walking on his hind legs. It is not done well; but you are surprised to find it done at all.”1 We were all amazed: errors, hallucinations, and all. We were astonished to find that a computer could actually engage in conversation, reasonably fluently, even those of us who had tried GPT-2.
But now it’s almost two years later. We’ve gotten used to ChatGPT and its fellows: Gemini, Claude, Llama, Mistral, and a horde of others. We’re starting to use GenAI for real work, and the amazement has worn off. We’re less tolerant of its obsessive wordiness (which may have increased); we don’t find it insightful and original (but we don’t really know whether it ever was). While it is possible that the quality of language model output has gotten worse over the past two years, I think the reality is that we have become less forgiving.
I’m sure there are many people who have tested this much more rigorously than I have, but I’ve run two tests on most language models since the early days:
- Writing a Petrarchan sonnet. (A Petrarchan sonnet has a different rhyme scheme from a Shakespearean sonnet.)
- Implementing a well-known but nontrivial algorithm correctly in Python. (I usually use the Miller-Rabin primality test.)
The results of both tests are surprisingly similar. Until a few months ago, the major LLMs couldn’t write a Petrarchan sonnet; they could describe a Petrarchan sonnet correctly, but if you asked them to write one, they’d botch the rhyme scheme and usually give you a Shakespearean sonnet instead. They failed even if you included the Petrarchan rhyme scheme in the prompt. They failed even if you tried it in Italian (an experiment one of my colleagues performed). Suddenly, around the time of Claude 3, models learned how to do Petrarch properly. It gets better: recently I thought I’d try two more difficult poetic forms, the sestina and the villanelle. (A villanelle involves repeating two of the lines in clever ways, in addition to following a rhyme scheme. A sestina requires reusing the same end words.) They could do it! They’re no match for a Provençal troubadour, but they did it!
I got the same results when I asked the models to write a program implementing the Miller-Rabin algorithm to test whether large numbers are prime. When GPT-3 first came out, this was a complete failure: it would generate code that ran without errors, but it would tell me that numbers like 21 were prime. Gemini was much the same, though after several attempts it ungraciously blamed the problem on Python’s libraries for large-number arithmetic. (I gather it doesn’t like users who say, “Sorry, that’s wrong again. What are you doing that’s incorrect?”) Now they implement the algorithm correctly, at least the last time I tried. (Your mileage may vary.)
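For reference, here is a minimal sketch of what a correct Miller-Rabin implementation looks like in Python. It isn’t the prompt or the code from the sessions described above; the choice of twenty rounds and the small-prime pre-check are my own, and anything serious should use a vetted library instead:

```python
import random

def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    # Handle small primes and obvious composites cheaply.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    # Each round uses a random base; a composite survives a round with
    # probability at most 1/4, so the error bound shrinks as 4**-rounds.
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # this base witnesses that n is composite
    return True

# The failure mode described above is easy to check for:
print(is_probable_prime(21))          # False: 21 = 3 * 7
print(is_probable_prime(2**61 - 1))   # True: a Mersenne prime
```

The three-argument pow(a, d, n) does modular exponentiation efficiently, which is what keeps the test fast even for numbers with hundreds of digits.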
My success doesn’t mean there is no room for frustration. I’ve asked ChatGPT how to improve programs that worked correctly but had known problems. In some cases I knew the problem and the solution; in others I understood the problem but not how to fix it. The first time you try that, you’ll probably be impressed: while “put more of the program into functions and use more descriptive variable names” may not be what you’re looking for, it’s never bad advice. By the second or third time, though, you realize that you’re always getting similar advice, and while few people would disagree with it, the advice isn’t really insightful. “Surprised to find it done at all” decayed quickly into “it is not done well.”
This experience probably reflects a fundamental limitation of language models. After all, they aren’t “intelligent” as such. Until we know otherwise, they’re just predicting what should come next based on analysis of the training data. How much of the code on GitHub or Stack Overflow really demonstrates good coding practices? How much of it is rather pedestrian, like my own code? I’d bet the latter group dominates, and that’s what is reflected in an LLM’s output. Thinking back to Johnson’s dog, I am indeed surprised to find it done at all, though perhaps not for the reason most people would expect. Clearly, there is a lot on the internet that isn’t wrong. But there’s a lot that isn’t as good as it could be, and that should surprise no one. What’s unfortunate is that the volume of “pretty good, but not as good as it could be” content tends to dominate a language model’s output.
That is the big challenge facing language model developers. How do we get answers that are insightful, engaging, and better than the average of what’s out there on the internet? The initial surprise is gone, and AI is being judged on its own merits. Will AI continue to deliver on its promise, or will we just say, “That’s dull, boring AI,” even as its output creeps into every aspect of our lives? There may be some truth to the idea that we’re trading delightful answers for reliable answers, and that’s not a bad thing. But we need delight and insight too. How will AI deliver that?
Footnotes
1. From Boswell’s Life of Johnson (1791); possibly slightly modified.