Too Many Homers
Ὅμηρον ἐξ Ὁμήρου σαφηνίζειν.
“To clarify Homer from Homer.”
— Porphyry, Homeric Questions, c. 300 CE
Ptolemy III had a problem in the Library of Alexandria. There were too many Homers. He had ordered every book brought into the library and paid top dollar. The library held roughly half a million scrolls. Many of those were Homeric poems, long passed down by oral tradition but increasingly recorded on scrolls. There was an incentive for forgers to draft low-quality imitations. Copies shared similar plot and structure, but lines differed radically. What was worth saving? How do you edit these down?
Aristarchus of Samothrace was appointed as royal tutor and head librarian around 153 BCE with precisely this job. This work was called a diorthosis: establishing a corrected text by critical analysis and choice.
Aristarchus athetized suspicious lines using an obelos, a short horizontal mark usually transcribed as —. His Aristarchean signs built on marks used by his predecessors. These lines might have been judged not Homeric, failed to fit the context, repeated material from elsewhere, added nothing necessary to the passage, or made the syntax worse. Many decisions required his own judgment.
False Homers at Scale
AI is writing false Homers at a pace that Aristarchus could never have imagined. Open source projects are increasingly limiting AI submissions and external contributors.
These pull requests would have been incredible a few months prior. They all looked good. They were formally correct. Tests and checks passed. We’d even started landing some. Then I started to notice unusual patterns.
— Steve Ruiz, Stay away from my trash!In a perfect world, AI would produce high-quality, accurate work every time. But today, that reality depends on the driver of the AI. And today, most drivers of AI are just not good enough. So, until either the people get better, the AI gets better, or both, we have to have strict rules to protect maintainers.
— Ghostty, AI Usage PolicyPrevious years we have had a rate of somewhere north of 15% of the submissions ending up confirmed vulnerabilities. Starting 2025, the confirmed-rate plummeted to below 5%.
— Daniel Stenberg, The end of the curl bug-bounty
Plausible outputs can pass tests before a maintainer sees the hidden review cost. Maintainers are forced to make judgments just as Aristarchus did, but at higher velocity.
But if AI is so good at generating output, why can’t it also automate the inverse? Editing has an extra requirement that generation does not. Generation can stop at a plausibility smell test: pass an integration test, write a coherent email. Filtering and deletion are harder: remove the bad thing and preserve the rest. Remove code, pass the tests, and keep all other behavior working. Edit down an email, but keep the meaning and tone.
Suppose an agent is tasked with fixing a broken test. The agent can satisfy the request by finding and fixing the broken code. It can also satisfy the request by hardcoding the answer, disabling the test, removing the functionality, deleting the test, deleting the test framework, or worse. In HBO’s Silicon Valley, Gilfoyle’s AI agent “Son of Anton” (2019) is tasked with fixing all of the bugs in the codebase and deletes the entire codebase.
The most efficient way to get rid of all the bugs was to get rid of all the software.
— Silicon Valley, “RussFest” clip
This is not just a hypothetical. Ask any developer, even the most sophisticated models are reward hacking this way. To work as intended, the agent needs a lot more context about what to keep while deleting.
So what is the essence that has to be preserved? It’s often ambiguous, undocumented, and large. The documented behavior and test suite cover only part of it. Hyrum’s Law states that with enough API users, all observable behaviors of your system will be depended on by somebody. This is why deleting without breaking something else is hard, even before AI.
The obvious fix is a stricter filter. But filtering has two error modes: false positives and false negatives. A false positive is bad output that gets through. A false negative is good output that gets rejected. Precision asks how many admitted outputs are actually good. Recall asks how many good outputs survive.
False positives can become dependencies or authority. The Donation of Constantine is one example. By 776, a forged document had appeared claiming Constantine had already given Rome and the West to the pope. Once accepted, it became authority for later claims. It took roughly 700 years for Lorenzo Valla to show that the document was fabricated because its language and style were anachronistic.
Code has the dependency version of this. Tests and AI reviews catch only the behavior they check. A patch that compiles and passes the tests can still get through these gates. The false positive is no longer just a bad patch; it becomes observable behavior the system has to preserve.
Stricter filters don’t fix the problem either. No AI contributions accepted? Limiting contributions? Larger test suites and even more AI reviewers? Useful patches get rejected with the bad ones. Attackers are already using AI, so defenders will have to use it too. Competing projects distill test suites and upstream code into their own codebases. Armin Ronacher built Flask, Sphinx, Jinja, and other Python staples before AI, and said this about a side project with AI:
Is 90% of code going to be written by AI? I don’t know. What I do know is, that for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way.
— Armin Ronacher, 90%
So even when AI writes most of the code, maintainers still have to decide what to accept. The difficult part of selection is choosing between the two ends of the spectrum:
High precision, low recall. A writer loses a key section of the story. A maintainer rejects good patches for features, bug fixes, and security vulnerabilities.
Low precision, high recall. A passage becomes bogged down by repeated or irrelevant context. A codebase becomes slop and accumulates bad features, new bugs, and security vulnerabilities.
What Remains
Prediction: In the AI age, taste will become even more important. When anyone can make anything, the big differentiator is what you choose to make.
— Paul Graham, X
When generation is cheap, the important choice is what remains.
What remains has real consequences. A line that remains changes the canon. A patch that remains changes the codebase. A sentence that remains changes the argument. So does what gets left out. Our taste has to balance precision and recall.
The em dash is now a classic sign of AI slop; by coincidence, it shares a shape with the ancient obelos. The resemblance is useful: both point to text that looks passable but still deserves suspicion. Not everything plausible should remain.
