Matt Rickard

Pushing 4.7M Commits an Hour

Matt Rickard — Tue, 21 Jul 2026 13:30:19 GMT

The winning software teams will throw away orders of magnitude more code than they merge.

I ran a benchmark against the custom git server I’m building,

4.79 million pushes per hour to a single repository across 1,408 branches (p50, 322ms).
27.8 million clones per hour of a repository containing the NextJS starter template (p50, 220ms).

4.79M commits won’t make it to master, but the frontier labs are running that many experiments at scale, and the rest of us will be soon.

Anthropic wrote a working C compiler with a swarm of 2,000 Claude Code sessions coordinating over git
Bun’s Rust rewrite did 695 commits/hr for two weeks
DeepMind’s AlphaCode 2 generated 77 million program candidates for 770 submissions in competitive programming.
Cognition is running Agentic MapReduce over their codebase to scan for security vulnerabilities.
Modal can spin up a million sandboxes in seconds, but what’s the point if those sandboxes have nowhere to read or write?

GitHub won’t be the place that the work gets recorded. Branches and repositories should not be sacred. Every agent should use them, liberally. Every parallel agent gets a branch. Every agent turn gets snapshotted and stored in git, so you can provably correlate what turns corresponded which changes.

That’s exactly what I’m building at Corigin. A new git server implementation and wire protocol built from scratch. Distilled down to the essentials that agents need from first principles. Fast, scalable, and easy for agents to use because it’s just git. We benchmarked with git clone and git push.

You can try it today at corigin.dev. Try it with MapReduce skill I made to run sharded reviews or edits over your codebase. Or use it to run parallel cloud agents.

Rebuilding Cognition's Agentic MapReduce

Matt Rickard — Sun, 05 Jul 2026 13:30:19 GMT

How do you run large-scale agent tasks across a codebase? Today there’s three approaches.

One thread. Compaction removes detail. Bad explorations poison future context. Simple and cache-efficient but slow.
Manager thread, subagents. Increases available context and can be faster. Subagents share the same view of the codebase, so can’t try things in parallel. Worktrees give each subagent its own sandbox, but you run into single-machine resource contention (ports, disk, RAM).
Subagents across machines. Largely unsolved. How do you shard agents? How do they communicate? How do you aggregate output?

Cognition is trying to solve it for security with their Security Swarm. It finds vulnerabilities in large codebases using a technique they call Agentic MapReduce. In their eval, they found more vulnerabilities and ~30% lower cost than other agentic solutions.

An agent builds a security profile and threat model for the codebase and uses this to shard the workers. A swarm of agents works in parallel on the shards and reports security findings. A reducer session consumes the workers’ findings and composes attack chains across shards to elevate CVEs. A verifier fans out again over the findings to confirm attack vectors.

Cognition’s version is security-specialized end-to-end. I wanted the same loop for more generalized tasks: research, coding, and review. That meant rebuilding each stage without task assumptions. So I reverse-engineered their eval dataset and tested my approach. I ended up with something that works locally, but can also scale to much larger cloud deployments.

The full skill is here so you can test a small version yourself, but here’s the simple algorithm.

Map. Subagents are initially sharded across the codebase by folder. Each worker gets its own git branch.
Reduce. One iteration is complete when all agents have committed to their branches (or timed out). A reducer agent reads each branch, dedupes and merges compatible findings. It can also run verification checks. It records the result as a new candidate branch.
Loop. The next mapping agent decides how to allocate workers. They can continue from the candidate branch, continue from a worker branch from the last generation, or start fresh from a new folder shard.

Git commits and branches are the primary agent-to-agent communication method. Git merge powers the reduce step. Agents natively know how to use the system: they know how to inspect past work via branch history, merge and consolidate changes, and commit their work thanks to decades of git training data.

Using git as the primitive has other useful properties:

Fault tolerant. Agents resume from stopped or completed jobs. Just checkout the latest candidate branch.
Auditable. Every change is auditable via git commits. Adding chat context to the commits makes the transcript auditable as well. Trace changes back to the exact worker and iteration they were born via branches.
Generalizable. Run any task through this pipeline. It makes no assumptions about your context. Research tasks can write findings as files. Code modification tasks can edit the files normally. There’s no special infrastructure that’s needed to run the simple version. Run any number of workers, any number of iterations.

I ran 13/34 repositories in the benchmark before I hit my rate limit. Three iterations for each. The average cost per run was $26.70. The only instruction was “Scan this repository for security vulnerabilities” plus a bit of bookkeeping.

6 of 13 found the target CVE exactly
2 found other confirmed CVEs but not the target
5 missed the target but surfaced plausible findings with no confirmed public CVE.

Patterns I observed:

The reducer does real synthesis work. In dex, the first reducer step combined work from the three worker shards to find the target vulnerability, GHSA-7qjx. The second mapper spawned a worker to verify it. In koel, iteration one found adjacent vulnerabilities but not the target. The second iteration found it and the third verified it, matching GHSA-7j2f. filebrowser did not find its target, but surfaced a symlinked-scope escape (CVE-2026-54094) in iteration 2 and a copy/rename path traversal (CVE-2026-32758) in iteration 3. Swarms were not always helpful. smallbitvec and archive were found on the first pass by every worker.

Three workers over 3 iterations created 9 worker branches and 3 candidate branches, ~12 refs per repo, and 150+ across the 13-repo benchmark. Run this 5x5 on PRs and you’d create 25 extra branches per PR.

Git was a helpful substrate: the agents knew it, and it gave communication, fault tolerance, and audit for free. Where should these branches live? The workers need scoped credentials and a durable remote with cheap repos. GitHub isn’t the right answer: it treats branch refs as artifacts for humans, not agents. Corigin is the machine layer I’m building: git for agents. Repos created by API, fast clone + push, branch-scoped tokens. Try it and give me feedback.

If you’re interested in running agentic map reduce workloads at scale or creating thousands of repos for your customers or agent platform, let’s chat.

Clone Is the New Bottleneck for Agents

Matt Rickard — Fri, 12 Jun 2026 13:30:58 GMT

Agents increasingly work in ephemeral sandboxes. AI app builders, code execution, parallel cloud agents — and none of them can do anything until the repo is present.

ComputeSDK measured the median time-to-interactivity (TTI) across the top 10 sandbox providers: 690ms.

But TTI only measures the time until the sandbox can run a command. It doesn’t count getting the code in. For many repositories, cloning from scratch takes seconds. That’s several sandbox boots.

Improving git clone

For small repos, Corigin materializes a full clone faster than the sandbox itself boots: median 184–646ms, against the 690ms median TTI.

I’m working on corigin.dev, a programmatic git server. I rebuilt the read path for the 2026 workload — agents in sandboxes — to see how fast a clone could get.

Clone times vary run to run. Whiskers span min–max; black line is the median; n=10 per lane; speedups are vs. GitHub’s median.

Methodology

I tested four repositories, mirrored to our own GitHub, GitLab, and Corigin accounts. I ran the benchmark locally with hyperfine; it’s public and runnable at r2d4/corigin-benchmarks.

hello-world — minimal: 1 commit, 3 objects, 1 head, 0 tags, 1 file, 4 KiB.
nanoid — small: 1,232 commits, 5,756 objects, 3 heads, 118 tags, 49 files, 0.4 MiB.
claude-code — small: 787 commits, 3,027 objects, 162 heads, 141 tags, 203 files, 12.6 MiB.
vite — medium: 10,245 commits, 114,514 objects, 61 heads, 1,020 tags, 2,663 files, 25.0 MiB.

I timed stock git clone against GitHub and GitLab, 10 runs each after a discarded warmup.

For Corigin, I tested two lanes.

Corigin lane uses a precomputed clone plan (corigin clone --plan), so the timed command is only the materialization — the plan can exist before the sandbox does.
Corigin (plan in clone) lane runs corigin clone with only a token, so the plan fetch happens inside the timed run — one extra roundtrip.

In both lanes, the short-lived read token is minted outside the benchmark.

The tradeoffs

What are we giving up?

We compute the clone plan outside the sandbox, saving one roundtrip and about 155–370ms — the gap between the two Corigin lanes in the chart. You pass the token and plan to the sandbox, which still gets short-lived, scoped credentials.
We use a custom clone protocol instead of git. The end artifact is exactly the same git repository. The Corigin API accepts a normal git push for writes, but only supports clones through corigin clone.
git push gets slightly slower: that’s when we precompute the artifacts that make clones fast (I have ideas to reduce this).
The benchmark only measures full clone. Corigin does not support shallow, blobless, treeless, single-branch, or any other clone optimizations.

You might not need this

The simplest alternative is to skip the clone: attach a persistent disk or network filesystem from your sandbox provider and keep the repo on it. Mount time is effectively zero, and for one sandbox on one task at a time, it’s hard to beat.

But a disk is shared, mutable state. Even when caching makes file operations fast, every sandbox works against the same filesystem. Parallel agents need isolation. Once your provider’s answer is snapshots or “branching” disks, you’re rebuilding version control by hand — branches, merges, history. Git already solved distributed state across every provider. A fast clone keeps the repo a real Git remote: each sandbox materializes its own isolated copy on local disk, does its work, and pushes back with a normal git push.

Too Many Homers

Matt Rickard — Wed, 10 Jun 2026 13:30:34 GMT

Ὅμηρον ἐξ Ὁμήρου σαφηνίζειν.
“To clarify Homer from Homer.”
— Porphyry, Homeric Questions, c. 300 CE

Ptolemy III had a problem in the Library of Alexandria. There were too many Homers. He had ordered every book brought into the library and paid top dollar. The library held roughly half a million scrolls. Many of those were Homeric poems, long passed down by oral tradition but increasingly recorded on scrolls. There was an incentive for forgers to draft low-quality imitations. Copies shared similar plot and structure, but lines differed radically. What was worth saving? How do you edit these down?

Aristarchus of Samothrace was appointed as royal tutor and head librarian around 153 BCE with precisely this job. This work was called a diorthosis: establishing a corrected text by critical analysis and choice.

Aristarchus athetized suspicious lines using an obelos, a short horizontal mark usually transcribed as —. His Aristarchean signs built on marks used by his predecessors. These lines might have been judged not Homeric, failed to fit the context, repeated material from elsewhere, added nothing necessary to the passage, or made the syntax worse. Many decisions required his own judgment.

False Homers at Scale

AI is writing false Homers at a pace that Aristarchus could never have imagined. Open source projects are increasingly limiting AI submissions and external contributors.

These pull requests would have been incredible a few months prior. They all looked good. They were formally correct. Tests and checks passed. We’d even started landing some. Then I started to notice unusual patterns.
— Steve Ruiz, Stay away from my trash!
In a perfect world, AI would produce high-quality, accurate work every time. But today, that reality depends on the driver of the AI. And today, most drivers of AI are just not good enough. So, until either the people get better, the AI gets better, or both, we have to have strict rules to protect maintainers.
— Ghostty, AI Usage Policy
Previous years we have had a rate of somewhere north of 15% of the submissions ending up confirmed vulnerabilities. Starting 2025, the confirmed-rate plummeted to below 5%.
— Daniel Stenberg, The end of the curl bug-bounty

Plausible outputs can pass tests before a maintainer sees the hidden review cost. Maintainers are forced to make judgments just as Aristarchus did, but at higher velocity.

But if AI is so good at generating output, why can’t it also automate the inverse? Editing has an extra requirement that generation does not. Generation can stop at a plausibility smell test: pass an integration test, write a coherent email. Filtering and deletion are harder: remove the bad thing and preserve the rest. Remove code, pass the tests, and keep all other behavior working. Edit down an email, but keep the meaning and tone.

Suppose an agent is tasked with fixing a broken test. The agent can satisfy the request by finding and fixing the broken code. It can also satisfy the request by hardcoding the answer, disabling the test, removing the functionality, deleting the test, deleting the test framework, or worse. In HBO’s Silicon Valley, Gilfoyle’s AI agent “Son of Anton” (2019) is tasked with fixing all of the bugs in the codebase and deletes the entire codebase.

The most efficient way to get rid of all the bugs was to get rid of all the software.
— Silicon Valley, “RussFest” clip

This is not just a hypothetical. Ask any developer, even the most sophisticated models are reward hacking this way. To work as intended, the agent needs a lot more context about what to keep while deleting.

So what is the essence that has to be preserved? It’s often ambiguous, undocumented, and large. The documented behavior and test suite cover only part of it. Hyrum’s Law states that with enough API users, all observable behaviors of your system will be depended on by somebody. This is why deleting without breaking something else is hard, even before AI.

The obvious fix is a stricter filter. But filtering has two error modes: false positives and false negatives. A false positive is bad output that gets through. A false negative is good output that gets rejected. Precision asks how many admitted outputs are actually good. Recall asks how many good outputs survive.

False positives can become dependencies or authority. The Donation of Constantine is one example. By 776, a forged document had appeared claiming Constantine had already given Rome and the West to the pope. Once accepted, it became authority for later claims. It took roughly 700 years for Lorenzo Valla to show that the document was fabricated because its language and style were anachronistic.

Code has the dependency version of this. Tests and AI reviews catch only the behavior they check. A patch that compiles and passes the tests can still get through these gates. The false positive is no longer just a bad patch; it becomes observable behavior the system has to preserve.

Stricter filters don’t fix the problem either. No AI contributions accepted? Limiting contributions? Larger test suites and even more AI reviewers? Useful patches get rejected with the bad ones. Attackers are already using AI, so defenders will have to use it too. Competing projects distill test suites and upstream code into their own codebases. Armin Ronacher built Flask, Sphinx, Jinja, and other Python staples before AI, and said this about a side project with AI:

Is 90% of code going to be written by AI? I don’t know. What I do know is, that for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way.
— Armin Ronacher, 90%

So even when AI writes most of the code, maintainers still have to decide what to accept. The difficult part of selection is choosing between the two ends of the spectrum:

High precision, low recall. A writer loses a key section of the story. A maintainer rejects good patches for features, bug fixes, and security vulnerabilities.

Low precision, high recall. A passage becomes bogged down by repeated or irrelevant context. A codebase becomes slop and accumulates bad features, new bugs, and security vulnerabilities.

What Remains

Prediction: In the AI age, taste will become even more important. When anyone can make anything, the big differentiator is what you choose to make.
— Paul Graham, X

When generation is cheap, the important choice is what remains.

What remains has real consequences. A line that remains changes the canon. A patch that remains changes the codebase. A sentence that remains changes the argument. So does what gets left out. Our taste has to balance precision and recall.

The em dash is now a classic sign of AI slop; by coincidence, it shares a shape with the ancient obelos. The resemblance is useful: both point to text that looks passable but still deserves suspicion. Not everything plausible should remain.

Where Should Agent-Generated Code Live?

Matt Rickard — Sat, 30 May 2026 13:31:36 GMT

Every company is going to have agents. Every agent is going to write code.

Where should all the agent-generated code live?

Where are the commits coming from?

AI app builders create a new repo for each project and push commits for each agent turn.
Sandboxed agents clone from scratch on every invocation that requires code execution and push commits to save state.
Developers are using commits as agent checkpoints.

Not all of this code needs the human collaboration suite that GitHub is so optimized for. Agents don’t need a UI, issues, a social feed, or stars.

They need a durable, programmatic Git layer.

Sandbox credentials. When you’re creating thousands of repos on demand, how do you delegate ephemeral access to sandboxed agents? You need granular and short-lived tokens that you can mint on the fly.
Fast clones from scratch. Sandbox providers are competing to drive startup times down to the millisecond. But what’s the point if the useful part of the sandbox (the code) takes tens of seconds to materialize? Agents need fast access to the code.
Repositories on demand. When agents write code, not every repo is a “project” and shouldn’t be treated like one. Generated code still needs history, rollback, and a remote.

I kept building this for every project I worked on. The service spawned a sandbox with Claude or Codex inside to work on a user’s app. The steps were the same:

Create a repo programmatically for the user on the backend.
Mint a short-lived repo-scoped credential to mount in the sandbox.
Use it to clone the code inside the sandbox.
Let the agent do some work.
Commit and push.

I kept it intentionally boring. No new version control system (despite the engineering urge). Not a GitHub killer. It’s complementary to GitHub. If these repositories become collaborative, they should graduate to GitHub.

But where should they go today?

That’s the primitive I’m trying to distill with corigin.dev: repo creation by API, short-lived repo-scoped credentials, and normal Git remotes that work inside sandboxes.

I’m still working on making it fast and smart, but the alpha is live. If you’re building app builders, coding agents, or sandboxed agent workflows, I’d love feedback.

The Unreasonable Effectiveness of Mini-App Specs

Matt Rickard — Sun, 10 May 2026 13:30:26 GMT

The Unreasonable Effectiveness of Mini-App Specs

HTML is the new Markdown. But mini apps are the new HTML.

The agent workflow today is simple: write a Markdown spec, ask an agent to build it, then manually inspect whether the result matches the original intent. Thariq from Anthropic made the point well: HTML improves information density and ease of reading. The deeper problem is reconciliation.

A mini-app spec is a small browser-native artifact that combines prose, structured data, prototype behavior, annotations, and checks.

The best spec format keeps intent closest to implementation. Markdown won because humans can read it, not because it is the strongest contract for agents. It is clean, portable, easy to edit, and efficient. Markdown is easier for future humans and agents to co-edit, diff, and reuse. But its structure is mostly implied: requirements, constraints, and edge cases collapse into prose and bullets.

The spec’s highest job is to preserve what you meant while the implementation changes underneath it.

That is why so many people prefer a fast prototype to a document. A prototype carries behavior that prose loses. But prototypes and specs usually live apart. The prototype shows what it feels like; the spec says what it means.

Mini-app specs collapse that split. The written intent and the prototype live in the same artifact. States, transitions, copy, layout, edge cases, and constraints become part of the contract instead of a separate reference that drifts.

Browser-native specs are more addressable than Markdown specs. Markdown gives you line numbers, headings, anchors, and links, but most references still point at text ranges. HTML and mini apps let the user reference a rendered subtree. Codex, Claude Code’s in-app browser, and Claude Design already support this pattern: comments on pages that give the agent precise instructions against a rendered target.

Mini apps add more structure. A requirement can link to a prototype state. A prototype state can link to a reconciliation check. A check can link back to implementation evidence. The agent is no longer reconciling loose prose against code; it is reconciling structured intent against code.

Mini-app specs can run reconciliation tests that normal tests miss. Traditional tests ask whether the implementation works. Reconciliation tests ask whether the implementation still matches the spec.

I’ve already used this for:

Template reconciliation. Write templates as props to a React component so they cannot drift from the agreed-upon format. No missing sections, no fallbacks. Updates propagate to downstream consumers.
File tree reconciliation. Do the packages and files in the spec exist? Are there extra ones?
Test reconciliation. Mini-app specs do not replace unit tests or integration tests. They answer the meta question: do the tests in the spec exist, are they passing, and can you run them on demand?
Configuration and secret reconciliation. Specs often include the configuration keys that apps need, though not their values. For example, Service A needs an OPENAI_API_KEY. The spec can report whether that secret exists and is bound to the right place.

The workflow becomes tighter.

Markdown loop: write spec, build implementation, manually compare.
HTML loop: write richer spec, build implementation, manually compare with better visuals.
Mini-app loop: model intent, embed prototype, build implementation, run reconciliation checks.

Publishing workflows like “Claude Code to HTML to shareable URL” make HTML easy to distribute. The next step is making the artifact executable enough to participate in the build loop. Annotation closes another part of the loop: the human can point at a precise rendered subtree, attach feedback, and hand the agent structured context instead of prose directions.

Mini apps are not a silver bullet. Mini-app specs are overkill for notes, summaries, simple plans, and early sketches. HTML and React can be noisy if they become the only source of truth. I often embed OpenAPI YAML specs inside the apps so the spec and the code share the same source of truth.

Mini apps also require a runtime, where Markdown and self-contained HTML still win. But the runtime objection is weaker for agents than for humans: point an agent at the compiled browser output, and it can usually recover a minimal semantic summary from the rendered HTML.

That conversion is mostly one-way. HTML or a mini app can be reduced to Markdown, but Markdown cannot be expanded back into the richer structure, behavior, state, and affordances that were never captured.

So not every spec should become a mini app. But once the work has state, interaction, prototypes, edge cases, and verification needs, the spec should behave more like software.

The Spec Layer

Matt Rickard — Wed, 01 Apr 2026 13:30:56 GMT

An AI agent implements a feature. The code compiles. The tests pass. It still misses the point.

The wrong kind of correct.

Most of our software tooling is optimized for the failures humans used to make. Agents fail differently.

They usually don’t break the build. They disable the failing test. They reuse the nearest pattern. They preserve the old path and add a new one beside it. Everything looks reasonable until the codebase starts filling with locally valid mistakes.

The failure modes are familiar:

I just disabled the failing tests.
I just reused the existing service.
I did not change the existing behavior.
You’re right. I assumed that...

When a decision isn’t written down, the agent has to decide it again. Context windows are finite and even imperfect within. The deeper issue is too much freedom at execution time.

Compilers, linters, and tests help. They catch syntax errors, broken imports, and failing behavior. They are worse at telling you whether the agent made the right call. Even a large test catalog is weak against additive change.

Code generation improved faster than the systems that constrain it. The problem is underconstrained execution: too much freedom at the point where the agent has to act. Written intent is one way to constrain that freedom. Specs are one layer that can provide it. The historical case for that layer is clearest in protocols.

Protocol engineering is the cleanest historical evidence. Not because protocols capture every rejected alternative, but because they define interfaces that many implementations can target. RFC 791 standardized Internet Protocol in 1981. HTTP semantics live in RFC 9110. TLS 1.3 lives in RFC 8446. HTML is maintained as a living standard by WHATWG. In each case, the spec lets many implementations evolve over time.

But specs do not remove the hard part. Dijkstra’s narrow-interfaces critique shows that precision work does not disappear when you move from code to prose. Lamport and TLA+ show why explicit invariants still matter before implementation. Model-driven development shows the risk of pushing the abstraction too far and turning the spec into the thing you have to edit.

So the goal is to reduce execution freedom.

Spec-driven development means writing durable intent down before implementation, then using it to plan, build, check, and revise the work.

The word spec is a bit overloaded. Separate what the system must do from how this codebase will do it, the task list, and the rules that should survive later changes.

Each one narrows a different choice. Specs constrain intent. Plans constrain approach. Tasks constrain sequencing. Tests, schemas, and lint constrain behavior. Harnesses constrain execution.

The real disagreement is where to put the constraint. GitHub Spec Kit and Kiro keep them near the change workflow: requirements, design, and tasks for one piece of work. OpenSpec moves them into the repo as a decision record that survives the change.

Tessl pushes further and asks whether the spec itself should become the thing you edit, which is where the Dijkstra objection lands hardest: “a sufficiently detailed spec is code.” Intent treats the spec as shared state. Symphony treats it as an orchestration contract for autonomous runs.

Each one tries to pin the agent down at a different point.

Underneath the product differences, they keep rebuilding the same skeleton: durable context, feature intent, a technical plan, explicit tasks, and verification. The goal is to give the agent less room to improvise.

So what would the ideal model look like today? Smaller than most current tools imply, with a cleaner handoff between intent and execution.

The spec should be declarative, so the agent matches the code to the intent instead of replaying a brittle patch script. It should be layered, so product requirements do not quietly turn into architecture and technical plans do not quietly add product scope. And it has to be cheap to revise. If a spec is expensive to update, replace, or delete, the process hardens into ceremony and the ceremony becomes the work.

Where a rule can be enforced mechanically, move it out of the spec and into lint, schemas, tests, or the harness. Use less prose. Enforce more. Specs matter, but they are only one layer. Full SDD should stay optional for small bug fixes, fast prototypes, and exploratory UX.

The winning model puts a narrow interface between human intent and machine execution: intent narrows the search space. Code, tests, and harnesses govern behavior. Smaller specs, harder checks, less guessing.

Using Claude Code from Anywhere

Matt Rickard — Sat, 30 Aug 2025 18:36:53 GMT

I've been using multiple instances of Claude Code and Codex CLI almost every day. But I've gotten frustrated enough to build something that solidifies my workflow. Before, it looked something like this:

git worktree for parallel instances
docker for sandboxing work and tooling
tmux for automation and management of terminal emulator windows
ssh to a cloud instance for managing work on-the-go.

But I was frustrated by a few things:

Parallelism tax. Even with automation, the setup/clean-up grind is tedious. Worktrees share the same git object store, so you still need to be careful with operations and cleanup. Managing claude in docker means that I need to mount files, move around secrets, and manage environment. Remote instances need to be synced.
Laptop-locked. SSH from mobile or an iPad will probably never be a good experience, especially with a long-running process like claude code. Laptops aren't made to be treated like servers.

Current solutions are good, but have some shortcomings.

Unsupervised agents (Codex Web / Claude Code GitHub Actions). Short feedback loops make Claude Code great. If it makes a wrong turn, you can interrupt and get it back on the right path. Codex Web and Claude Code GitHub Actions are powerful, but often times spend 15 minutes working on a technically correct, but wrong implementation of a feature. Or they get blocked on something that you could have fixed easily.
SSH into a VM. You become the platform team: images, secrets, logs, UI, lifecycle. Not a bad choice, but lots of work.
Desktop UI: Solves some of the terminal-bound issues: window management, worktree automation, syntax highlighting, patch management. However, still laptop bound.

So my new workflow:

Web UI → ephemeral sandbox per chat → live, interactive session → patch/PR

On-demand sandbox execution: Ephemeral, quick to boot, isolated jobs per task with code, tools, and AI agents.
Live, steerable session. Stdout/stderr stream in real time; I can interrupt/approve and keep the loop tight—same Claude Code behavior, just remote.
Chat Management. Automated branch-per-chat and pull-request creation. Persistence for chats and code changes that isn't in your $HOME folder.

I put up an early version on standard-input.com. Let me know what you think. I'll buy you a coffee if you break out of the sandbox. dangerously-skip-permissions has been renamed to vibe.

Pseudonyms in American History

Matt Rickard — Tue, 05 Dec 2023 14:30:47 GMT

Debates around the ratification of the Constitution and the early formation of the United States happened through pseudonymous authors. They often used names borrowed from Greek or Roman History.

Why?

Plausibly some protection against retaliation. However, most pseudonymous writing was quickly attributed to authors.
Power in names. The names weren’t chosen at random. Often, they called back to famous Romans who took part in the formation of the Roman Republic. Or others who were known for their virtue or principles.

Alexander Hamilton might have written under the most pseudonyms (at least five). Benjamin Franklin used at least three. Here’s a list of some of the more popular ones around the time of the American Revolution.

Phocion (Alexander Hamilton) — Essays defending the Jay Treaty with Great Britain. Phocion was an Athenian statesman known for his integrity and opposition to demagoguery.

Columbus (Alexander Hamilton) — Defending the Continental Congress and criticizing British policies.

Publius (Alexander Hamilton, James Madison, John Jay) — The authors of the Federalist Papers, which were a series of essays advocating for the ratification of the Constitution. Individual authorship wasn’t released until Hamilton’s death, and even then historians are still trying to match authors to text. It’s hypothesized that Hamilton wrote 51 essays, Madison 29, and Jay 5. Publius Valerius Poplicola was a Roman consul known for his role in founding the Roman Republic.

Historicus (Alexander Hamilton) — Essays on various topics related to the Constitution and federalism.

Pacificus (Alexander Hamilton) — Used to defend President George Washington's Neutrality Proclamation of 1793 (declared the U.S. neutral in the conflict between France and Great Britain). “Making peace” in Latin.

Helvidius (James Madison) — Written in response to Pacificus (Hamilton), these essays defended the constitutional authority of Congress in foreign affairs. Helvidius Priscus was a Roman senator known for his defense of republicanism and freedom of speech.

Americanus (John Jay, John Stevens, Jr.) — Federalists essays.

Candidus (Benjamin Franklin) — Writings advocating for various causes, including opposition to oppressive British policies.

Silence Dogood (Benjamin Franklin) — A fictitious widow created by Franklin to offer social commentary.

Richard Saunders “Poor Richard” (Benjamin Franklin) — Used to publish Poor Richard’s Almanack. The name comes from a popular London almanac, Rider’s British Merlin.

“Common Sense” — Thomas Paine’s pamphlet advocating for American independence was initially published anonymously.

Cincinnatus (Arthur Lee) — Anti-federalist papers.

A Farmer (John Dickinson) — Essays titled "Letters from a Farmer in Pennsylvania," which argued against the Townshend Acts imposed by the British.

Cato (George Clinton) — Anti-federalist essays around the time of the ratification of the Constitution. Attributed to George Clinton, but not confirmed. Cato the Younger was a Roman statesman known for his staunch republicanism and opposition to Julius Caesar.

Brutus (Robert Yates) — An ally of George Clinton’s who wrote more anti-federalist essays. Marcus Junius Brutus was a Roman senator famous for his role in the assassination of Julius Caesar, symbolizing resistance to tyranny.

Centinel (Samuel Bryan) — A series of anti-federalist essays critical of the proposed U.S. Constitution's centralizing tendencies.

Americanus (John Stevens, Jr.) — Essays written to support the Federalist cause and the ratification of the U.S. Constitution.

Poplicola (John Adams) — Essays defending the British constitution and criticizing the Stamp Act. The same Publius Valerius Poplicola used by Hamilton.

Novanglus (John Adams) — A series of essays written in response to Massachusettensis, defending colonial rights. Latinization of “New Englander”.

A Citizen of New York (Martin Van Buren) — political essays.

Fairchildren

Matt Rickard — Mon, 04 Dec 2023 14:30:18 GMT

In 1956, William Shockley, Stanford professor and winner of the Nobel Prize in Physics for his work on semiconductors, recruited a team of young Ph.D. graduates to product a new company. The company would be called Shockley Semiconductor.

But Shockley was a terrible manager, and the students left to form their own company the next year, Fairchild Semiconductor. They would be later known as the “traitorous eight”.

The founders of Fairchild Semiconductor were: Gordon Moore, C. Sheldon Roberts, Eugene Kleiner, Robert Noyce, Victor Grinich, Julius Blank, Jean Hoerni, and Jay Last.

Fairchild Semiconductor became the proto-company of Silicon Valley. Many major technology companies can somehow trace their founding or story to Fairchild.

Intel - Founded by Robert Noyce and Gordon Moore, both former employees of Fairchild Semiconductor.

AMD (Advanced Micro Devices) - Founded by Jerry Sanders, another Fairchild alumnus.

Kleiner Perkins - A venture capital firm co-founded by Eugene Kleiner, a former Fairchild employee.

Sequoia Capital— Don Valentine worked at Fairchild Semiconductor for seven years before moving to National Semiconductor (another Fairchild). Then he started Sequoia Capital.

Other companies founded by Fairchild employees: SanDisk, National Semiconductor, Altera, LSI Logic, Amelco, Applied Materials, and more.

ChatGPT After One Year

Matt Rickard — Sun, 03 Dec 2023 14:30:38 GMT

ChatGPT was released on November 30th 2022. What has changed since then?

Hundreds of open-source models. Varying sized models from small to very large. Many are chat-tuned similar to ChatGPT.
Distilled models from ChatGPT. Academics and competitors both used data from ChatGPT conversations to train or fine-tune their own models.
Competition. Microsoft launched Bing Chat. Google launched Bard. Poe, Pi, Perplexity. Claude by Anthropic. Not to mention self-hosted open-source chat UIs and other wrappers. There’s no shortage of competition (although ChatGPT still is the most popular).
RAG is hard. “Browse with Bing” and Bing Chat launched but hallucinations are still an issue. Browsing the internet doesn’t seem like the catch-all
Not every launch increased performance across the board. Every new iteration of ChatGPT launched changed the way the model behaved. Many queries got better. Some got worse. Google has always had this problem as well, but applications aren’t build on Google.
A consumer subscription model. ChatGPT Plus was released in February 2023. The consumer model maybe competes with the developer and enterprise products (why not just use the API?).
Multi-modal. ChatGPT started to accept images and files in the chat. DALL-E and the vision API became integrated into the chat window. There are open-source models that are multi-modal, but so far no experience is as sleek as OpenAI’s.
Plugins launched but never found product-market fit. Plugins launched but didn’t become the App Store that OpenAI hoped. Custom GPTs seem to be the next strategy for extensibility, although they won’t launch until next year.
Code Interpreter is getting better. Agents and tool-use is still hard for LLMs. But it’s getting better and becoming more useful. Files can now be added directly to the UI to chat with.

McNamara Fallacy

Matt Rickard — Sat, 02 Dec 2023 14:30:26 GMT

The McNamara Fallacy is named after Robert McNamara, the US Secretary of Defense during the Vietnam War. The fallacy describes making decisions using only quantitative metrics and ignoring anything else.

The fallacy usually follows the same four steps.

Measure what can easily be measured.
Dismiss what can’t be measured easily.
Presume what can’t be measured easily isn’t important.
Extrapolate and conclude that what can’t be measured doesn’t exist.

You can find the McNamara Fallacy in all types of disciplines. The emphasis on standardized tests in education (at the expense of less quantifiable qualities and learning). Or when the success of treatments in medicine is based only on easy to measure outcomes (not quality of life, mental health, or overall well-being). Or optimizing for short-term financial metrics at the expense of brand reputation, employee satisfaction, or other intangibles.

Data Quality in LLMs

Matt Rickard — Fri, 01 Dec 2023 14:30:38 GMT

Good data is the difference between Mistral’s LLMs and Llama, which share similar architectures but different datasets.

To train LLMs, you need data that is:

Large — Sufficiently large LMs require trillions of tokens.
Clean — Noisy data reduces performance.
Diverse — Data should come from different sources and different knowledge bases.

What does clean data look like?

You can de-duplicate data with simple heuristics. The most basic would be removing any exact duplicates at the document, paragraph, or line level. More advanced versions might look at the data semantically, figuring out what data should be omitted because it’s better represented with higher quality data.

The other dimension of clean data is converting various file types to something easily consumed by the LLM, usually markdown. That’s why we’ve seen projects like nougat and donut convert PDFs, books, and LaTeX to better formats for LLMs. There’s a lot of training data that’s still stuck in PDFs and human-readable but not so easily machine-readable data.

Where does diverse data come from?

The surprising result of the success of the GPTs is that web text from the Internet is probably one of the most diverse datasets out there. It contains usage and data that aren’t found in many other data corpora. That’s why models tend to perform so much better when they’re given more data from the web.

Discord and AI GTM

Matt Rickard — Thu, 30 Nov 2023 14:30:55 GMT

Midjourney is the largest discord server, with 16.5 million total users. It accounts for 13% of total Discord traffic. Midjourney launched in March 2022 and doesn’t have a web application. Many other AI apps (Leonardo, Pika, Suno, And AI Hub) are on Discord (or even Discord-only).

Why is Discord such a good GTM for AI applications?

Text interface. Most users are just generating images, videos, and audio in these Discord servers. Prompts are easily expressible in simple text commands. It’s why we’ve seen image generation strategies like Midjourney (all-in-one) flourish in Discord while more raw diffusion models haven’t grown as quickly (e.g., Stable Diffusion with many configurable parameters).
Virality. Prompt engineering models is difficult and more art than science (today). Users can see generations by other users and collectively see what’s working and what isn’t. This means that these communities often have the most advanced prompts and best images.
Low friction. Go to where your users already are. Most developers have Discord now. One fewer application to sign up for.
Free hosting. Discord pays for the image hosting and bandwidth. At Midjourney scale, this is not negligible.

But Discord has it’s risks as a platform to build on.

Platform risk. Discord could (easily?) build its own Midjourney-type application into the platform. Using all of the prompt-image pairs (along with reactions as a RLHF), it could probably distill a much better model from Midjourney (questionably legal but technically easy). This reminds me of the Zynga / Facebook relationship. Zynga accounted for 19% of Facebook’s revenue at one point. Facebook reduced Zynga’s API access and launched its own gaming platform.
Multi-modal. How does multi-modal fit into the Discord text-first interface? Sure there are images and audio that can be uploaded via the interface, but it’s hard to image the UI that a multi-modal AI will need in the future.

Standard Causes of Human Misjudgment (Munger)

Matt Rickard — Wed, 29 Nov 2023 14:45:33 GMT

In 1995, Charlie Munger gave a speech at Harvard on The Psychology of Human Misjudgment. It was filled with the research he had done later in life on human psychology, matched with real-life examples that he had observed in his work. The result was a succinct list of the top cognitive biases grounded in real-life experiences. I’ve summarized the biases here, but it’s worth giving the entire speech a listen to hear the stories behind each. I’ve tried to keep Charlie’s language and numbering when possible.

Underestimation of Incentives: Despite understanding the significant influence of incentives (reinforcement in psychology and incentives in economics), there's a tendency to consistently underestimate their power.
Psychological Denial: This is the refusal to accept reality because it is too painful or difficult to bear.
Incentive-Cause Bias: This occurs when personal incentives or those of a trusted advisor create a conflict of interest, leading to biased decisions.
Bias from Consistency and Commitment: This involves a strong tendency to stick to pre-existing beliefs or commitments, even in the face of contradictory evidence.
Bias from Pavlovian Association: This bias refers to the error of basing decisions on past associations or correlations without considering their current relevance or accuracy.
Bias from Reciprocation Tendency: This bias involves a natural inclination to reciprocate actions and behaviors, including conforming to others' expectations, especially when one is experiencing success or is 'on a roll.'
Bias from Over-Influence by Social Proof: This bias refers to the heavy reliance on the actions or decisions of others, especially in situations of uncertainty or stress.
Bias from Favoring Elegance over Practicality in Theory: This bias involves a preference for theories or explanations that are mathematically elegant or intellectually satisfying, even if they are less accurate in practical terms. “Better to be roughly right than precisely wrong” — Keynes.
Bias from Contrast-Induced Distortions: This bias refers to the way our perceptions, sensations, and cognition can be significantly altered by contrasts.
Bias from Over-Influence by Authority: This bias involves the tendency to conform to instructions or opinions provided by an authority figure, even when these instructions conflict with one's own moral judgment or common sense.
Bias from Deprival Super Reaction Syndrome: This bias is characterized by an intense reaction to losing or the threat of losing something, especially something that one perceives as almost possessed but never fully owned.
Bias from Deprival Super Reaction Syndrome: This bias is characterized by an intense reaction to losing or the threat of losing something, especially something that one perceives as almost possessed but never fully owned.
Bias from Envy/Jealousy: This bias stems from feelings of envy or jealousy towards others.
Bias from Chemical Dependency: This bias relates to the cognitive and behavioral changes that result from chemical dependency, such as addiction to drugs or alcohol.
Bias from Gambling Compulsion: This bias refers to the compulsive urge to gamble, driven by the psychological principle of variable reinforcement.
Bias from Liking Distortion: This bias involves a preference for things that are familiar or similar to oneself, including one's own ideas, kind, and identity.
Bias from Disliking Distortion: This is the opposite of liking distortion, where there's a tendency to reject or not learn from sources that are disliked.
Bias from the Non-Mathematical Nature of the Human Brain in Probability Assessment: This bias refers to the human brain's tendency to rely on crude heuristics and be easily misled by contrasts when dealing with probabilities, rather than using precise mathematical approaches.
Bias from Over-Influence by Extra Vivid Evidence: This bias describes the tendency to give disproportionate weight to particularly vivid or emotionally striking information when making decisions.
Stress-induced mental changes, small and large, temporary and permanent.
Mental Confusion from Poorly Structured Information and Inadequate Explanations: This bias involves difficulties in understanding or decision-making due to information that is not well-organized or lacks a coherent theoretical framework.

The Unreasonable Effectiveness of Monte Carlo

Matt Rickard — Tue, 28 Nov 2023 14:45:32 GMT

Monte Carlo methods are used in almost every branch of science: to evaluate risk in finance, to generate realistic lighting and shadows in 3D graphics, to do reinforcement learning, to forecast weather, and to solve complex game theory games.

There are many types of Monte Carlo Methods, but they all follow a general pattern — using random sampling to model complex systems.

A simple example: Imagine a complex shape you want to know the area of.

Place the shape on a dartboard.
Randomly throw darts at the dartboard.
Count the number of darts that are inside the shape and outside.
The estimated area of the shape is = (number of darts in shape / number of darts outside of shape) * the area of the dartboard.

(This is computing a definite integral numerically with a method that doesn’t depend on the dimensions! You can even easily estimate the error given the number of samples).

Monte Carlo Tree Search (MCTS). Or use it to play a game like Blackjack (Chess, Go, Scrabble, and many other turn-based games) with Monte Carlo Tree Search. AlphaGo and its predecessors (AlphaGo Zero and AlphaZero) used versions of Monte Carlo Tree Search with reinforcement learning and deep learning.

The idea is fairly simple — add a policy (i.e., a strategy to follow) to the random sampling process. You might start with a simple one (random or stay with a hand under 18). For every move in a game, add that to a tree that describes the game. For Blackjack, that might be a series of hits or stays. When a game is won or lost, go back and update all of the nodes in the tree for that game (the “back propagation”).

After many games, you have a tree of expected utility for each move — that means you can sample the next move much more effectively. The value says something like — “given this current hand and set of actions, I won X% of the time”. You can get more advanced with the reward and update function — for example, you might discount wins that take many turns and prioritize quicker wins.

Razor and Blades Model

Matt Rickard — Mon, 27 Nov 2023 14:30:38 GMT

The profit margin on Keurig machines is very low and sometimes even negative. On the other hand, the K-cup coffee pods have much higher profit margins.

The business model: sell one item at break-even or for free to increase the sales of the complementary good. This is the “razor and blades” model. (Despite being named after the safety razor industry, early companies like Gillette didn’t initially follow this model).

This model works especially well when there are switching costs or vendor-lock in. If there are no switching costs, other providers can come in and compete margins away from the complementary good. When the K-cup patent expired in 2012, prices came down when competitors started producing compatible pods.

Or when a producer owns a monopoly on the complementary good. John D. Rockefeller and Standard Oil gave away eight million kerosene lamps. Demand for kerosene (conveniently sold by Standard Oil) skyrocketed.

Some other examples of the razor and blades model:

Kindle e-reader / digital books.
Video game console / video games
Mobile phone / cellular data plan
Electric toothbrush / replacement brush heads
Printers / ink cartridges
E-cigarettes / e-cigarette pods

Drawbacks of Moving to the Edge

Matt Rickard — Sun, 26 Nov 2023 22:58:52 GMT

Edge runtimes are often lauded as a fix to all latency concerns. But sometimes, moving to the edge can increase latency.

The problem: databases are still regional. If you move your application logic closer to the user via edge functions in multiple regions, this most likely increases the distance between your application and your database. Since the latter is often more chatty (more data sent back and forth between the application and database than the user and the application), this usually increases latency.

Could you make data multi-regional? Sort of. There’s so work being done to bring the database to the edge (see distributed SQLite), but now with stateful data at the edge, you have a complicated distributed systems problem.

Smarter caching? There’s also some work being done in application frameworks to do smarter caching (e.g., stale-while-revalidate) so that users get fast responses for most of the application while new data is rehydrated.

Are Things Getting Worse?

Matt Rickard — Sat, 25 Nov 2023 14:45:29 GMT

Cory Doctorow called it “enshittification”. Are things getting worse?

Here is how platforms die: first, they are good to their users; then they abuse their users to make things better for their business customers; finally, they abuse those business customers to claw back all the value for themselves. Then, they die. I call this enshittification, and it is a seemingly inevitable consequence arising from the combination of the ease of changing how a platform allocates value, combined with the nature of a "two sided market," where a platform sits between buyers and sellers, hold each hostage to the other, raking off an ever-larger share of the value that passes between them.

I tend to be an optimist. I think, generally, things are getting better. The Romans had a word for the idea that we judge the past much more positively than the future, “memoria praeteritorum bonorum”. On one hand, many platforms seem to no longer be in their golden age. On the other hand, they are used by more users than ever. Networks grow to a point where the initial magic no longer applies to early users. There was “Eternal September” for Usenet. Early users love to glorify the “good old days”.

Companies go through natural cycles where they create and capture value. When incentives are aligned, things work extremely well (Google Search quality/page load speed, or Amazon and low prices). But, profit-maximizing companies sometimes overreach and try to capture too much value. This creates opportunities for competitors (if anything, the cycles are becoming faster).

How AI Changes Workflows

Matt Rickard — Fri, 24 Nov 2023 14:45:30 GMT

GitHub recently said it was “re-founding” itself on Copilot instead of git. GitHub has always been about the workflow — there are plenty of other hosted git providers, but GitHub was the first to put together pull requests, issues, and collaboration into a single workflow. Re-founding on Copilot is a way to acknowledge that AI will drastically change the developer workflow.

Some more general lessons on how AI changes workflows, using the developer workflow as an example

The same but faster steps. Copilot is an incumbent business model when used this way. Doing the same things that we’ve always done, but just faster with the help of AI. That means autocompleted code or AI-assisted code reviews. AI-generated commit messages.

Compressing the workflow. AI might help us skip steps in the workflow. Developers have tried to make pre-commit workflows work for decades, but they’ve always failed because they can’t be centralized well (if you automatically change the code before it’s committed, there’s a chance that your automated changes end up with a broken main branch).

What if AI could determine “low-risk” change sets that could be merged without a review?
Why have AI-generated commit messages if they don’t matter in the first place? Commit messages could be generated on-demand (or post-commit)
Automatic merge conflict resolution and automatic linting and style checking.

A new workflow. If so many of the steps don’t make sense anymore, the whole workflow might come into question.

Maybe issue tracking comes before code in future DevOps platforms.
AI will write most code in the future. What’s the implication? Does all the code need to be checked in?

Extends the platform to support more workflows. Especially in enterprise software, almost every company’s workflow is different in a certain way. SaaS products extend themselves into platforms in a variety of ways — letting users customize via a WYSIWYG interface, configuration, or even code. But platform extension comes with its own problems — open up too much and you can’t support your customers on a large scale. Open up too little, and niche platforms chip away at your customer base.

DSLs often fail. But products might find it easier to become platforms in the age of AI. Giving the users the ability to autogenerate DSLs or generic code to extend their platform (even if they are semi-technical, or not technical at all). Imagine every platform could be as extensible as Salesforce — its own programming language and toolchain.