Eight more months of agents

2026-02-08

I wrote up my experiences programming with LLMs a bit over a year ago, and updated it for the world of agents eight months ago. A lot has changed since then, so here is an update.

Agents have improved dramatically in a year

We were prototyping our first agent, Sketch, when Claude Code was released 12 months ago. So I, by good fortune, got to be there and be excited right at the beginning. They could be helpful for some things some of the time!

Agent harnesses have not improved much since then. There are things Sketch could do well six months ago that the most popular agents cannot do today. The agent harness is critical, there is plenty of innovation to be done there, but it is as interesting a space right now as compiler optimizations were during the megahertz explosion of the 1990s.

Right now, it is all about the model.

And on the models: there are plenty of public benchmarks but they have all been gamed to death. Ignore them. Clearly the frontier model companies have good internal evals, because the models have qualitatively changed dramatically. In February last year, Claude Code could write a quarter of my code. In February this year, the latest Opus model can write nine tenths of my code. It all needs to be carefully read, and regularly adjusted, but now I can and do rely on the model to do the adjustments for me.

There has been no obvious change in models. Nothing like when GPT2 started talking back. There has however, clearly been a huge incremental improvement in the ability of coding models to get to useful results. (All of this, admittedly qualitative, progress is the most positive economic signal I see today.)

At a big company, my time was 80-20 reading code to writing code. At a startup, it used to be closer to 50-50. Now it is 95-5.

IDEs are clearly waning

The history of IDEs is so strange.

On the one hand, the IDE is obviously correct. Of course I should have a development environment that provides as much information and assistance as I can effectively use. By far the greatest IDE I have ever used was Visual Studio C++ 6.0 on Windows 2000. I have never felt like a toolchain was so complete and consistent with its environment as there.

Since those glorious moments in 1999, I have spent more of my programming life outside of IDEs than in them. The truth of programming environments is they are a hot mess. Unix was great, the Howl's Moving Castle we have bolted onto an over-taxed set of Unix concepts, not so much. The same thing happened to that win32 API I used to use in VS6.0, still there, with a giant mess atop and around it and entirely unignorable.

Then co-pilot came out and it seemed the IDE was inevitable. It did not matter how miserable it was trying to fit your IDE into your environment, you had to do the work because LLM-assisted auto-complete and edit were too powerful to ignore. They made my typing go 50% further and a large amount of the programming I do is typing limited, so the effect was enormous.

In 2021, the IDE had won.

In 2026, I don't use an IDE any more.

The degree of certainty I felt about a copilot future, and the astonishing whiplash as agents gave me a better tool not four years later still surprises me.

The only IDE-like feature I use today is go-to-def, which neovim is capable of with little configuration. So here I am, 2026, and I am back on Vi.

Vi is turning 50 this year.

Using anything other than the frontier models is actively harmful

A huge part of working with agents is discovering their limits. The limits keep moving right now, which means constant re-learning. But if you try some penny-saving cheap model like Sonnet, or a second rate local model, you do worse than waste your time, you learn the wrong lessons.

I want local models to succeed more than anyone. I found LLMs entirely uninteresting until the day mixtral came out and I was able to get it kinda-sorta working locally on a very expensive machine. The moment I held one of these I finally appreciated it. And I know local models will win. At some point frontier models will face diminishing returns, local models will catch up, and we will be done being beholden to frontier models. That will be a wonderful day, but until then, you will not know what models will be capable of unless you use the best. Pay through the nose for Opus or GPT-7.9-xhigh-with-cheese. Don't worry, it's only for a few years.

Built-in agent sandboxes do not work

The constant stream of "may I run cat foo.txt?" from Claude Code and "I tried but cannot go build in my very-sophisticated sandbox" from Codex is a nightmare. You have to turn off the sandbox, which means you have to provide your own sandbox. I have tried just about everything and I highly recommend: use a fresh VM.

I have far more programs and services than I used to

This is why I am building exe.dev. I need a VM, with an unconstrained agent, that I can trivially start up and type the one liner I would have otherwise put into an Apple Note named TODO and forgotten about. A good portion of the time Shelley turns a one-liner into a useful program.

I am having more fun programming than I ever have, because so many more of the programs I wish I could find the time to write actually exist. I wish I could share this joy with the people who are fearful about the changes agents are bringing. The fear itself I understand, I have fear more broadly about what the end-game is for intelligence on tap in our society. But in the limited domain of writing computer programs these tools have brought so much exploration and joy to my work.

I am extremely out of touch with anti-LLM arguments

New technology brings a lot of challenges and reasonable concerns. I spend my days trying to push the limits of agents, so I see them fail catastrophically several times a week. Significant change also changes labor markets which has many effects, good and bad. In 1900, 33% of Americans lived on a farm, and 40% worked in agriculture. In 2000, less than one percent lived on farms and 1% of workers are in agriculture. That was a net benefit to the world, that we all don't have to work to eat. (The numbers are even more dramatic if you go back another century.) But a lot of pain and heartbreak can and did happen along the way. It is right to be concerned.

But far more than measured analyses of the reality of the changes that are happening, I see hard anti-LLM takes that a year ago I disagreed with, and now I just cannot understand. It sounds like someone saying power tools should be outlawed in carpentry. I deeply appreciate hand-tool carpentry and mastery of the art, but people need houses and framing teams should obviously have skillsaws. To me that statement is as obvious as "water is wet".

A lot has to change

Most software is the wrong shape now. Most of the ways we try to solve problems are the wrong shape.

To give you an example, consider Stripe Sigma. This product is a nice new SQL query system for your Stripe DB. It has a little LLM built into it to help you write queries. The LLM is not very good. I want Claude Code or Codex writing my queries. But Stripe launched a fancy Sigma UI with an integrated helper before their API. There is a private alpha for the SQL REST endpoint that I do not have access to yet. So instead I had my agent do ETL-from-scratch: it used the standard Stripe APIs to query everything about my account, build a local SQLite DB, and now my agent queries against that far better than Sigma can.

I implemented that entire Stripe product (as it relates to me) by typing three sentences. It solves my problem better than their product.

That's the world we are in today. By far the worst product I had to use every day in this new world were clouds, so that's what I'm building over at exe.dev. It's a lot harder than it looks, but the entire point of the product is you should never feel that your agent should rewrite part of it for you.

Along the way I have developed a programming philosophy I now apply to everything: the best software for an agent is whatever is best for a programmer. The practical nature of writing software for customers has traditionally pushed us away from that philosophy. Product Managers have long had to find gentle ways to tell engineers: you are not the customer. Well, that has all been turned on its head. Every customer has an agent that will write code against your product for them. Build what programmers love and everyone will follow.

Hopefully that philosophy will survive the next year of changes wrought by LLMs.


Index
github.com/crawshaw
twitter.com/davidcrawshaw
david@zentus.com