Skip to content
← All posts
Essays
June 6, 2026 · Updated June 12, 2026 · 4 min read

Your coding agent spends 76% of your money re-reading your codebase

Here is a number that should bother you more than it does: measured across real agent workloads, 76.1% of a coding agent's tokens go to read-type operations — listing directories, opening files, grepping, re-opening files it already saw (SWE-Pruner, arXiv 2601.16746). Writing code — the thing you bought the agent for — is about 12%.

You are paying for a very expensive reader that occasionally writes.

The mechanism is dumber than you think

Two facts about how these systems work, neither of them secret, both of them weird when you put dollar signs on them:

First: the model has no memory. Not between sessions — not even between messages. Every API call re-sends the entire accumulated context. The cost of an agent session therefore grows with the square of its length: a 20-step loop at 1,000 tokens a step bills roughly 210,000 cumulative input tokens, not 20,000. Every file your agent reads at step 3 travels with steps 4 through 50.

Second: the reading unit is the file. The agent needs one function's behavior; it reads the 500-line file. It needs to know who calls that function; it greps and reads whatever matched. Tomorrow, a fresh session needs the same function — same file, read again, full price. One GitHub issue documents cache reads — re-ingestion of context the model had already processed — consuming up to 99.93% of a user's quota.

Multiply: dollars-per-session that scale quadratically, dominated by reads, most reads rediscovering something a previous step or session already established. That's the bill. When usage limits tightened in early 2026 and developers reported five-hour windows draining in 90 minutes, the burn rate underneath was this — not (only) the quota.

The standard advice optimizes the 24%

Walk the standard cost playbook against the 76% number:

  • "Use a cheaper model" changes the rate, not the volume. And a cheap model that fails a multi-file task re-runs the whole loop — cost per token down, cost per task up.
  • "Trim your context" evicts tokens after you paid to read them, and loses fidelity when it guesses wrong about what the task still needed.
  • "Use prompt caching" discounts re-sent prefixes — until you edit a file or touch a tool definition, which invalidates everything after it. Caching protects the static head of the prompt; agents spend in the moving tail.

All three are worth doing. None of them touches the question that sets the bill: why is the agent reading whole files to answer structural questions in the first place?

"Who calls this function" is not a semantic search problem

Most of what an agent wants from your codebase has an exact answer. Where is this defined? What's in this file? Who calls this? What does it import? These are index lookups — your IDE answers them in milliseconds without "reading" anything. The agent answers them by grepping and ingesting files at billed rates, because its tools give it no better primitive.

Interestingly, the highest-scoring agent systems on SWE-bench don't use embedding retrieval over repos either — similarity search gives approximate answers to exact questions, and a silently missing caller ships a breaking change. They use file trees and grep: exact, but token-maximal. The contest that matters isn't exact-vs-fuzzy; it's expensive-exact vs cheap-exact.

What cheap-exact looks like

Parse the repo once, locally. Keep the structure — definitions, references, call edges, outlines — in an index that updates as the code changes. When the agent asks a structural question, answer it from the index: the function body, the caller list, the outline. Not the file.

We benchmarked this head-to-head on real repos with a real tokenizer and a fidelity gate (the compressed answer must contain every fact the task needed): −86% navigation tokens, −90% read tokens against stock agent tools. The savings compound, because every token not admitted to context also doesn't re-bill on every subsequent roundtrip of the session. If you want to sanity-check the arithmetic on your own workload, the interactive model is here — it's the four-meter math, no signup.

The deeper point isn't our numbers. It's that the agent-cost conversation keeps optimizing rates — model prices, cache discounts, plan tiers — while three-quarters of the spend is a systems problem: agents lack a memory of structure, so they buy the same knowledge over and over at token prices. Fix the rates and you get a discount. Fix the rediscovery and the cost mostly stops existing.


The methodology, repos and reproduction steps for our measurements are in the benchmark write-up. Background: token optimization for coding agents · why agents forget everything · what unerr is.

See it on your own repo

Free to start. One install, your codebase, real numbers.