Ben Lai — Blog

Local models won the long tail

Wed, 27 May 2026 00:00:00 GMT

I was on a call with a CTO who runs a developer-tools company. They started with GPT-4 for everything. Over the last six months they moved seventy percent of their volume to a local model running on a single H100. The remaining thirty percent is on Claude or Gemini, depending on the task. Their inference bill went from thirty-four thousand dollars a month to twelve hundred.

This is not a story about whether local models match frontier on paper. They don’t, on paper. It’s a story about where the actual call volume goes once a product ships. The frontier handles the hard five percent: the open-ended generation, the genuinely novel reasoning, the user-facing chat. The other ninety-five percent is classification, extraction, formatting, retrieval reranking, summarisation of structured input. A 70B-class model finetuned on the company’s own traffic for an afternoon will often outperform a generic frontier model on those tasks, because it is trained and evaluated on the company’s actual data rather than on a generic benchmark.

What changed

Three things made the math obvious in the last twelve months:

Open-source quality. The current generation of open weights closed enough of the quality gap on standard tasks that the cost difference stopped being defensible.
Inference engines. vLLM and friends made batched throughput on a single GPU competitive with managed inference at lower volumes than people realised.
Finetuning got cheap. A QLoRA on five thousand of your own examples, on a rented A100 for three hours, beats prompt engineering for almost any narrow task.

The diagnostic

Pull a week of requests from your highest-volume agent endpoint. Categorise the requests by task type. If sixty percent or more fall into one or two narrow categories, you are paying frontier prices for a problem a finetune solves better.
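
A minimal sketch of that check, assuming each logged request has already been labelled with a task type. The labels, the sample counts, and the narrow_share helper are all made up for illustration; only the sixty percent threshold comes from the text above.

```python
from collections import Counter


def narrow_share(task_types, top_n=2):
    """Fraction of requests covered by the top_n most common task types."""
    counts = Counter(task_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


# Hypothetical labels for a week of requests; in practice these come from logs.
week = ["classify"] * 50 + ["extract"] * 25 + ["chat"] * 15 + ["plan"] * 10

share = narrow_share(week)
print(f"top-2 categories cover {share:.0%} of traffic")
if share >= 0.60:
    print("candidate for a finetuned local model")
```

The hard part is the labelling, not the counting. A rough keyword pass over the logs is usually enough to see whether the distribution is lopsided.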

The frontier model will still be in your stack. It just shouldn’t be in your hot path.

Customer support is product research wearing a costume

Mon, 25 May 2026 00:00:00 GMT

I answered support tickets for the first six months of the company. Then I stopped, and within a quarter we had shipped three features nobody asked for. The inbox knew what to build. I had stopped reading it.

Support is the cheapest, highest-signal product research function a small company has. Customers tell you precisely what’s wrong, on their own time, in their own words. You get the equivalent of a paid research panel, users willing to write down what frustrates them, for free. Then most teams hand the inbox to a CS hire whose job is closing tickets fast, and the founder never reads it again.

What you lose when you outsource it

Three things, in order of which one hurts first:

The wording. Customers describe problems in language your team doesn’t use. That language is also how prospects describe the problem on calls. You lose the vocabulary if you only see closed-ticket summaries.
The frequency. “How often does X come up” is the question every roadmap argument turns on. The CS team knows the answer; the founder learns it third-hand and gets it wrong.
The shape of the bad questions. The questions that “shouldn’t” be there — “how do I X” when X is documented — are the ones where your docs failed. You only see those if you read tickets.

What the founder should actually do

Read tickets for an hour every Friday. Not closing them — reading them. Look for: the same question asked three different ways, the workaround a customer described that’s better than your intended flow, the feature request that’s actually a bug they’re working around.

The CS team can close the tickets. The founder reads the inbox. These are not the same job.

The honest part

It’s tempting to stop because tickets are unpleasant. Some are angry. Some are confused. Some are written badly. That’s the job. The unpleasantness is the signal. A company where the founder thinks “support has it under control” is a company that’s shipping the wrong things, on time.

The eval you don't have, production already wrote for you

Sat, 23 May 2026 00:00:00 GMT

I asked a team how they evaluated their agent. They had a deck. The deck had four slides about MMLU and one chart of an internal benchmark with twelve examples on it. I asked how the agent was doing. They pointed at the chart. I asked how it was doing in production. There was a pause.

The answer was in their support inbox. Forty tickets a week, half of them screenshots of the agent saying something wrong, the other half of the agent doing something wrong. None of those tickets had ever been turned into an eval. They were rotting in a Slack channel where they’d be deleted in ninety days.

The dataset you already own

Three sources, in order of how cheaply you can mine them:

Human-edited outputs. Every time a user accepted, then edited, the agent’s draft, you have a labelled example: input → bad output → corrected output. Capture both halves.
Thumbs-down with comments. The thumb is noise. The comment is gold. Most teams aggregate the thumbs and throw away the text.
Re-asks. When a user immediately re-prompts after the agent answered, that’s a failure signal too. Log them. Group them. Ten re-asks of the same shape is a category.
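
As a rough sketch of the re-ask signal: the code below scans a conversation log for a user prompt, an agent answer, and a second user prompt that arrives shortly after. The Turn record, its field names, and the sixty-second window are assumptions for illustration, not anything a particular logging stack provides.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    session: str
    role: str    # "user" or "agent"
    text: str
    ts: float    # seconds since the start of the log


def find_reasks(turns, window_s=60.0):
    """Return (prompt, answer, re_prompt) triples where the user re-prompted
    within window_s seconds of the agent's answer in the same session."""
    by_session = {}
    for turn in sorted(turns, key=lambda t: (t.session, t.ts)):
        by_session.setdefault(turn.session, []).append(turn)

    found = []
    for seq in by_session.values():
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            roles = (a.role, b.role, c.role)
            if roles == ("user", "agent", "user") and c.ts - b.ts <= window_s:
                found.append((a.text, b.text, c.text))
    return found


log = [
    Turn("s1", "user", "cancel my plan", 0),
    Turn("s1", "agent", "Here is how to upgrade", 5),
    Turn("s1", "user", "no, CANCEL my plan", 20),
    Turn("s2", "user", "hi", 0),
    Turn("s2", "agent", "hello", 2),
    Turn("s2", "user", "thanks", 500),
]
for prompt, answer, retry in find_reasks(log):
    print(f"{prompt!r} -> {answer!r} -> {retry!r}")
```

Each triple is a candidate eval case: the first prompt is the input, the agent's answer is the known-bad output, and the re-prompt hints at what the user wanted. Grouping the triples by shape is still manual work.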

Why nobody does this

Because there is no glamour in writing eval cases from a support queue. It feels like data entry. It is data entry. And it’s the only way your evals will ever look like the requests your users actually send, instead of the requests your team made up at an offsite.

The team with the synthetic benchmark and the great-looking dashboards is not the team with the working agent. The team with the working agent has two thousand messy real-world cases pulled from production every week, and someone whose job is to keep that pipeline alive.

Your customers wrote your test set. You just have to take it.

Observability is a tax you pay before you owe it

Fri, 22 May 2026 00:00:00 GMT

The first incident is the one where you discover you don’t have logs. The second is the one where you discover the logs you have aren’t searchable. By the third, you have observability, and a story about how expensive it would have been to not have it.

Observability is a tax. It costs money in tooling, time to instrument, and discipline to keep current. There is no quarter where adding tracing to a new service is the feature the customer asked for. There is no demo for a histogram of p99 latencies. The work is unglamorous, and like most unglamorous work it pays in incidents you don’t have.

The minimum you owe yourself

Three things, in the order I’d add them to a fresh project:

A request ID that travels. Every request gets one. Every log line carries it. Every error carries it. The first thing a customer sends you when something is wrong is a screenshot — the request ID is what turns the screenshot into a trace.
One graph that says “is the site working right now.” Not ten. Not a dashboard with twenty panels. One number a non-engineer can look at and answer yes or no. The rest is debugging support.
An alert on user-visible failure, not on system noise. “Error rate above 1%” with a sane window. Not “CPU above 70%” — that wakes you up for a thing your customers don’t care about.
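
The first item, a request ID that travels, can be sketched with nothing but the standard library. This is one possible shape, assuming a single-process Python service; make_logger, handle_request, and the log format are illustrative names rather than a prescribed layout.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled; "-" outside a request.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def make_logger(name, stream=None):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, work):
    """Assign a fresh ID, run the work, and return the ID to show the user."""
    rid = uuid.uuid4().hex
    token = request_id.set(rid)
    try:
        logger.info("request started")
        work()
    except Exception:
        logger.exception("request failed")  # the error line carries the ID too
    finally:
        request_id.reset(token)
    return rid


buf = io.StringIO()
log = make_logger("demo", buf)
rid = handle_request(log, lambda: log.info("charging card"))
print(buf.getvalue(), end="")
```

Returning the ID from the handler is the part that matters for support: if it ends up in the error page, it ends up in the screenshot.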

What you can wait on

Three things, in roughly the order most teams obsess over and shouldn’t:

Tracing every internal call. You don’t need spans inside spans inside spans. You need the request boundary, the database boundary, and the external-API boundary.
Custom metrics for every business event. Add them when an incident shows you that you needed one. Anticipating is more expensive than the incident.
The fancy SaaS tier. Most teams hit revenue-relevant problems on the free tier. The paid features are real, but they’re cheaper to add later than to maintain early.

The trick is to instrument early enough that you have ground to stand on during an incident, and late enough that you didn’t waste six months gold-plating telemetry for a service that got rewritten.

One worktree per agent

Wed, 20 May 2026 00:00:00 GMT

Watched someone’s coding-agent demo last month. Six “agents” working in parallel on the same codebase, supposedly. Three of them were stepping on the same files. One was waiting on another’s edit to compile. Two were debating whose change to keep. The aggregate throughput was lower than one engineer working quietly.

What was missing was the cheapest piece of plumbing in git: git worktree add. A worktree gives each agent an independent checkout of the same repo on the same disk, sharing the object store, isolated in everything that matters. Nothing about the tooling around AI coding has caught up to this yet, even though the primitive has shipped with git since version 2.5, more than a decade ago.

What changes when each agent owns a tree

Three things, in order of how often they break a swarm:

No file-level conflicts. Two agents can edit the same file in different trees and the merge happens at PR time, not at edit time.
No build-state interference. One agent’s cargo build doesn’t poison another’s target/. Test runs don’t race.
No “did I save?” doubt. Each tree is a real branch. The agent’s work is a commit in its own history, not a guess at what’s currently on disk.
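
A small sketch of handing each agent its own tree. The git invocation itself (git worktree add -b <branch> <path> <base>) is standard; the sibling-directory layout, the agent/<name> branch naming, and the helper names are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main"):
    """Build the git command that gives `agent` its own checkout and branch."""
    tree = repo.parent / f"{repo.name}-{agent}"  # e.g. /srv/app-a1 next to /srv/app
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{agent}",  # new branch owned by this agent
        str(tree),
        base,                    # commit-ish the new branch starts from
    ]


def create_worktree(repo: Path, agent: str, base: str = "main") -> Path:
    """Run the command and return the path of the new worktree."""
    cmd = worktree_command(repo, agent, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])


print(" ".join(worktree_command(Path("/srv/app"), "a1")))
```

Keeping the command builder separate from the runner makes the layout easy to review before anything touches the disk.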

The merge step is the actual product

The reason most parallel coding setups underperform is not the agents. It’s that nobody built the merge. Three independent worktrees that all open three independent PRs and then sit there waiting for a human is not a swarm — it’s a backlog.

If you want parallel coding agents to be more than a demo, the missing piece is an automated merge agent. It reads the three PRs, identifies overlap, asks the model to resolve textual conflicts, runs the test suite, and either merges or hands you one synthesised PR with the open questions surfaced.
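
Only the "identifies overlap" step is sketched here; resolving conflicts and running tests are out of scope. The input is assumed to be a mapping from branch name to the set of files that branch changed, which could be collected with something like git diff --name-only main...<branch>. The branch names and paths are made up.

```python
from itertools import combinations


def overlapping_files(changed):
    """changed maps branch name -> set of paths that branch touched.
    Returns {(branch_a, branch_b): shared_paths} for every pair that overlaps."""
    overlaps = {}
    for a, b in combinations(sorted(changed), 2):
        shared = changed[a] & changed[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


prs = {
    "agent/a": {"src/auth.rs", "src/lib.rs"},
    "agent/b": {"src/lib.rs"},
    "agent/c": {"README.md"},
}
for (a, b), paths in overlapping_files(prs).items():
    print(f"{a} and {b} both touch: {sorted(paths)}")
```

Branches with no overlap can be merged independently, so the expensive model-assisted resolution only needs to run on the pairs this returns.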

Until that exists, three agents in three worktrees are the same throughput as one agent that knows how to commit.

Your first hire is a multiplier or a manager. Pick.

Mon, 18 May 2026 00:00:00 GMT

The two failure modes of first hires are “the second me” who duplicates what I already do, and “the senior person” who manages me instead of working. The good first hire does neither. They take work I can’t already do, and they refuse to manage.

This is not a hot take. It’s the version of every founder-hire blog post that founders ignore because they have a specific candidate in mind and the candidate sounds great. Three months later they’re paying twice for the same skill and meeting four times a week to discuss why.

The multiplier hire

A multiplier extends the surface area of what the company can do. You’re a backend engineer; they’re a designer-engineer who can ship a landing page and a Figma in a week. You’re a closer; they’re an integrator who can do the post-sale handoff. You can do half their job, badly. They can do half of yours, decently. The Venn overlap is small, and that’s the point.

The check: ask the candidate to describe a project they did entirely alone, end to end. If the answer fits your current gap, hire. If the answer sounds like “I led a team that…”, stop.

The manager hire

A manager hire is correct exactly once: when you already have three people doing different jobs, you can’t be the bottleneck for all of them, and a person whose primary skill is unblocking other people will replace more of your week than they cost. That moment is later than founders think. It is usually employee five or six, not two.

The check: ask yourself how many people the manager hire would manage on day one. If the answer is “you and one contractor,” the answer is “not yet.”

What to do instead

Hire the second person who can ship something you can’t. Hire the third the same way. Resist the “we need a head of X” instinct until you can point at three people whose work would give that person a full week of coordination.

By then you’re not making the first hire anymore. You’re making the fifth. The decision is much easier when you’re not lonely.

The output token tax

Sat, 16 May 2026 00:00:00 GMT

I looked at last quarter’s bill for an agent that summarises ticket queues for a forty-person engineering team. The input cost was flat against the year before. The output cost had tripled. Nothing about the workload had changed.

Two things happened to LLM pricing in the last twelve months. Input prices kept dropping as providers competed for context-window land. Output prices stopped dropping, and on some tiers went up. Reasoning models charge separately for the thinking tokens you never see. The asymmetry is now five to ten times: at the wide end, a million tokens in costs ten cents and the same million out costs a dollar.

Most agent code was written when the two sides cost roughly the same. So the prompts ask the model to “think step by step” and “explain your reasoning” and “return the answer in JSON with a rationale field”. Each of those instructions is a tax on the most expensive part of the bill, paid every call, forever.

Where the tax is hiding

Three places, in roughly descending order:

Reasoning fields you never read. The agent fills in rationale so the prompt feels rigorous. Nobody reads it downstream. Delete the field.
Restated context. “Based on the user’s request to update their billing address, I will…” — the model is parroting the input back at output prices. Tell it not to.
Verbose tool plans. “First, I will call get_user. Then I will call update_address. Then…” Multi-tool reasoning loops generate output before each call. Move the planning to a single extract_intent call and let the executor be silent.

The diagnostic

Print the average output tokens per call from your fleet. If it’s bigger than 200, you have prose where you wanted JSON. If it’s bigger than 1,000, you have an essay generator.
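
The check is a few lines once per-call output token counts are available. A sketch, assuming the counts have been pulled from usage logs into a list; the function name and verdict strings are illustrative, and the 200 and 1,000 thresholds come from the paragraph above.

```python
from statistics import mean


def diagnose_output(output_token_counts):
    """Return (average, verdict) for a non-empty list of per-call output token counts."""
    avg = mean(output_token_counts)
    if avg > 1000:
        verdict = "essay generator"
    elif avg > 200:
        verdict = "prose where you wanted JSON"
    else:
        verdict = "ok"
    return avg, verdict


# Hypothetical sample; real numbers come from the provider's usage fields.
avg, verdict = diagnose_output([310, 450, 280, 520])
print(f"average output tokens per call: {avg:.0f} ({verdict})")
```

An average hides a long tail, so it is worth looking at the largest few calls as well before deciding which prompt to cut first.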

Cut prompts. Tighten schemas. Stop asking models to think out loud unless you want a transcript. The bill rewards brevity now in a way it didn’t a year ago.

Postgres is a queue. Stop reaching for Kafka.

Fri, 15 May 2026 00:00:00 GMT

A team I know spent six weeks operationalising Kafka for a workload doing two hundred messages per second. Three nodes. ZooKeeper. Schema registry. Two engineers who had not used it before, becoming the on-call for it. They were proud. They had a job-queue product.

Postgres did the same job in forty lines of code. SELECT ... FOR UPDATE SKIP LOCKED has been available since Postgres 9.5. Throughput on a t3.medium is in the low thousands of messages per second without breaking a sweat. Delivery is effectively exactly-once for any work done inside the same transaction, because claiming the job and recording its result commit or roll back together; side effects outside the database are still at-least-once. The thing you already run and already pay for can do this.
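
A minimal sketch of the pattern, not the team's actual forty lines. It assumes a hypothetical jobs table with id, payload, and status columns, and a DB-API connection such as one from psycopg2, with autocommit off so the first statement opens a transaction.

```python
# Claim the oldest pending job; rows locked by other workers are skipped, not waited on.
CLAIM_SQL = """
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE jobs SET status = 'done' WHERE id = %s"


def process_one(conn, handler):
    """Claim one job, run handler(payload), and mark it done in one transaction.
    Returns True if a job was processed, False if nothing was pending."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return False
        job_id, payload = row
        handler(payload)  # any writes made through this connection join the transaction
        cur.execute(DONE_SQL, (job_id,))
        conn.commit()
        return True
    except Exception:
        conn.rollback()  # releases the row lock; the job is claimable again
        raise
    finally:
        cur.close()
```

A polling daemon is then a loop that calls process_one and sleeps briefly whenever it returns False. If the handler calls an external API, that call can repeat after a rollback, so it needs to be idempotent.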

When Postgres is enough

Three conditions, in roughly decreasing order of generosity:

Throughput under a thousand events per second. Most internal queues live well below this line, so the “we might scale” argument fails on the math: a tenfold growth still fits.
Producers and consumers are on the same database. If the consumer needs to update other tables when it processes, doing it in one transaction is the killer feature you give up by moving the queue out.
You can tolerate a Postgres outage taking the queue down too. For most internal workloads, your app already can’t survive a Postgres outage. The queue going with it doesn’t change your availability story.

When you actually need Kafka

Three real reasons, in order of how rarely they apply:

Multi-consumer fan-out with replay. You want N independent consumers each reading the same events, with the ability to rewind. Postgres can simulate this badly; Kafka does it natively.
Sustained throughput over ten thousand events per second with low producer-side latency requirements.
You already have a team running Kafka. This is the most honest reason and the one nobody says out loud.

If none of those apply, you do not need Kafka. You need a jobs table, a small daemon that polls it, and a Sunday afternoon.

The point isn’t that Kafka is bad. It’s that the boring option is usually load-bearing in ways the exciting option won’t be for another two years. By then the team that picked boring shipped twice.

Every agent should have a passport

Wed, 13 May 2026 00:00:00 GMT

We onboard a junior engineer with an email, an MFA token, a scoped role, and a manager who has to sign off when they touch production. Then we deploy an LLM agent with a system prompt and an API key and wonder why audit logs look like a haunted house.

We have spent fifteen years arguing about identity for humans. Every security conference has a slide on it. Every IAM team has a whiteboard with WHO IS DOING THIS THING? underlined twice.

Then we shipped agents. The agent has no employee ID. No role. No manager. It runs under the access of whichever engineer set the env var. When it does something wrong, the closest thing we have to attribution is a Slack channel called #incidents.

What a passport is

Four things, at minimum:

An identity that isn’t borrowed. Not the dev’s account. Not “the key in the env file”. A first-class identity in your IAM, with rotation and revocation built the same way they are for humans.

A scope. Right now most agents carry the equivalent of a diplomatic passport — they go anywhere, talk to anything, write to any bucket. We give them this because permissions are tedious to wire up. Then we are surprised when one of them drops the wrong table.

A stamped trail. Every action the agent takes leaves a record that survives outside its own context window. The model can hallucinate what it did. The audit log cannot.

A consul. Someone — human or a higher-trust agent — who authorizes the actions that matter. Reads flow. Writes ask. Money moves on paper.
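
To make the four pieces concrete, here is a toy sketch of a passport check. It is an illustration under stated assumptions, not an IAM design: the Passport class, the resource:action scope strings, and the in-memory trail list are all hypothetical, and a real trail would be written to storage the agent cannot edit.

```python
from dataclasses import dataclass, field


@dataclass
class Passport:
    agent_id: str                              # its own identity, not a borrowed account
    scopes: frozenset                          # e.g. {"tickets:read", "tickets:write"}
    trail: list = field(default_factory=list)  # stand-in for an external audit log


def authorize(passport, action, resource, approver=None):
    """Reads flow, writes ask: reads need only scope, anything else needs an approver.
    Every decision, allowed or not, is appended to the trail."""
    in_scope = f"{resource}:{action}" in passport.scopes
    needs_approval = action != "read"
    allowed = in_scope and (not needs_approval or approver is not None)
    passport.trail.append({
        "agent": passport.agent_id,
        "action": action,
        "resource": resource,
        "approver": approver,
        "allowed": allowed,
    })
    return allowed


p = Passport("agent-7", frozenset({"tickets:read", "tickets:write"}))
print(authorize(p, "read", "tickets"))                     # in scope, no approval needed
print(authorize(p, "write", "tickets"))                    # in scope, but nobody signed off
print(authorize(p, "write", "tickets", approver="alice"))  # in scope and approved
print(authorize(p, "read", "billing"))                     # out of scope
```

Denied attempts are recorded as well, since those are often the most useful lines in the trail.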

The two objections you’ll hear

The first is that an agent is “just a script”, and scripts don’t need passports. This is two years out of date. The script doesn’t write its own instructions; the agent does. The script can’t be socially engineered through a forwarded email; the agent can.

The second is that all this is “too heavy for prototypes”. Sure. Most things are too heavy for prototypes. But the point of the passport metaphor isn’t to slow the agent down. It’s to keep you from retrofitting identity onto code that was never written with it in mind.

We did that once already. Fifteen years of human identity bolted onto systems that originally assumed every user was the operator. It was expensive. Most of us hated it.

We don’t have to do it twice.

Pricing pages are written for people who won't buy

Mon, 11 May 2026 00:00:00 GMT

I spent two weeks rewriting our pricing page. Conversion didn’t move. What moved conversion was rewriting the sales email — the document everyone who actually paid had already read by then.

This was embarrassing to admit. The pricing page is the thing every founder optimises because it’s public, trackable, and feels load-bearing. The sales email is the thing every founder neglects because it’s per-account, dull, and not something you can A/B test cleanly. Both are documents about price. Only one of them is read by people who buy.

What a pricing page is actually for

Three audiences, in the order they’re usually weighted wrong:

People who will never buy. The biggest visitor cohort. They came from a tweet. They will spend nine seconds. The page exists for them mostly to be skim-able and not embarrassing.
People who already decided to buy and need a number. They want the price. They want to know if you charge per-seat or per-usage. They do not want a “talk to sales” CTA in the way of that.
People who almost decided to buy but lost momentum. Smallest cohort. They came back from a thread, a Slack rec, a saved tab. They need one more reason. This is where pricing-page copy can earn money — and where most pages just repeat the homepage.

What actually converts

For B2B SaaS at a small scale, the document that converts is almost never the pricing page. It’s:

An onboarding email that’s specific to what they signed up for
A pricing rationale paragraph in the trial-end email that addresses why-not-cheaper questions
A “here’s what’s included in your tier” page that exists inside the product, after login
A short PDF a customer can forward to their finance team

None of these are public. None of them get blog posts written about them. All of them outperform the pricing page for actual revenue.

What I’d do differently

Spend the two weeks rewriting your highest-traffic post-signup email. Read the actual replies your customers send when they ask “what does this cost.” Write the pricing page last, and make it small.

The 3-person team is the new 50-person team

Sat, 09 May 2026 00:00:00 GMT

A friend’s startup did four million in revenue last year with three engineers, one designer, and a contractor who handles taxes. Their main competitor did six million with forty-two people. The competitor has more meetings, more brand, and a head of marketing whose entire job is justifying the head of marketing. My friend’s team has none of that. They also have about 5.6× the revenue per person, counting the contractor as headcount.

A ratio in that range keeps showing up, most often quoted as 5.7×. There was a Hacker News thread earlier this year with hundreds of CTOs in the comments arguing about it. The lean AI-native startups average $3.48M revenue per employee. Traditional SaaS averages $610K. The gap is not a rounding error. The gap is the entire game.

What changed

The piece of work that used to fund a full headcount — design, then frontend, then backend, then DevOps, then docs, then support — now fits in a senior engineer’s afternoon. Some of it is AI doing the boring parts. Some of it is the boring parts having been deleted, because AI made the alternative cheap enough that we stopped pretending the boring parts mattered.

This will sound triumphant if you are the senior engineer in question. It will sound bleak if you were the docs hire. Both reactions are correct. The story is not “AI made everyone more productive”. The story is “AI made the people in the middle redundant, and the people at the edges twice as valuable as before.”

What you should be hiring like

If you are starting something now, your first hire is a person who can do two adjacent roles competently and a third one passably. Design Engineer. Eng-PM. Ops who can write Rust. The hand-off mode — designer hands Figma to engineer, engineer hands ticket to QA — is the most expensive line item on your P&L and you don’t know it because the bill is paid in calendar invites.

If you are forty-two people already, the news is harder. You can’t refactor your way down to three. But you can refactor your way down to twelve, and twelve is enough to win a market that used to require fifty.

What you should be careful about

Three-person teams beat fifty-person teams on revenue per head. They lose on three other things that matter: institutional memory, hiring depth, and the ability to absorb a single resignation.

The honest take is that the small team is a higher-variance bet. Most three-person teams die. The ones that don’t are the comparison everyone uses, which is survivorship bias dressed up as a business model.

So: bet small if you can carry the variance. Hire wider if you can’t.

Your prompt isn't cache-shaped

Fri, 08 May 2026 00:00:00 GMT

I asked a team what their prompt cache hit rate was. The tech lead said “we cache responses, right?” That isn’t what cache means here, and that single misunderstanding was costing them about fourteen thousand dollars a month.

Prompt caching has been generally available across the major providers for over a year. The discount is real — typically ninety percent off cached tokens after the first hit, sometimes more. The catch is that almost nobody designs their prompts to be cacheable, so almost nobody gets the discount.

The rule is mechanical: caches match by prefix, not by content. If the first token of two requests differs, neither hits cache. If you put a timestamp at the top of the system prompt — “Today is May 19, 2026” — every single call is a cold cache. If you interleave user history before the system prompt, every conversation is a cold cache. The expensive bits — the system prompt, the tool definitions, the few-shot examples — are exactly the bits you want frozen at the start of every request.

What cache-shaped looks like

Three rules cover most of it:

Static block first. System prompt, role, formatting rules, tool definitions, few-shot examples — all of it before anything that varies. Order matters. A 4 KB static prefix that changes once a quarter is an asset; one buried under the user message is dead weight.
Dynamic block last, and clearly delimited. User input, current state, retrieved context — at the end, in a single block, so the cache boundary is obvious to humans reviewing the prompt later.
No clocks, no UUIDs, no random salts in the prefix. They look harmless. They cost you the entire month’s caching benefit.
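
A sketch of the three rules as a prompt builder. The prompt strings, the <dynamic> delimiter, and the fingerprint helper are assumptions for illustration; how a given provider marks the cache boundary is specific to that provider and not shown here.

```python
import hashlib

# Static block: changes rarely, so it always goes first and stays byte-identical.
SYSTEM_PROMPT = "You are a support triage assistant. Reply with JSON only."
TOOL_DEFS = '[{"name": "get_user"}, {"name": "update_address"}]'
FEW_SHOT = 'Q: Where is my invoice?\nA: {"category": "billing"}'


def build_prompt(user_input, retrieved_context, today):
    """Return (static, dynamic). Anything that varies, including the date,
    lives in the dynamic block at the end."""
    static = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFS, FEW_SHOT])
    dynamic = "\n".join([
        "<dynamic>",
        f"date: {today}",
        f"context: {retrieved_context}",
        f"user: {user_input}",
        "</dynamic>",
    ])
    return static, dynamic


def prefix_fingerprint(static):
    """Short hash of the static block. If it differs between two calls,
    the second call cannot reuse the first call's cached prefix."""
    return hashlib.sha256(static.encode("utf-8")).hexdigest()[:12]


s1, d1 = build_prompt("Where is my refund?", "order 1182", "2026-05-08")
s2, d2 = build_prompt("Change my address", "order 2290", "2026-05-09")
print(prefix_fingerprint(s1) == prefix_fingerprint(s2))
```

Logging the fingerprint alongside each call is a cheap way to spot the moment someone adds a clock or a salt to the prefix.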

The diagnostic

Look at your cache hit rate. If it’s under seventy percent on a high-volume agent, your prompt isn’t cache-shaped. The fix is usually a one-day refactor of the prompt template plus a tweak to your retry/log code so it doesn’t perturb the prefix.

The savings won’t show up as a dramatic moment. They’ll show up as next month’s bill being a third of last month’s, and nobody noticing because the dashboards weren’t tracking it. Track it.

The cheapest dependency is the one you delete

Wed, 06 May 2026 00:00:00 GMT

I removed a logging library from our build last quarter. The change was forty lines of deletes and one new helper function. Build time dropped twelve seconds. Bundle dropped four hundred KB. The on-call rotation forgot it existed. We had been paying rent on it for three years, and the only thing it did better than console.log was come with a JSON schema and a security advisory every six months.

The cheapest dependency is the one you delete. This is so obvious it sounds like a fortune cookie. It also describes work that almost nobody does, because the work has no story attached to it. There is no Friday demo for “I removed a thing.” There is no LinkedIn post for “I subtracted.” The marginal hire optimises for what they can add to a codebase. The codebase optimises for what it can remove.

Why deletions don’t happen

Three reasons, in rough order of which is the most embarrassing:

Nobody owns it. The person who installed the dependency is gone. The person inheriting it isn’t sure if removing it is safe. The middle option — leaving it — never gets second-guessed in a code review.
The shell game of “we might need this someday.” We don’t. We don’t. We will not. The thing that justifies keeping a library “for the future” almost always turns out to be a five-line function you write fresh when the future arrives.
Deletion looks like nothing. The PR is small. The diff is mostly red. There is no feature attached. Reviewers approve it slower than feature work because there’s nothing visible to validate against.

What it costs you

Each dependency is a small ongoing tax: surface area for CVEs, time spent on version-pin bumps, build slowness, mental load when reading the codebase (“what does this import do?”). Most of these are too small to feel individually. They aggregate.

A team I worked with took two engineers’ time for a week to do nothing but subtract. They removed eighteen npm packages, two Postgres extensions, and an entire microservice. They didn’t ship a feature that week. They got two minutes back on every build for a year.

It is a real piece of work. It is rarely a glamorous one.

You're optimizing the wrong axis of agent cost

Mon, 04 May 2026 00:00:00 GMT

I watched a team spend a quarter dropping their per-call latency by 40%. The bill kept going up. They were proud of the latency win — the dashboards were beautiful — but they were optimizing the wrong axis. Their bottleneck was not speed. It was that they were using a frontier model to do regex.

There are three axes most agent fleets can be tuned on. In rough order of how often teams pick the wrong one:

Latency — how fast a single call returns.
Accuracy — how often the call is right.
Routing — which model handles the call in the first place.

Teams obsess over latency and accuracy, because they’re the axes the dashboard renders. The axis that actually moves the bill is routing.

The shape of the actual cost

Run a profiler on your agent fleet. Not a latency profiler — a token profiler that tells you how many tokens are flowing through which model at which price tier.

The shape will surprise you. In every fleet I have looked at, somewhere between 60% and 80% of the volume is doing things that an old GPT-3.5-class model could have done for one-twentieth the cost. Format a response. Pick from a list. Confirm a yes/no. Restate a parameter. The frontier model is invoked because the team didn’t want to write a router, and the router is what would have saved them.
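
A token profiler in this sense can be very small. The sketch below assumes call records of (model, task, input tokens, output tokens) pulled from logs; the model names, task labels, and per-million prices are placeholders, not real list prices.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens: (input, output).
PRICES = {
    "frontier": (3.00, 15.00),
    "small": (0.10, 0.40),
}


def profile(calls):
    """calls: iterable of (model, task, input_tokens, output_tokens).
    Returns {(model, task): dollars spent}."""
    spend = defaultdict(float)
    for model, task, tokens_in, tokens_out in calls:
        price_in, price_out = PRICES[model]
        spend[(model, task)] += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return dict(spend)


calls = [
    ("frontier", "format", 1000, 100),
    ("frontier", "format", 1000, 100),
    ("small", "yes_no", 500, 5),
]
for (model, task), dollars in sorted(profile(calls).items(), key=lambda kv: -kv[1]):
    print(f"{model:9s} {task:8s} ${dollars:.6f}")
```

Sorting by spend puts the routing candidates at the top: simple tasks that show up next to an expensive model.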

What good routing looks like

A good router is boring. It’s a function. The function reads the request and decides which model gets it. Sometimes it’s a small finetune. Sometimes it’s a regex. Sometimes it’s a single line of code that says if intent in {"yes", "no"}: return cheap_model.invoke(prompt).
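
Spelled out slightly further, such a function might look like the sketch below. The tier names, the intent labels, and the assumption that an upstream step has already tagged the intent are all illustrative.

```python
import re

# Bare yes/no replies, with optional trailing punctuation, need no model at all.
YES_NO = re.compile(r"^\s*(yes|no|y|n)\s*[.!]?\s*$", re.IGNORECASE)

# Hypothetical intents that a small model or finetune handles well.
CHEAP_INTENTS = {"format", "classify", "extract"}


def route(request_text, intent=None):
    """Return the name of the tier that should handle this request."""
    if YES_NO.match(request_text):
        return "regex"
    if intent in CHEAP_INTENTS:
        return "cheap_model"
    return "frontier_model"


print(route("Yes."))
print(route("tidy this JSON", intent="format"))
print(route("design a migration plan for our billing tables"))
```

The default branch is the frontier model on purpose: an unrecognised request costs more but is not answered worse.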

A bad router is “we’ll add caching later”. Caching is not routing. Caching helps when you ask the same question twice. Routing helps when you ask different questions of different difficulty. The bill is shaped by the second case, not the first.

The diagnostic

Open last month’s API invoice. Divide the total cost by the number of agent invocations. If the number is bigger than half a cent, you are probably calling a frontier model for things that did not need one.

If the number is bigger than five cents, your fleet is on fire and you are paying for the smoke.