The Overconsumption Stack: Why Your AI Features Cost 10x What They Should

Jordan Saunders · Jun 24, 2026

Most AI features in production today are overpaying for intelligence the same way a teenager overpays for car insurance. Not because of one big mistake, but because of a stack of small defaults nobody went back to question.

In the last article I made the case that AI economics break the forecasting models finance teams rely on, because agentic systems choose their own token budget at runtime. This one is for the people who have to fix it. We build AI-enabled software for mid-market companies, and when we get called in to look at a feature whose costs do not make sense, we almost always find the same five defaults stacked on top of each other. I have started calling it the overconsumption stack.

Part of the AI economics series.

The Five Defaults

Default 1: An agent where a function would do. Somewhere along the way, reaching for an agent became the reflex. But most of what gets shipped as an agent is a workflow wearing a costume. If the steps are known in advance, write the steps. Deterministic code costs effectively nothing per call, returns in milliseconds, and never hallucinates. The model should be reserved for the points in the flow where judgment is genuinely required, which in most business software is one or two places, not twelve.
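A minimal sketch of what "write the steps" can look like, using a refund flow as a made-up example. The function name `handle_refund`, the order fields, the policy constants, and the `judge_note` callback (a stand-in for whatever model call you use) are all assumptions for illustration, not a prescribed design.

```python
from typing import Callable

REFUND_WINDOW_DAYS = 30      # assumed policy, for illustration only
AUTO_APPROVE_LIMIT_USD = 20  # assumed policy, for illustration only


def handle_refund(order: dict, judge_note: Callable[[str], str]) -> str:
    """Run the known steps as plain code; call the model only for the free-text note."""
    # Step 1: a fixed policy rule. No model needed.
    if order["days_since_purchase"] > REFUND_WINDOW_DAYS:
        return "reject: outside refund window"
    # Step 2: another fixed rule. Small refunds with no note are approved outright.
    if order["amount_usd"] <= AUTO_APPROVE_LIMIT_USD and not order["note"]:
        return "approve: below auto-approve limit"
    # Step 3: the one point where judgment is required, so the only model call.
    return judge_note(order["note"])
```

Two of the three branches never touch the model, so in this sketch a model call is only paid for when there is a customer note to interpret.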

Default 2: A frontier model for every call. Teams prototype on the biggest model because it is the fastest way to prove the idea works, and at that stage it is the right call. Then the prototype ships and nobody revisits the choice. Classifying a support ticket does not require the same intelligence as drafting a legal summary, and pricing between model tiers differs by an order of magnitude or more. Route by task difficulty. Most production traffic is easy.
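Routing does not have to be sophisticated to pay off. The sketch below is one simple way to do it, assuming you can label each call with a task name. The tier names, the `EASY_TASKS` set, and the token threshold are placeholders rather than real model identifiers or recommended values.

```python
# Placeholder tier names; substitute whatever models you actually use.
MODEL_TIERS = {"easy": "small-model", "hard": "frontier-model"}

# Hypothetical task labels that a cheaper model is assumed to handle well.
EASY_TASKS = {"classify_ticket", "extract_fields", "detect_language"}


def pick_model(task: str, input_tokens: int, long_input_threshold: int = 8000) -> str:
    """Send known-easy tasks with short inputs to the cheap tier, everything else up."""
    if task in EASY_TASKS and input_tokens <= long_input_threshold:
        return MODEL_TIERS["easy"]
    # Unknown tasks default to the stronger model, so a missing label
    # costs money rather than quality.
    return MODEL_TIERS["hard"]
```

Defaulting unknown tasks to the expensive tier is a deliberate choice here: the easy list then grows only as evals confirm that the cheaper model holds up on a given task.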

Default 3: No caching. A surprising share of inference spend is the same questions getting answered over and over. Users phrase things similarly, workflows repeat, documents get reprocessed. If you are not caching at the prompt level and reusing results where inputs have not changed, you are paying retail for answers you already bought.
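As a rough illustration, here is an exact-match result cache keyed on the model, the prompt version, and a normalized copy of the input. The `ResultCache` class and the `call_model` callback are hypothetical, the store is an in-memory dict rather than a shared cache, and the lowercase-and-collapse-whitespace normalization is an assumption that would be wrong for case-sensitive tasks.

```python
import hashlib
import json
from typing import Callable


class ResultCache:
    """In-memory exact-match cache for model results (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt_version: str, user_input: str) -> str:
        # Normalize case and whitespace so trivially different phrasings collide.
        normalized = " ".join(user_input.lower().split())
        payload = json.dumps([model, prompt_version, normalized])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        prompt_version: str,
        user_input: str,
        call_model: Callable[[str], str],
    ) -> str:
        k = self.key(model, prompt_version, user_input)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        # Cache miss: pay for the call once, then reuse the result.
        self.misses += 1
        result = call_model(user_input)
        self._store[k] = result
        return result
```

Including the prompt version in the key means a prompt change invalidates old answers automatically instead of serving stale ones.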

Default 4: Unbounded behavior. No cap on agent loops, no ceiling on retries, no maximum context size. Everything works fine in the demo, and then one weird input sends the system into a spiral that burns more in an afternoon than the feature normally costs in a month. Every AI feature needs a budget and a circuit breaker, per request and per day. This is not exotic. It is the same thinking as rate limits and timeouts, applied to a new resource.
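One way to sketch a per-request and per-day budget with a step cap is shown below. The `Budget`, `RequestMeter`, and `BudgetExceeded` names are invented for this example, costs are passed in by the caller rather than computed, and the daily reset and any cross-process sharing are left out.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request or the day's spend crosses its limit."""


class Budget:
    def __init__(self, per_request_usd: float, per_day_usd: float, max_steps: int) -> None:
        self.per_request_usd = per_request_usd
        self.per_day_usd = per_day_usd
        self.max_steps = max_steps
        self.spent_today = 0.0  # resetting this each day is omitted from the sketch

    def start_request(self) -> "RequestMeter":
        return RequestMeter(self)


class RequestMeter:
    def __init__(self, budget: Budget) -> None:
        self.budget = budget
        self.spent = 0.0
        self.steps = 0

    def charge(self, cost_usd: float) -> None:
        """Record one model call, then trip the breaker if any limit is crossed.

        The call that crosses a limit has already been paid for; raising here
        stops the loop from making the next one.
        """
        self.steps += 1
        self.spent += cost_usd
        self.budget.spent_today += cost_usd
        if self.steps > self.budget.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.spent > self.budget.per_request_usd:
            raise BudgetExceeded("per-request budget exceeded")
        if self.budget.spent_today > self.budget.per_day_usd:
            raise BudgetExceeded("daily budget exceeded")
```

In an agent loop, `charge` would be called after every model call, with the exception caught at the top of the request and turned into a graceful fallback rather than a retry.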

Default 5: No cost attribution. Most companies can tell you their total API bill and nothing else. Costs are not tagged by feature, so no one can say which product surface is expensive, whether a prompt change moved the number, or which customer segment drives the tail. What does not get measured does not get managed, and inference is currently the least measured line item in software.
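Attribution mostly comes down to tagging each call when it is made. The sketch below assumes you log one record per model call with a feature name, a prompt version, and a customer segment. The `UsageRecord` fields and the prices in `PRICE_PER_MTOK` are illustrative assumptions, not real vendor pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    feature: str
    prompt_version: str
    customer_segment: str
    input_tokens: int
    output_tokens: int


# Illustrative prices in USD per million tokens; not real vendor pricing.
PRICE_PER_MTOK = {"input": 1.0, "output": 4.0}


def cost_usd(record: UsageRecord) -> float:
    """Price one call from its token counts."""
    return (
        record.input_tokens * PRICE_PER_MTOK["input"]
        + record.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_by(records: list[UsageRecord], dimension: str) -> dict[str, float]:
    """Sum cost grouped by any tag, e.g. 'feature' or 'customer_segment'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += cost_usd(record)
    return dict(totals)
```

With records shaped like this, `cost_by(records, "feature")` answers which product surface is expensive, and grouping by `prompt_version` shows whether a prompt change moved the number.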

Why This Is a Moat

None of this is glamorous, which is exactly why it works as a moat. Everyone has access to the same models now. Your competitor can call the same API you can, tomorrow, with a credit card. What they cannot copy overnight is an engineering culture where cost is a first-class metric next to latency and errors, where prompts are versioned and reviewed like code, where evals run in CI so model swaps are routine instead of terrifying, and where every feature knows what it is allowed to spend. That operational layer takes quarters to build and shows up directly in gross margin.

We have seen this pattern before. The companies that won the cloud era were not the ones with the most exotic architectures. They were the ones that brought boring operational discipline to a new platform while everyone else was getting surprise bills. The same sorting is happening with AI right now, and it is happening faster, because the spend scales with model ambition rather than server count.