There is a calculation that almost every team building AI-powered systems eventually makes. They look at the pricing page, see that the capable frontier model costs ten—sometimes a hundred—times more per token than the smaller one, and reach for the cheaper option. It is an entirely rational instinct. But it is one that, more often than not, turns out to be wrong.

The debate between using powerful large language models and using their cheaper, smaller counterparts is one of the most consequential decisions in AI engineering right now. Get it right and you can build systems that are both effective and economical. Get it wrong and you end up in an expensive loop of prompt engineering, iteration, and debugging that ultimately costs more—in money, in time, and in delayed value—than the premium model would have in the first place.


The Surface-Level Arithmetic

The appeal of cheaper models is obvious. The price gap between frontier models and their lightweight counterparts is enormous and growing. At the top end, models like Claude Opus 4 or GPT-5.5 at maximum reasoning effort cost an order of magnitude or more per million tokens than their smaller siblings. Claude Haiku 4.5 sits at roughly $1 per million input tokens and $5 per million output tokens. Frontier reasoning models can run to $75 or more per million output tokens. On a spreadsheet, the case for choosing the cheaper model looks compelling.

But this calculation makes a critical error: it counts only the model cost, and not the full cost of making the model useful.


What You Are Actually Paying For

When you choose a more powerful model, you are not simply paying for tokens. You are paying for something harder to quantify: the model’s capacity to understand ambiguous instructions correctly on the first attempt, to recover gracefully from unexpected situations, to maintain coherent reasoning across long and complex chains of thought.

OpenAI’s own guidance on model selection is unusually direct on this point. Their recommended workflow is explicit: “Start with the most capable model available to achieve your accuracy targets.” The reason is simple—if a model cannot hit your accuracy target, questions of cost and latency are moot. You are paying for the wrong thing entirely.

Anthropic takes a similar position. Their documentation on building effective agents notes that “the autonomous nature of agents means higher costs, and the potential for compounding errors,” and recommends routing easy, routine questions to smaller cost-efficient models while reserving capable frontier models for hard or unusual ones. The implicit acknowledgement here is significant: not all tasks can be handled by the cheaper model, and knowing which is which requires engineering judgement that itself has a cost.
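The routing idea can be sketched in a few lines. This is an illustrative sketch only: the model names, the `estimate_difficulty` heuristic, the marker words, and the 0.5 threshold are all hypothetical placeholders, not anything Anthropic or OpenAI prescribes. A real router would more likely use a trained classifier or a confidence signal than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical model identifiers, used only for illustration.
SMALL_MODEL = "small-cheap-model"
FRONTIER_MODEL = "frontier-model"

# Toy signals that a request is hard (an assumption for this sketch).
HARD_MARKERS = ("why", "compare", "trade-off", "debug", "plan")


@dataclass
class Route:
    model: str
    difficulty: float


def estimate_difficulty(query: str) -> float:
    """Crude difficulty score in [0, 1] from query length and marker words."""
    words = query.lower().split()
    length_score = min(len(words) / 50, 1.0)
    marker_score = sum(m in words for m in HARD_MARKERS) / len(HARD_MARKERS)
    return min(1.0, 0.5 * length_score + 2.0 * marker_score)


def route(query: str, threshold: float = 0.5) -> Route:
    """Send queries at or above the threshold to the frontier model."""
    difficulty = estimate_difficulty(query)
    model = FRONTIER_MODEL if difficulty >= threshold else SMALL_MODEL
    return Route(model, difficulty)
```

Even in this toy form, the judgement cost is visible: someone has to choose the signals, set the threshold, and check that hard queries are not quietly being sent to the small model.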


The Hidden Costs of Choosing Small

The real expense of using a less capable model does not appear on your API invoice. It appears in your engineers’ time, in your system’s architecture, and in the reliability of your product.

The Prompt Engineering Tax

Weaker models require more explicit, more carefully constructed prompts to produce acceptable results. Every edge case that a capable model would handle intuitively must be spelled out. Instructions that would be implicit must become explicit. What starts as a clean, readable system prompt gradually becomes a sprawling document full of special cases, worked examples, and increasingly desperate attempts to pre-empt every way the model might misinterpret a request.

Teams working with production AI systems have documented this pattern vividly. The real estate AI assistant “Lucy”, deployed by Rechat, became a case study in what happens when prompt engineering is used as a substitute for model capability. The team described what happened as “a game of whack-a-mole”: fixing one failure mode caused others to emerge. Prompts expanded into “long and unwieldy forms, attempting to cover numerous edge cases and examples.” There was, they noted, “limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.”

This is not an unusual experience. It is the natural consequence of trying to compensate in software for what the model lacks in capability.

The Evaluation Paradox

Even if you succeed in coaxing acceptable output from a cheaper model, you now need to verify that it is acceptable. This requires an evaluation pipeline—a system for testing outputs, catching regressions, and identifying failure modes. And here you encounter a particularly frustrating irony: the model best suited to evaluate complex outputs is a powerful, capable model. As AI consultant Hamel Husain puts it, building effective evals means using “the most powerful model you can afford” for critique tasks, because “it often takes advanced reasoning capabilities to critique something well.”

So you end up in a situation where you are running a cheap model in production and an expensive model in your evaluation pipeline. The cost savings from the production model are partially offset by the cost of the evaluation infrastructure—and that infrastructure itself requires ongoing engineering effort to maintain.
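A minimal sketch of such a split pipeline, assuming two hypothetical callables: `generate` stands in for the cheap production model and `judge` for the stronger model used as a critic. Neither is a real API; the point is only the shape of the harness and the regression check that someone has to maintain.

```python
from typing import Callable

# Both callables are hypothetical stand-ins for real model API calls.
Generate = Callable[[str], str]
Judge = Callable[[str, str], bool]  # (prompt, output) -> acceptable?


def pass_rate(prompts: list[str], generate: Generate, judge: Judge) -> float:
    """Fraction of prompts whose generated output the judge accepts."""
    if not prompts:
        raise ValueError("need at least one prompt")
    passed = sum(judge(p, generate(p)) for p in prompts)
    return passed / len(prompts)


def check_regression(rate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the pass rate is no more than `tolerance` below the baseline."""
    return rate >= baseline - tolerance
```

Every prompt in the eval set triggers one production call and one judge call, so the judge's price applies to the whole test suite each time a prompt or model version changes.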

The Structured Output Tax

Production AI systems almost always require structured, predictable output: JSON objects, tool calls, database queries. Weaker models fail more often on these constraints, producing malformed output that breaks downstream systems. This means adding retry logic. Validation layers. Fallback mechanisms. Each of these is engineering work that would not be necessary with a more capable model that reliably produces correct structured output in the first place.
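To make that extra machinery concrete, here is a sketch of the validation, retry, and fallback layers wrapped around a single call. The `primary` and `fallback` callables and the `REQUIRED_KEYS` schema are hypothetical stand-ins chosen for illustration; only Python's standard `json` module is real.

```python
import json
from typing import Callable, Optional

# A hypothetical stand-in for a model call: takes a prompt, returns raw text.
ModelCall = Callable[[str], str]

REQUIRED_KEYS = {"name", "price"}  # illustrative schema, not from any real system


def parse_and_validate(raw: str) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def structured_call(prompt: str, primary: ModelCall, fallback: ModelCall,
                    max_retries: int = 2) -> tuple[dict, int]:
    """Try the primary model up to 1 + max_retries times, then the fallback once.

    Returns the validated object and the total number of model calls made.
    """
    calls = 0
    for _ in range(1 + max_retries):
        calls += 1
        result = parse_and_validate(primary(prompt))
        if result is not None:
            return result, calls
    calls += 1
    result = parse_and_validate(fallback(prompt))
    if result is None:
        raise ValueError("no model produced valid structured output")
    return result, calls
```

The returned call count is the part worth watching: every retry is a paid request, so a model that fails validation often erodes its own price advantage.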


Where the Maths Gets Alarming: Agentic Workflows

All of the above concerns apply to simple question-and-answer interactions. In agentic workflows—where a model takes a sequence of actions, uses tools, and makes decisions over multiple steps—the stakes are dramatically higher.

Anthropic’s research into agentic systems makes the mathematics of this painfully clear. Consider a model that succeeds on 90% of individual steps in an agentic chain. Over a ten-step workflow, the probability that all steps succeed is not 90%—it is approximately 35%. Move that per-step accuracy to 95%, which typically requires a more capable model, and the ten-step success rate jumps to 60%. The improvement in overall task completion is far larger than the raw improvement in per-step accuracy would suggest.
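The arithmetic is easy to reproduce. The sketch below assumes every step succeeds or fails independently with the same probability. That is a simplification, since real agent steps are often correlated, but it is the assumption behind the figures quoted above.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.0%}: 10-step success {chain_success(p, 10):.1%}")
# per-step 90%: 10-step success 34.9%
# per-step 95%: 10-step success 59.9%
# per-step 99%: 10-step success 90.4%
```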

This is why model capability matters disproportionately in agentic contexts. A small improvement in the model’s reliability at each step produces a dramatic improvement in end-to-end task success. And a failure mid-chain does not just produce a wrong answer—it potentially triggers recovery logic, retries, escalations, and human review, all of which consume tokens, time, and engineering effort far in excess of what the original task required.
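One rough way to put a number on this is the expected cost of one successful run when failed runs are retried from scratch. The sketch below is a simplified model with entirely hypothetical figures: a cheap model at $0.02 per attempt, a frontier model at $0.20, and a fixed `failure_cost` standing in for recovery and human review. It also assumes a failed attempt consumes its full token cost.

```python
def expected_cost_per_success(attempt_cost: float, task_success: float,
                              failure_cost: float) -> float:
    """Expected spend to obtain one successful run, retrying from scratch.

    Attempts until the first success follow a geometric distribution, so the
    expected number of attempts is 1 / task_success and the expected number
    of failures is one fewer than that.
    """
    if not 0 < task_success <= 1:
        raise ValueError("task_success must be in (0, 1]")
    expected_attempts = 1 / task_success
    return attempt_cost * expected_attempts + failure_cost * (expected_attempts - 1)


# All figures below are hypothetical, chosen only to illustrate the shape.
cheap = dict(attempt_cost=0.02, task_success=0.90 ** 10)
frontier = dict(attempt_cost=0.20, task_success=0.95 ** 10)

for failure_cost in (0.0, 0.50):
    c = expected_cost_per_success(failure_cost=failure_cost, **cheap)
    f = expected_cost_per_success(failure_cost=failure_cost, **frontier)
    print(f"failure cost ${failure_cost:.2f}: cheap ${c:.2f}, frontier ${f:.2f}")
```

With these made-up numbers the cheap model still wins when failures cost nothing, and loses once each failure costs fifty cents. Tokens alone do not decide the comparison; the cost of handling each failure does.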

There is also what might be called the token cascade effect. A more capable model typically produces more precise, more concise outputs. A weaker model may need multiple iterations to converge on an acceptable answer, burning tokens throughout. Research from HuggingFace has noted that evaluator models tend to favour verbose outputs even when briefer ones are more correct, which means the longer, lower-quality outputs a weaker model produces can still score well in automated evaluation. More tokens, worse results.


The Case for Cheaper Models (When It Actually Works)

To be fair, the case for smaller models is not without merit—it simply requires conditions that are often more demanding than teams initially expect.

The most convincing scenario is high-volume, well-defined, narrow-scope tasks, particularly when those tasks can be solved with fine-tuning. OpenAI’s model selection guide describes a compelling case study: a fake news classification task where GPT-4o zero-shot achieved 84.5% accuracy at $1.72 per thousand articles, below the target accuracy. Fine-tuning a much smaller model (GPT-4o-mini) with 1,000 labelled examples produced 91.5% accuracy at $0.21 per thousand articles. Higher accuracy than the zero-shot baseline, at roughly an eighth of its inference cost.

The lesson from that example is not that the small model was better. It is that the small model, once fine-tuned on the specific task with high-quality examples from the frontier model, became equally capable for that specific task. The key elements—specific domain, well-defined success criterion, enough training data, engineering investment—are not always present. But when they are, the economics genuinely work.
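At volume, the per-thousand prices quoted above translate as follows. The monthly volume is a hypothetical figure, and the sketch counts inference cost only: it leaves out the one-off cost of labelling data and running the fine-tune.

```python
# Per-thousand-article prices as quoted in the example above.
COST_PER_1K = {"GPT-4o zero-shot": 1.72, "GPT-4o-mini fine-tuned": 0.21}


def cost_at_volume(cost_per_thousand: float, articles: int) -> float:
    """Inference cost in dollars for a given number of articles."""
    return cost_per_thousand * articles / 1000


# Hypothetical volume, not a figure from the guide.
monthly_articles = 5_000_000
for name, price in COST_PER_1K.items():
    print(f"{name}: ${cost_at_volume(price, monthly_articles):,.2f} per month")
```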

Anthropic’s own Claude Haiku 4.5 is a striking data point. On SWE-bench Verified, a demanding software engineering benchmark, Haiku 4.5 achieves 73.3% accuracy. Real customers report it achieving 90% of Claude Sonnet 4.5’s performance on their production workloads, while running 4-5 times faster at a fraction of the cost. One customer, Gamma, reported that Haiku 4.5 actually outperformed their premium-tier model on instruction-following for slide generation—65% accuracy versus 44%. This is not a theoretical result; it is a production outcome.

The pattern that emerges is consistent: smaller models work well on tasks that are structured, verifiable, and narrow in scope. Code generation, SQL queries, document classification, information extraction. Where the task has clear success criteria and can be evaluated automatically, smaller models with appropriate fine-tuning can match frontier performance at dramatically lower cost.


The Capability Gap That Matters

Where the argument for cheaper models breaks down is precisely where the stakes are highest: complex reasoning, ambiguous real-world tasks, and situations that require genuine judgement.

Independent benchmarks make this concrete. The Artificial Analysis Intelligence Index, which evaluates models across a range of demanding tasks including graduate-level science questions, hard mathematics, and long-horizon agentic work, shows a substantial and persistent gap between frontier models and their smaller counterparts. The gap is largest on exactly the kinds of tasks that matter most in production: tasks involving genuine reasoning under uncertainty, tasks with multiple valid approaches, tasks where a wrong answer can have significant consequences.

Research into inference scaling—the idea that you can compensate for a smaller model by having it think longer, sample more solutions, or use tree-search algorithms—offers partial relief. A 2024 paper from Wu et al. showed that smaller models paired with advanced inference strategies could match larger models on mathematics and coding benchmarks with verifiable answers. This is real and useful. But the caveat is significant: it works for tasks where you can verify the answer. For open-ended generation, nuanced analysis, or tasks requiring broad world knowledge, you cannot simply make a weaker model think harder and expect it to match a stronger one.
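The simplest of these strategies, sampling several candidates and keeping the first that passes a check, shows why the verifier is the crux. This is a generic best-of-n sketch, not the method from the Wu et al. paper. The `sample` and `verify` callables are hypothetical, and the toy task (guessing a non-trivial factor of 91) is chosen only because checking an answer is trivial.

```python
import random
from typing import Callable, Optional


def best_of_n(sample: Callable[[], int], verify: Callable[[int], bool],
              n: int) -> Optional[int]:
    """Draw up to n candidates and return the first that passes verification."""
    for _ in range(n):
        candidate = sample()
        if verify(candidate):
            return candidate
    return None


def is_factor_of_91(c: int) -> bool:
    """Cheap, exact verifier: is c a non-trivial factor of 91?"""
    return 1 < c < 91 and 91 % c == 0


# A deliberately weak "model": it guesses at random (toy stand-in).
rng = random.Random(0)
answer = best_of_n(lambda: rng.randint(2, 20), is_factor_of_91, n=64)
print(answer)  # prints a factor of 91, or None if all 64 guesses miss
```

Remove `is_factor_of_91` and the loop has no way to tell a good sample from a bad one, which is the position open-ended tasks leave you in.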


The Moving Target Problem

There is one genuinely compelling argument for choosing smaller models that deserves serious consideration: the pace of improvement.

Today’s Claude Haiku is, on many benchmarks, comparable to Claude Opus from eighteen months ago. Zencoder CEO Andrew Filev observed that Haiku 4.5’s performance “would have been state-of-the-art on our internal benchmarks just six months ago.” The distillation of frontier capabilities into smaller, cheaper models is accelerating. The model that is too limited for your use case today may be entirely adequate in six months.

This creates a legitimate reason for some teams to choose smaller models, even for demanding tasks: not because the model is currently capable enough, but because the capability gap is closing fast and the economics of accepting slightly lower quality now may be favourable if you expect to re-evaluate model choices regularly.

The counter-argument is that frontier models are also improving, and the relative gap between frontier and cheap may not close as quickly as absolute performance numbers suggest. But the trajectory is real, and any analysis of model economics should account for it.


The Verdict: A False Economy in Most Cases

My own view, formed from examining the evidence, is that the instinct to choose a cheaper model is usually a mistake—not always, but usually.

For the majority of teams building real AI-powered systems, the practical recommendation should be:

Start with the most capable model you can justify. Not because cost does not matter, but because capability is the prerequisite for everything else. A model that cannot reliably perform the task does not become economical simply because it is cheap to run.

Establish what good looks like before optimising for cost. OpenAI’s workflow is sensible: use the frontier model to define your accuracy target and generate high-quality outputs. Then—and only then—test whether a smaller model or fine-tuned model can match that performance at lower cost.

Take full-stack costs seriously. The token price is the smallest part of the cost of building AI systems. Engineering time, prompt iteration, evaluation infrastructure, retry logic, and human review of failures are all costs that scale inversely with model capability. The relationship is not linear; weaker models do not just require a little more work—they can require a qualitatively different and more complex system architecture.

In agentic workflows, do not compromise on capability. The mathematics of compounding errors means that per-step accuracy improvements translate into disproportionately large improvements in end-to-end task success. This is precisely where the cost of using a less capable model is most likely to exceed the cost of the model itself.
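The second recommendation, setting the accuracy target first and only then optimising for cost, reduces to a small selection rule once you have measurements from your own eval set. The sketch below reuses the two data points from the classification example earlier. The 0.90 target is a hypothetical value, since the source does not state the actual target, and the `Candidate` type is introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float      # measured on your own eval set
    cost_per_1k: float   # dollars per thousand requests


def cheapest_meeting_target(candidates: list[Candidate],
                            target: float) -> Optional[Candidate]:
    """Cheapest candidate whose measured accuracy reaches the target, if any."""
    eligible = [c for c in candidates if c.accuracy >= target]
    return min(eligible, key=lambda c: c.cost_per_1k, default=None)


candidates = [
    Candidate("GPT-4o zero-shot", 0.845, 1.72),
    Candidate("GPT-4o-mini fine-tuned", 0.915, 0.21),
]
choice = cheapest_meeting_target(candidates, target=0.90)  # hypothetical target
print(choice.name if choice else "no candidate meets the target")
```

The order of operations matters: cost only enters as a tie-breaker among models that already meet the target, and a `None` result means going back to capability, not lowering the bar.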

The economic reality of AI development in 2026 is not that powerful models are expensive and cheap models are cheap. It is that the total cost of a system built on an inadequate model—in engineering time, in iteration cycles, in production failures, in delayed delivery—routinely exceeds the premium you would have paid for the right model at the outset.

The cheapest model is rarely the most economical choice. Recognising that distinction early, before the whack-a-mole begins, is one of the most valuable judgements a team building AI systems can make.
