June 20, 2026·12 min read·by Super Admin

Production LLM Cost Optimization: Model Choice, Prompt Caching & Output Engineering

Every LLM call in a hot path is a recurring bill, and cutting it usually means quietly trading away accuracy, speed, or insight. This is a practical guide to the four-way trade behind LLM cost optimization and the four techniques that move it on purpose: model choice, prompt caching, output engineering, and batching. With a worked, measured example.

The four-way trade

An LLM call on a hot path is not a feature you pay for once. It's a recurring bill — re-paid in tokens, latency, and someone else's rate limit on every single request. So "make it cheaper" is never the real goal. The real goal is to move cost without breaking the three things that quietly pay for cheapness: performance, accuracy, and observability.

Almost every LLM optimization is, underneath, a trade between these four. Picture them as the corners of a square. Optimization is moving one corner without collapsing the others:

graph TD
    Opt(("Optimize — pull cost down, hold the rest"))
    Cost["💰 Cost"]
    Perf["⚡ Performance"]
    Acc["🎯 Accuracy / Precision"]
    Obs["🔍 Observability"]
    Cost --- Opt
    Perf --- Opt
    Acc --- Opt
    Obs --- Opt
    style Opt fill:#4f8cf7,color:#fff
    style Cost fill:#f7c47f,color:#000
    style Perf fill:#7fc4c4,color:#000
    style Acc fill:#7fc47f,color:#000
    style Obs fill:#cf7fcf,color:#fff

Corner	What it means	How you measure it
💰 Cost	tokens × price, per unit of value	$ / request and $ / useful result; split by token class
⚡ Performance	can it keep up with the load?	rate limit (req/s), max concurrency, latency
🎯 Accuracy / Precision	is it still right?	recall (caught the real ones) + precision (flagged only real ones)
🔍 Observability	can you see why it decided?	per-decision explanation / structured trace

The discipline is the whole point: decide which corners are inviolable, which you'll spend, and prove the trade with numbers — instead of swapping models, eyeballing a few outputs, and hoping.

Four techniques move the square, in order of payoff — the first three shrink the cost of each call; the fourth changes how calls are scheduled. Here's the map; the rest of the guide is one section each.

Technique	The lever	💰 Cost	⚡ Perf	🎯 Accuracy	🔍 Observability
1. Model choice	the model + its tier	⬇⬇⬇	⬆⬇	↔	—
2. Cache-aware prompt	prompt geometry (input tokens)	⬇⬇	↔	↔ (free)	—
3. Output-aware design	what you ask it to emit (output tokens)	⬇⬇	↔	↔	⬇ (traded)
4. Batching + scheduler	how requests are scheduled (throughput)	⬇⬇	⬇ (latency)	↔	↔

The running example. Throughout, one real pipeline grounds each idea: Guliel scans a customer's connected inbox for expenses. One LLM call per email decides "is this an expense, and what are the numbers?" — it runs on every email, so it's the textbook hot-path call. Every figure below is measured on the same 465 real emails with real published prices.

graph LR
    Inbox["📥 inbox"]
    FE["fetch"]
    DE["classify + extract (the LLM call)"]
    OE["OCR attachment"]
    CR["create expense"]
    Skip["skip"]
    Inbox --> FE --> DE
    DE -->|"expense + file"| OE --> CR
    DE -->|"expense in body"| CR
    DE -->|"not an expense"| Skip
    style Inbox fill:#f7c47f,color:#000
    style FE fill:#4f8cf7,color:#fff
    style DE fill:#7fc47f,color:#000
    style OE fill:#cf7fcf,color:#fff
    style CR fill:#4f8cf7,color:#fff
    style Skip fill:#9a9a9a,color:#fff

Rule zero: you can't trade what you can't measure

Before any technique, build the instrument — a test that scores all four corners on the same run, over real inputs. Skip this and you're not optimizing, you're guessing. A real test captures four things; most capture none:

Capture	Why it's not optional
Token split by price class	"Tokens" is three numbers priced ~100× apart: cache-hit input, cache-miss input, output. A flat rate hides the biggest lever.
Cost per unit of value	$ / request lies. Divide by the thing the user wants — $ / captured result. A cheap call that misses half is expensive.
Precision + recall	Cost without quality is meaningless. Score it against a trusted baseline, on the same run.
Throughput	Cheap-per-token but capped at 0.25 req/s is expensive in wall-clock. Concurrency is a result, not a footnote.

Cost is a dot product of the token split with the price sheet — never one number times one rate:

type Usage = {
  prompt_cache_hit_tokens: number;   // input you already paid to compute → cheap
  prompt_cache_miss_tokens: number;  // input computed fresh → full price
  completion_tokens: number;         // output (a "reasoning" model thinks in these)
};

const cost = (u: Usage, price: { hit: number; miss: number; out: number }) =>
    u.prompt_cache_hit_tokens  / 1e6 * price.hit   // e.g. $0.0036 / 1M
  + u.prompt_cache_miss_tokens / 1e6 * price.miss  // e.g. $0.435  / 1M  ← ~120× the hit
  + u.completion_tokens        / 1e6 * price.out;  // e.g. $0.87   / 1M
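
To make "cost per unit of value" and "precision + recall" concrete, here is a minimal sketch of a scoring step that reports them from the same run. The `Scored` row shape and the `summarize` helper are hypothetical names for illustration, and the baseline label (`actual`) is assumed to come from whatever trusted reference you score against.

```typescript
// Hypothetical harness row: one entry per evaluated input.
type Scored = {
  predicted: boolean; // the candidate model flagged it
  actual: boolean;    // the trusted baseline says it is real
  costUsd: number;    // cost of this call, from the token split
};

function summarize(rows: Scored[]) {
  let tp = 0, fp = 0, fn = 0, totalCost = 0;
  for (const r of rows) {
    totalCost += r.costUsd;
    if (r.predicted && r.actual) tp++;        // caught a real one
    else if (r.predicted && !r.actual) fp++;  // flagged a look-alike
    else if (!r.predicted && r.actual) fn++;  // missed a real one
  }
  return {
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    costPerRequest: rows.length === 0 ? 0 : totalCost / rows.length,
    // Divide by captured results, not by calls: misses make this number worse.
    costPerCaptured: tp === 0 ? Infinity : totalCost / tp,
  };
}
```

A model that is cheap per request but misses half the real items shows up here as a high `costPerCaptured`, which is the number the table above argues for.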

Two traps will fake your numbers — both cost me a wrong conclusion before I caught them:

🪤 Cache contamination. Prompt caches persist between runs. Re-run the same test prompts and they cache at ~100% — a gorgeous number that has nothing to do with production, where every input is new. Fix: prepend a unique per-run nonce so each candidate measures its true caching.
🪤 Non-determinism. Many reasoning models ignore temperature=0 (or don't expose it at all) and re-sample on every call, so a single before/after can't tell a real change from a coin flip. Fix: an A/A run (the identical prompt twice) gives you a noise floor. A change is real only if it clears that floor.
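
Both fixes fit in a few lines. The sketch below is illustrative only: `makeRunNonce`, `withNonce`, and `isRealChange` are hypothetical helpers, the nonce format is arbitrary, and a single A/A pair is a crude noise floor (several repeats would give a better one).

```typescript
// One nonce per test run: shared by every call in the run, different across runs.
// Calls inside the run can still share a cached prefix; earlier runs cannot leak in.
function makeRunNonce(): string {
  return `run-${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

function withNonce(prompt: string, nonce: string): string {
  return `[eval ${nonce}]\n${prompt}`;
}

// Relative difference between two measurements of the same metric.
function relativeDelta(a: number, b: number): number {
  return a === 0 ? 0 : Math.abs(a - b) / Math.abs(a);
}

// A change counts only if it is larger than the gap between two identical (A/A) runs.
function isRealChange(before: number, after: number, aaRun1: number, aaRun2: number): boolean {
  return relativeDelta(before, after) > relativeDelta(aaRun1, aaRun2);
}
```

For example, if two identical runs differ by 3% on cost, a candidate that improves cost by 2% has not shown anything yet.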

Technique 1 — Model choice: the biggest lever, and it isn't only price

A model is a single point in three dimensions: price, throughput, and quality. The cheapest token loses if the model can't keep up or can't tell a receipt from a newsletter. So pick by your binding constraint, not the price-per-token column:

If your bottleneck is…	optimize for…	watch out for…
Bursty volume / a queue draining	throughput (concurrency, req/s)	premium models that are fast and pricey
Steady high volume	price per token	cheap models that rate-limit
Ambiguous inputs	precision	cheap models that over-flag

In practice. Our scan's model was swapped twice — and the cost was non-monotonic (it went up before it came down), because each switch fixed one corner and surrendered another:

model	$ / mailbox	throughput	quality	why we moved
mistral-large (original)	~$0.66	~0.25 req/s ⚠	over-flags	too slow: 50% of calls 429'd
glm-5.1	~$1.64	10 concurrent	over-flags	fixed speed, 3.7× the cost
deepseek-v4-pro (now)	~$0.40	500 concurrent	best	first to win cost and speed

The predecessors flagged more email, but the extras were look-alikes ("rate your hotel stay," "your trial is ending"): lower precision, not better recall. Where it counted, all three extracted the same amounts. The upgrade traded nothing real away.

Takeaway: measure all three axes at once. A model-choice "win" on price that silently loses throughput or precision is a loss you'll discover in production.
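
"Pick by your binding constraint" can be written down as a filter followed by a sort: first discard every model that fails the constraint, then take the cheapest survivor. The sketch below uses made-up candidates (`model-a` to `model-d`) with invented numbers purely to show the shape; none of these figures are measurements.

```typescript
type Candidate = {
  name: string;
  usdPerMailbox: number; // measured cost per unit of work
  maxConcurrent: number; // measured sustainable concurrency
  precision: number;     // measured against the trusted baseline
};

type Constraint = { minConcurrent: number; minPrecision: number };

// Cheapest model that satisfies the binding constraints, or undefined if none does.
function pickModel(candidates: Candidate[], need: Constraint): Candidate | undefined {
  return candidates
    .filter((c) => c.maxConcurrent >= need.minConcurrent && c.precision >= need.minPrecision)
    .sort((a, b) => a.usdPerMailbox - b.usdPerMailbox)[0];
}

// Illustrative values only.
const CANDIDATES: Candidate[] = [
  { name: "model-a", usdPerMailbox: 0.3, maxConcurrent: 1, precision: 0.95 },   // cheap, rate-limited
  { name: "model-b", usdPerMailbox: 1.5, maxConcurrent: 100, precision: 0.8 },  // fast, over-flags
  { name: "model-c", usdPerMailbox: 0.6, maxConcurrent: 200, precision: 0.93 },
  { name: "model-d", usdPerMailbox: 0.9, maxConcurrent: 300, precision: 0.96 },
];
```

With no constraints the cheapest token wins (`model-a`); require 50 concurrent calls and 0.9 precision and the answer changes to `model-c`, which is the point of the section.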

Technique 2 — Cache-aware prompts: stop re-paying for your own instructions

Many providers cache the longest identical leading prefix of a request and bill it at a discounted hit rate: ~120× cheaper than a fresh miss on the price sheet above, though the discount varies by provider. One rule follows, and it's geometric:

Every token before the first byte that changes between two calls can be nearly free. Every token after it is full price.

Most prompts are written for the human reading them, so they splice per-request data into the middle of the instructions — and every splice is a cache guillotine that re-bills the identical rules after it on every call:

The cache boundary sits wherever the prompt first changes between calls. Green = cached at the hit rate; red = re-billed at full price, every call. A per-request line in the middle drags the boundary up; moving it to the end pushes the boundary down — caching the whole rule block. Same words, only order:

graph LR
    subgraph BEFORE["❌ Before — a mid-prompt variable cuts the cache early"]
        direction TB
        b1["Rules — part 1"] -->|"✂ cache breaks"| b2["Attachments: receipt-4823.pdf"] --> b3["Rules — part 2"] --> b4["The input"]
    end
    subgraph AFTER["✅ After — whole instruction block static, variables last"]
        direction TB
        a1["Rules — part 1"] --> a2["Rules — part 2"] -->|"✂ cache breaks"| a3["Attachments: receipt-4823.pdf"] --> a4["The input"]
    end
    style b1 fill:#7fc47f,color:#000
    style b2 fill:#f77f7f,color:#fff
    style b3 fill:#f77f7f,color:#fff
    style b4 fill:#f77f7f,color:#fff
    style a1 fill:#7fc47f,color:#000
    style a2 fill:#7fc47f,color:#000
    style a3 fill:#f77f7f,color:#fff
    style a4 fill:#f77f7f,color:#fff

"After" caches its entire rule block on every call; "before" re-pays for half of it, forever — with byte-identical behavior either way.

What it looks like in code. Two interpolations sink most prompts — and both are things you'd write without a second thought:

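// Assumed shapes for this example; the real Email type and today() helper are not shown here
type Email = { body: string; hasAttachments: boolean; attachmentNames: string[] };
const today = (): string => new Date().toISOString().slice(0, 10); // YYYY-MM-DD

// ❌ BEFORE: two everyday interpolations, both in the wrong place
function buildPrompt(email: Email): string {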
  return `Current date: ${today()}
You are an expense classifier. Rules:
[ ...hundreds of tokens of rules, identical on every call... ]
${email.hasAttachments ? `Attachments: ${email.attachmentNames.join(", ")}` : ""}
[ ...more rules... ]

Email:
${email.body}`;
}

Two cache-killers hide in plain sight:

The Current date: line at the top changes every day, so the prefix differs from the first line onward: whenever the date rolls over, the entire prompt, rules and all, misses the cache (and if today() returned a timestamp, it would miss on every call). This is the most common version of the mistake: Today is {date} pinned to the top of a system prompt.
The hasAttachments ternary in the middle cuts the cache right there — every rule after it is re-billed on every call, even though it never changes.

The fix is to make the instruction block a constant and let every variable fall to a tail after it:

// ✅ AFTER — one constant instruction block; all variables in the tail
const SYSTEM = `You are an expense classifier. Rules:
[ ...all the rules, identical on every call... ]
- If attachments are listed below, treat the email as likely financial.`;

function buildPrompt(email: Email): string {
  return `${SYSTEM}

Current date: ${today()}
Attachments: ${email.attachmentNames.join(", ") || "none"}
Email:
${email.body}`;
}

SYSTEM is byte-identical on every call, so it caches in full; today(), the attachment list, and the body — the only things that vary — sit after the cached prefix, costing what they should and nothing more.
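
A rough way to see where the cache boundary falls is to build the prompt for two different emails and measure how much of the text they share from the start. The sketch below does that for a mid-prompt layout and a variables-last layout. It counts characters as a proxy; real providers cache on tokens, often in fixed-size blocks and above a minimum length, so treat the result as an upper bound. `RULES_1`, `RULES_2`, and the two layout functions are stand-ins, not the real prompt.

```typescript
// Length of the identical leading run of two strings.
function sharedPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Stand-in rule blocks, identical on every call.
const RULES_1 = "Rules part 1. ".repeat(20);
const RULES_2 = "Rules part 2. ".repeat(20);

// Variable spliced into the middle of the rules.
const layoutBefore = (attachments: string, body: string): string =>
  `${RULES_1}\nAttachments: ${attachments}\n${RULES_2}\nEmail:\n${body}`;

// All rules first, variables in the tail.
const layoutAfter = (attachments: string, body: string): string =>
  `${RULES_1}\n${RULES_2}\nAttachments: ${attachments}\nEmail:\n${body}`;

const cacheableBefore = sharedPrefixLength(
  layoutBefore("a.pdf", "first email"),
  layoutBefore("b.pdf", "second email"),
);
const cacheableAfter = sharedPrefixLength(
  layoutAfter("a.pdf", "first email"),
  layoutAfter("b.pdf", "second email"),
);
console.log({ cacheableBefore, cacheableAfter });
```

Run as a unit test against the real prompt builder, a check like this can catch a new interpolation that quietly moves the boundary back up.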

In practice. Reordering alone lifted the input cache-hit rate 41% → 51% and cut cost 9%, with provably identical output. The cheapest win in the whole guide: you're not asking the model to do anything different — you're letting the cache do its job.

This technique is "free" on the square: cost ⬇, accuracy and observability untouched. Always do it first.

Technique 3 — Output-aware design: the cost you forget, traded for the insight you don't need

Caching attacks the input. But the input is rarely where the money is — the output is, and you can never cache it. On a reasoning model the bill is dominated by reasoning_tokens (the model "thinking," billed at the full output rate). You can't cache thinking. You can only ask for less.

This is the one technique that touches the Observability corner — so it's where the trade gets explicit. Two cuts:

Drop fields the consumer never reads. We asked for a human-readable reason on every call. On a reasoning model the rationale already exists in the thinking phase — re-emitting it in the answer just re-pays for it in expensive output tokens.
Short-circuit the common case. ~80% of email is not an expense. For those, every field but is_invoice: false is dead weight. Return only that.
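
On the consuming side, the two cuts map naturally onto a discriminated union: the negative case carries a single field, and only the positive case carries data. This is a sketch under assumptions: `is_invoice` comes from the text above, while `amount` and `currency` are hypothetical placeholders for whatever the real extraction returns.

```typescript
// The negative branch is the whole answer for the common case.
type ScanResult =
  | { is_invoice: false }
  | { is_invoice: true; amount: number; currency: string }; // hypothetical fields

function parseScanResult(raw: string): ScanResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  const obj = data as Record<string, unknown>;

  // Short-circuit: ignore anything else the model may have emitted.
  if (obj.is_invoice === false) return { is_invoice: false };

  if (
    obj.is_invoice === true &&
    typeof obj.amount === "number" &&
    typeof obj.currency === "string"
  ) {
    return { is_invoice: true, amount: obj.amount, currency: obj.currency };
  }
  throw new Error("malformed scan result");
}
```

Because nothing downstream reads a `reason` field, the type does not have one, which keeps the prompt and the consumer honest about what is actually needed.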

In practice. Asking for the minimum didn't just shrink the answer — telling a reasoning model to be terse made it reason less: output −34%, reasoning tokens −29%, cost another −18%.

Here is the trade, stated honestly. Dropping reason cost us observability — that string was the per-email explanation in our scan log ("Marketing newsletter," "loyalty points"). Without it the log reads generic ("Not an expense"). The principle we used to decide:

Keep observability where it's diagnostic. Pay for it where it's load-bearing. Don't bill the user for it on the hot path.

Where	Keep `reason`?	Why
Internal eval / debugging a sample	✅ yes	we're reading it, on our own dime
Production, every email, forever	❌ no	cost with no consumer — the purest waste

The harness can switch reason back on for internal runs. In production it was cost without a reader — so it went. That's the move: not "remove observability," but locate it — diagnostic detail belongs in the path you run a thousand times, not the one you run a billion.
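
One way to keep that switch cheap is to generate the output instructions from a single flag, so the eval harness and production share everything except the `reason` request. The function name and the instruction wording below are assumptions for illustration, not the prompt used in the scan.

```typescript
// includeReason = true for internal eval runs, false on the production hot path.
function outputInstructions(includeReason: boolean): string {
  const negative = includeReason
    ? '{"is_invoice": false, "reason": "<short explanation>"}'
    : '{"is_invoice": false}';
  const extra = includeReason ? ', plus a short "reason" string' : "";
  return [
    `If the email is not an expense, reply with only ${negative}.`,
    `Otherwise reply with a JSON object with is_invoice: true and the extracted fields${extra}.`,
  ].join("\n");
}
```

Note that the two variants are different prompt text, so they cache separately; that is fine when the diagnostic variant only runs on a sample.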

Technique 4 — Batching: trade latency for half the bill

The first three techniques shrink the cost of each call. Batching changes the price of a call by changing how you send it: most providers run an asynchronous batch endpoint at roughly half price for work that doesn't need an answer this second.

Provider	Batch endpoint	Typical discount
Google Gemini	Batch Mode	~50% (2×)
OpenAI	Batch API	~50%
Anthropic	Message Batches	~50%
Self-hosted (vLLM / TGI)	continuous batching	max GPU utilization

The trade is latency, and it's a real one: a batch is only finished when every request in it is finished, so individual results arrive later. That's fatal for an interactive chat — and nearly free for anything asynchronous: a nightly job, a queue drain, an inbox scan. If no human is watching a spinner, batching is close to a pure cost win.

The scheduler is where it lives. If your pipeline already rate-limits its LLM calls through a queue — and at production volume it should — that queue is the natural place to batch. Instead of firing one call per item, the scheduler collects items until it has N of them (or T milliseconds pass), sends one batch request, and fans the results back out:

graph LR
    R1["request"] --> Q
    R2["request"] --> Q
    R3["request"] --> Q
    Q["scheduler — collect until N items or T ms"] --> B["one batch call — ~½ price"]
    B --> O1["result"]
    B --> O2["result"]
    B --> O3["result"]
    style R1 fill:#f7c47f,color:#000
    style R2 fill:#f7c47f,color:#000
    style R3 fill:#f7c47f,color:#000
    style Q fill:#7fc4c4,color:#000
    style B fill:#4f8cf7,color:#fff
    style O1 fill:#7fc47f,color:#000
    style O2 fill:#7fc47f,color:#000
    style O3 fill:#7fc47f,color:#000
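
The collect-until-N-or-T scheduler in the diagram can be sketched as a small class. This is a minimal illustration, not the pipeline's actual scheduler: `MicroBatcher` and `sendBatch` are hypothetical names, `sendBatch` is assumed to return results in the same order as its inputs, and retries, partial failures, and the provider's own batch job lifecycle are all left out.

```typescript
// Sends many items in one call; assumed to return results in input order.
type SendBatch<T, R> = (items: T[]) => Promise<R[]>;

type Pending<T, R> = {
  item: T;
  resolve: (result: R) => void;
  reject: (err: unknown) => void;
};

class MicroBatcher<T, R> {
  private pending: Pending<T, R>[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private sendBatch: SendBatch<T, R>;
  private maxItems: number;  // N: flush as soon as this many items are waiting
  private maxWaitMs: number; // T: flush at most this long after the first item

  constructor(sendBatch: SendBatch<T, R>, maxItems: number, maxWaitMs: number) {
    this.sendBatch = sendBatch;
    this.maxItems = maxItems;
    this.maxWaitMs = maxWaitMs;
  }

  // Callers get a promise for their own result; batching is invisible to them.
  submit(item: T): Promise<R> {
    return new Promise<R>((resolve, reject) => {
      this.pending.push({ item, resolve, reject });
      if (this.pending.length >= this.maxItems) {
        this.flush();
      } else if (this.timer === null) {
        this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
      }
    });
  }

  // Send whatever is waiting as one batch and fan the results back out.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.sendBatch(batch.map((p) => p.item)).then(
      (results) => batch.forEach((p, i) => p.resolve(results[i] as R)),
      (err) => batch.forEach((p) => p.reject(err)),
    );
  }
}
```

The two knobs are the trade from this section in miniature: a larger N or T means fuller batches and a longer wait for the first item in each one.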

The same idea reaches past hosted APIs. On a self-hosted model, batching is how you actually use the GPU you're paying for: modern serving stacks (vLLM, TGI) do continuous batching out of the box — interleaving many requests through the forward pass instead of one-at-a-time — but the throughput you realize still depends on tuning it (batch size, max tokens, how long you'll wait to fill a batch). Idle compute is the most expensive token of all.

Where it fits. Batching is the lever for high-volume, latency-tolerant work — exactly the shape of an async pipeline like the inbox scan, where a worker already drains a rate-limited queue (the scheduler is right there, waiting to be taught to batch). It's the one technique here that spends performance for cost instead of getting it free — so it earns its place only where latency is slack. Where it is, ~2× is on the table.

Putting it together: what to trade, what to refuse

Stack the three per-call techniques and the per-mailbox cost walked down like this — while throughput went up and accuracy never moved (batching, the fourth lever, we'd reach for next as volume grows):

step	change	$ / mailbox
original	mistral-large	~$0.66 (rate-capped)
model	→ deepseek-v4-pro	~$0.52
cache	+ static-prefix prompt	~$0.47
output	+ minimized output	~$0.38
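
For readers who want the walk-down as percentages, the arithmetic is a few lines. The dollar figures are copied from the table above and are approximate in the source, so the per-step percentages computed here from rounded values will not match the measured 9% and 18% quoted earlier exactly; the overall reduction comes out at roughly 42%.

```typescript
// Figures copied from the table above; all are approximate ("~") in the source.
const steps = [
  { step: "original", usd: 0.66 },
  { step: "model", usd: 0.52 },
  { step: "cache", usd: 0.47 },
  { step: "output", usd: 0.38 },
];

// Percentage change from one cost to the next (negative means cheaper).
function pctChange(from: number, to: number): number {
  return ((to - from) / from) * 100;
}

for (let i = 1; i < steps.length; i++) {
  const delta = pctChange(steps[i - 1].usd, steps[i].usd);
  console.log(`${steps[i].step}: ${delta.toFixed(1)}%`);
}

const total = pctChange(steps[0].usd, steps[steps.length - 1].usd);
console.log(`total: ${total.toFixed(1)}%`);
```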

But the number isn't the lesson. The lesson is the posture you take on each corner — decided on purpose, proven with the test:

Corner	Our posture	The general rule
🎯 Accuracy	inviolable	a cheaper pipeline that loses correctness isn't cheaper — it's broken
⚡ Performance	kept — except latency	paid for concurrency to keep up; the one slice we'd spend is latency (batching), where it's slack
🔍 Observability	spent — surgically	forfeited on the hot path, kept in the lab
💰 Cost	the thing we pulled down	every multiple out is a multiple the user doesn't pay

Your feature may answer differently. A medical or legal classifier might keep every reason token and eat the cost, because there the explanation is the product. The framework doesn't tell you what to value — it forces you to price each thing you value, so the trade is a decision you made, not a cost you absorbed by accident.

And there's a reason to pull cost down that has nothing to do with margin: inference cost is priced into what users pay. Every multiple you remove is a multiple off their bill — a cheaper product, a wider audience, a system more people can afford to run. Optimization, done right, isn't how you keep more. It's how you let more people in.

If you're the kind of engineer who read this far, there's a part of Guliel built for you: a fully typed REST API and an MCP server, so you can wire your own financial operations — issue documents, pull reports, reconcile and scan expenses — straight from your own code or an AI agent. The same discipline that keeps our inference honest is what keeps that surface clean enough to hand to you.

Explore the Guliel API & MCP →

— Sapir
