Production LLM Cost Optimization: Model Choice, Prompt Caching & Output Engineering
Every LLM call in a hot path is a recurring bill, and cutting it usually means quietly trading away accuracy, speed, or insight. This is a practical guide to the four-way trade behind LLM cost optimization and the three per-call techniques (model choice, prompt caching, output engineering) that move it on purpose, plus batching as a fourth, scheduling-level lever, with a worked, measured example.
The four-way trade
An LLM call on a hot path is not a feature you pay for once. It's a recurring bill — re-paid in tokens, latency, and someone else's rate limit on every single request. So "make it cheaper" is never the real goal. The real goal is to move cost without breaking the three things that quietly pay for cheapness: performance, accuracy, and observability.
Almost every LLM optimization is, underneath, a trade between these four. Picture them as the corners of a square. Optimization is moving one corner without collapsing the others:
graph TD
Opt(("Optimize — pull cost down, hold the rest"))
Cost["💰 Cost"]
Perf["⚡ Performance"]
Acc["🎯 Accuracy / Precision"]
Obs["🔍 Observability"]
Cost --- Opt
Perf --- Opt
Acc --- Opt
Obs --- Opt
style Opt fill:#4f8cf7,color:#fff
style Cost fill:#f7c47f,color:#000
style Perf fill:#7fc4c4,color:#000
style Acc fill:#7fc47f,color:#000
style Obs fill:#cf7fcf,color:#fff
Corner | What it means | How you measure it |
|---|---|---|
💰 Cost | tokens × price, per unit of value | $ / request and $ / useful result; split by token class |
⚡ Performance | can it keep up with the load? | rate limit (req/s), max concurrency, latency |
🎯 Accuracy / Precision | is it still right? | recall (caught the real ones) + precision (flagged only real ones) |
🔍 Observability | can you see why it decided? | per-decision explanation / structured trace |
The discipline is the whole point: decide which corners are inviolable, which you'll spend, and prove the trade with numbers — instead of swapping models, eyeballing a few outputs, and hoping.
Four techniques move the square, in order of payoff — the first three shrink the cost of each call; the fourth changes how calls are scheduled. Here's the map; the rest of the guide is one section each.
Technique | The lever | 💰 Cost | ⚡ Perf | 🎯 Accuracy | 🔍 Observability |
|---|---|---|---|---|---|
1. Model choice | the model + its tier | ⬇⬇⬇ | ⬆⬇ | ↔ | — |
2. Cache-aware prompt | prompt geometry (input tokens) | ⬇⬇ | ↔ | ↔ (free) | — |
3. Output-aware design | what you ask it to emit (output tokens) | ⬇⬇ | ↔ | ↔ | ⬇ (traded) |
4. Batching + scheduler | how requests are scheduled (throughput) | ⬇⬇ | ⬇ (latency) | ↔ | ↔ |
The running example. Throughout, one real pipeline grounds each idea: Guliel scans a customer's connected inbox for expenses. One LLM call per email decides "is this an expense, and what are the numbers?" — it runs on every email, so it's the textbook hot-path call. Every figure below is measured on the same 465 real emails with real published prices.
graph LR
Inbox["📥 inbox"]
FE["fetch"]
DE["classify + extract (the LLM call)"]
OE["OCR attachment"]
CR["create expense"]
Skip["skip"]
Inbox --> FE --> DE
DE -->|"expense + file"| OE --> CR
DE -->|"expense in body"| CR
DE -->|"not an expense"| Skip
style Inbox fill:#f7c47f,color:#000
style FE fill:#4f8cf7,color:#fff
style DE fill:#7fc47f,color:#000
style OE fill:#cf7fcf,color:#fff
style CR fill:#4f8cf7,color:#fff
style Skip fill:#9a9a9a,color:#fff
Rule zero: you can't trade what you can't measure
Before any technique, build the instrument — a test that scores all four corners on the same run, over real inputs. Skip this and you're not optimizing, you're guessing. A real test captures four things; most capture none:
Capture | Why it's not optional |
|---|---|
Token split by price class | "Tokens" is three numbers priced ~100× apart: cache-hit input, cache-miss input, output. A flat rate hides the biggest lever. |
Cost per unit of value | $ / request lies. Divide by the thing the user wants — $ / captured result. A cheap call that misses half is expensive. |
Precision + recall | Cost without quality is meaningless. Score it against a trusted baseline, on the same run. |
Throughput | Cheap-per-token but capped at 0.25 req/s is expensive in wall-clock. Concurrency is a result, not a footnote. |
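The quality and cost-per-value rows of that table can be computed in one pass over the test set. The following is a minimal sketch, not the harness used for the figures in this guide: the `Scored` row shape, the `score` function, and all field names are hypothetical, and it assumes every test email has already been labeled by a trusted baseline.

```typescript
// Hypothetical row shape: one entry per test email, candidate vs. trusted baseline.
type Scored = {
  baselineIsExpense: boolean;  // what the trusted baseline says
  candidateIsExpense: boolean; // what the candidate model says
  costUsd: number;             // measured cost of the candidate call
};

type Report = {
  precision: number;      // flagged only real ones
  recall: number;         // caught the real ones
  usdPerRequest: number;  // total cost / number of calls
  usdPerCaptured: number; // total cost / correctly captured expenses
};

function score(rows: Scored[]): Report {
  let tp = 0, fp = 0, fn = 0, total = 0;
  for (const r of rows) {
    total += r.costUsd;
    if (r.candidateIsExpense && r.baselineIsExpense) tp++;
    else if (r.candidateIsExpense && !r.baselineIsExpense) fp++;
    else if (!r.candidateIsExpense && r.baselineIsExpense) fn++;
  }
  return {
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    usdPerRequest: rows.length === 0 ? 0 : total / rows.length,
    // A run that captures nothing has no finite cost per captured result.
    usdPerCaptured: tp === 0 ? Infinity : total / tp,
  };
}
```

Dividing by true positives rather than by requests is what makes a cheap model that misses half the expenses show up as expensive.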
Cost is a dot product of the token split with the price sheet — never one number times one rate:
type Usage = {
prompt_cache_hit_tokens: number; // input you already paid to compute → cheap
prompt_cache_miss_tokens: number; // input computed fresh → full price
completion_tokens: number; // output (a "reasoning" model thinks in these)
};
const cost = (u: Usage, price: { hit: number; miss: number; out: number }) =>
u.prompt_cache_hit_tokens / 1e6 * price.hit // e.g. $0.0036 / 1M
+ u.prompt_cache_miss_tokens / 1e6 * price.miss // e.g. $0.435 / 1M ← ~120× the hit
+ u.completion_tokens / 1e6 * price.out; // e.g. $0.87 / 1M
Two traps will fake your numbers; both cost me a wrong conclusion before I caught them:
🪤 Cache contamination. Prompt caches persist between runs. Re-run the same test prompts and they cache at ~100% — a gorgeous number that has nothing to do with production, where every input is new. Fix: prepend a unique per-run nonce so each candidate measures its true caching.
🪤 Non-determinism. A reasoning model ignores temperature=0 and re-samples every call, so a single before/after can't tell a real change from a coin flip. Fix: an A/A run (the identical prompt twice) gives you a noise floor. A change is real only if it clears that floor.
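Both fixes are small enough to sketch. The helper names here (`withRunNonce`, `relativeDelta`, `isRealChange`) are hypothetical, and the noise floor is taken as the relative gap between two A/A runs, which is one simple choice among several. Because the nonce is per run rather than per request, prompts inside one run still share a prefix and cache against each other; only hits left over from earlier runs are excluded.

```typescript
// Prepend a per-run nonce so nothing cached by an earlier run can match this run's prefix.
function withRunNonce(runId: string, prompt: string): string {
  return `[run ${runId}]\n${prompt}`;
}

// Relative gap between two measurements of the same metric (cost, tokens, ...).
function relativeDelta(a: number, b: number): number {
  return Math.abs(a - b) / Math.max(Math.abs(a), Math.abs(b), Number.EPSILON);
}

// A change counts as real only if it exceeds the gap between two identical (A/A) runs.
function isRealChange(
  baseline: number,
  candidate: number,
  aaRunA: number,
  aaRunB: number,
): boolean {
  const noiseFloor = relativeDelta(aaRunA, aaRunB);
  return relativeDelta(baseline, candidate) > noiseFloor;
}
```

A single A/A pair gives a rough floor; repeating it a few times and taking the largest gap would be a more conservative variant.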
Technique 1 — Model choice: the biggest lever, and it isn't only price
A model is a single point in three dimensions: price, throughput, and quality. The cheapest token loses if the model can't keep up or can't tell a receipt from a newsletter. So pick by your binding constraint, not the price-per-token column:
If your bottleneck is… | optimize for… | watch out for… |
|---|---|---|
Bursty volume / a queue draining | throughput (concurrency, req/s) | premium models that are fast and pricey |
Steady high volume | price per token | cheap models that rate-limit |
Ambiguous inputs | precision | cheap models that over-flag |
In practice. Our scan's model was swapped twice — and the cost was non-monotonic (it went up before it came down), because each switch fixed one corner and surrendered another:
model | $ / mailbox | throughput | quality | why we moved |
|---|---|---|---|---|
mistral-large (original) | ~$0.66 | ~0.25 req/s ⚠ | over-flags | too slow: 50% of calls 429'd |
glm-5.1 | ~$1.64 | 10 concurrent | over-flags | fixed speed, 3.7× the cost |
deepseek-v4-pro (now) | ~$0.40 | 500 concurrent | best | first to win cost and speed |
The predecessors flagged more email — but the extras were look-alikes ("rate your hotel stay," "your trial is ending"): lower precision, not better recall. Where it counted, all three extracted the same amounts. The upgrade traded nothing real away.
Takeaway: measure all three axes at once. A model-choice "win" on price that silently loses throughput or precision is a loss you'll discover in production.
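One way to make "pick by your binding constraint" mechanical is to treat throughput and precision as floors and minimize price only among the candidates that clear them. This is an illustrative sketch: the `Candidate` and `Floors` shapes and the `pickModel` function are hypothetical, and the inputs are assumed to come from your own measured test run.

```typescript
// Hypothetical measured profile of one candidate model.
type Candidate = {
  name: string;
  usdPerMailbox: number; // measured cost per unit of work
  maxConcurrent: number; // measured sustainable concurrency
  precision: number;     // 0..1, scored against the trusted baseline
};

// The corners declared inviolable for this feature.
type Floors = { minConcurrent: number; minPrecision: number };

// Cheapest candidate that clears every floor, or undefined if none does.
function pickModel(candidates: Candidate[], floors: Floors): Candidate | undefined {
  return candidates
    .filter(c => c.maxConcurrent >= floors.minConcurrent && c.precision >= floors.minPrecision)
    .sort((a, b) => a.usdPerMailbox - b.usdPerMailbox)[0];
}
```

Returning `undefined` when nothing qualifies is deliberate: it forces a decision about which floor to relax instead of silently picking the cheapest model.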
Technique 2 — Cache-aware prompts: stop re-paying for your own instructions
Providers cache the longest identical leading prefix of a request and bill it at the hit rate — often ~100× cheaper than a fresh miss. One rule follows, and it's geometric:
Every token before the first byte that changes between two calls can be nearly free. Every token after it is full price.
Most prompts are written for the human reading them, so they splice per-request data into the middle of the instructions. Every such splice is a cache guillotine that re-bills the identical rules after it on every call.
The cache boundary sits wherever the prompt first changes between calls. Green = cached at the hit rate; red = re-billed at full price, every call. A per-request line in the middle drags the boundary up; moving it to the end pushes the boundary down — caching the whole rule block. Same words, only order:
graph LR
subgraph BEFORE["❌ Before — a mid-prompt variable cuts the cache early"]
direction TB
b1["Rules — part 1"] -->|"✂ cache breaks"| b2["Attachments: receipt-4823.pdf"] --> b3["Rules — part 2"] --> b4["The input"]
end
subgraph AFTER["✅ After — whole instruction block static, variables last"]
direction TB
a1["Rules — part 1"] --> a2["Rules — part 2"] -->|"✂ cache breaks"| a3["Attachments: receipt-4823.pdf"] --> a4["The input"]
end
style b1 fill:#7fc47f,color:#000
style b2 fill:#f77f7f,color:#fff
style b3 fill:#f77f7f,color:#fff
style b4 fill:#f77f7f,color:#fff
style a1 fill:#7fc47f,color:#000
style a2 fill:#7fc47f,color:#000
style a3 fill:#f77f7f,color:#fff
style a4 fill:#f77f7f,color:#fff
"After" caches its entire rule block on every call; "before" re-pays for half of it, forever, with byte-identical behavior either way.
What it looks like in code. Two interpolations sink most prompts — and both are things you'd write without a second thought:
// ❌ BEFORE — two everyday interpolations, both in the wrong place
function buildPrompt(email: Email): string {
return `Current date: ${today()}
You are an expense classifier. Rules:
[ ...hundreds of tokens of rules, identical on every call... ]
${email.hasAttachments ? `Attachments: ${email.attachmentNames.join(", ")}` : ""}
[ ...more rules... ]
Email:
${email.body}`;
}
Two cache-killers hide in plain sight:
Current date: at the top changes daily, so the prefix differs from byte one as soon as the date rolls over, and everything cached the day before, rules and all, is invalidated. This is the most common version of the mistake: Today is {date} pinned to the top of a system prompt.
The hasAttachments ternary in the middle cuts the cache right there: every rule after it is re-billed on every call, even though it never changes.
The fix is to make the instruction block a constant and let every variable fall to a tail after it:
// ✅ AFTER — one constant instruction block; all variables in the tail
const SYSTEM = `You are an expense classifier. Rules:
[ ...all the rules, identical on every call... ]
- If attachments are listed below, treat the email as likely financial.`;
function buildPrompt(email: Email): string {
return `${SYSTEM}
Current date: ${today()}
Attachments: ${email.attachmentNames.join(", ") || "none"}
Email:
${email.body}`;
}
SYSTEM is byte-identical on every call, so it caches in full; today(), the
attachment list, and the body (the only things that vary) sit after the
cached prefix, costing what they should and nothing more.
In practice. Reordering alone lifted the input cache-hit rate 41% → 51% and cut cost 9%, with provably identical output. The cheapest win in the whole guide: you're not asking the model to do anything different — you're letting the cache do its job.
This technique is "free" on the square: cost ⬇, accuracy and observability untouched. Always do it first.
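A cheap guard against regressions is a unit test that builds the prompt for two different inputs and checks how much of it is shared before the first differing character. The sketch below is an approximation under stated assumptions: `sharedPrefixLength` and `cacheablePrefixRatio` are hypothetical helpers, and they count characters, whereas providers cache in tokens and may apply minimum lengths or block sizes, so treat the result as a proxy rather than a bill.

```typescript
// Number of leading characters two prompts have in common.
function sharedPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Fraction of the shorter prompt that sits before the first differing character.
function cacheablePrefixRatio(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  return n === 0 ? 0 : sharedPrefixLength(a, b) / n;
}

// Illustration with a stand-in rule block of 100 characters.
const rules = "R".repeat(100);
const before = sharedPrefixLength(`Date: 1\n${rules}\nbody A`, `Date: 2\n${rules}\nbody B`);
const after = sharedPrefixLength(`${rules}\nDate: 1\nbody A`, `${rules}\nDate: 2\nbody B`);
console.log({ before, after }); // after covers the whole rule block, before stops at the date
```

Asserting a minimum shared length in CI catches the day someone adds a timestamp or user ID to the top of the prompt.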
Technique 3 — Output-aware design: the cost you forget, traded for the insight you don't need
Caching attacks the input. But the input is rarely where the money is — the
output is, and you can never cache it. On a reasoning model the bill is
dominated by reasoning_tokens (the model "thinking," billed at the full output
rate). You can't cache thinking. You can only ask for less.
This is the one technique that touches the Observability corner — so it's where the trade gets explicit. Two cuts:
Drop fields the consumer never reads. We asked for a human-readable reason on every call. On a reasoning model the rationale already exists in the thinking phase, so re-emitting it in the answer just re-pays for it in expensive output tokens.
Short-circuit the common case. ~80% of email is not an expense. For those, every field but is_invoice: false is dead weight. Return only that.
In practice. Asking for the minimum didn't just shrink the answer — telling a reasoning model to be terse made it reason less: output −34%, reasoning tokens −29%, cost another −18%.
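On the consumer side, the two cuts map naturally onto a discriminated union: the negative branch carries a single field, and the parser discards anything else the model emits. This is a sketch of that idea, not the pipeline's real schema. Only `is_invoice` appears in the text above; `amount`, `currency`, and the `parseScanResult` function are placeholders chosen for illustration.

```typescript
// Only `is_invoice` comes from the article; `amount` and `currency` are placeholder fields.
type ScanResult =
  | { is_invoice: false }
  | { is_invoice: true; amount: number; currency: string };

// Parse the model's JSON answer into one of the two branches, dropping extra keys.
function parseScanResult(raw: string): ScanResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  const obj = data as Record<string, unknown>;
  if (obj["is_invoice"] === false) {
    return { is_invoice: false }; // the common case: nothing else is read
  }
  const amount = obj["amount"];
  const currency = obj["currency"];
  if (obj["is_invoice"] === true && typeof amount === "number" && typeof currency === "string") {
    return { is_invoice: true, amount, currency };
  }
  throw new Error("response does not match either branch");
}
```

Because the negative branch ignores every other key, a stray `reason` in the response costs tokens but can never leak into downstream code, which makes it easy to spot in a token report.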
Here is the trade, stated honestly. Dropping reason cost us
observability — that string was the per-email explanation in our scan log
("Marketing newsletter," "loyalty points"). Without it the log reads generic
("Not an expense"). The principle we used to decide:
Keep observability where it's diagnostic. Pay for it where it's load-bearing. Don't bill the user for it on the hot path.
Where | Keep | Why |
|---|---|---|
Internal eval / debugging a sample | ✅ yes | we're reading it, on our own dime |
Production, every email, forever | ❌ no | cost with no consumer — the purest waste |
The harness can switch reason back on for internal runs. In production it was
cost without a reader — so it went. That's the move: not "remove
observability," but locate it — diagnostic detail belongs in the path you run a
thousand times, not the one you run a billion.
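The switch itself can be a one-line mode flag in whatever builds the output instruction. The names below (`Mode`, `requestedFields`, `outputInstruction`) and the instruction wording are assumptions made for this sketch; the only detail taken from the text is that `reason` is requested for internal runs and omitted in production.

```typescript
// "eval" = internal harness runs; "production" = the hot path.
type Mode = "production" | "eval";

// Fields requested in every mode. Only `is_invoice` and `reason` come from the article.
const BASE_FIELDS: readonly string[] = ["is_invoice"];

function requestedFields(mode: Mode): string[] {
  return mode === "eval" ? [...BASE_FIELDS, "reason"] : [...BASE_FIELDS];
}

// Hypothetical wording; the point is that `reason` appears only when someone will read it.
function outputInstruction(mode: Mode): string {
  return `Respond with JSON containing only these keys: ${requestedFields(mode).join(", ")}.`;
}
```

Since the mode is fixed per deployment, the instruction stays identical across calls within that deployment, so it does not disturb the cached prefix.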
Technique 4 — Batching: trade latency for half the bill
The first three techniques shrink the cost of each call. Batching changes the price of a call by changing how you send it: most providers run an asynchronous batch endpoint at roughly half price for work that doesn't need an answer this second.
Provider | Batch endpoint | Typical discount |
|---|---|---|
Google Gemini | Batch Mode | ~50% (2×) |
OpenAI | Batch API | ~50% |
Anthropic | Message Batches | ~50% |
Self-hosted (vLLM / TGI) | continuous batching | max GPU utilization |
The trade is latency, and it's a real one: a batch is only finished when every request in it is finished, so individual results arrive later. That's fatal for an interactive chat — and nearly free for anything asynchronous: a nightly job, a queue drain, an inbox scan. If no human is watching a spinner, batching is close to a pure cost win.
The scheduler is where it lives. If your pipeline already rate-limits its LLM calls through a queue — and at production volume it should — that queue is the natural place to batch. Instead of firing one call per item, the scheduler collects items until it has N of them (or T milliseconds pass), sends one batch request, and fans the results back out:
graph LR
R1["request"] --> Q
R2["request"] --> Q
R3["request"] --> Q
Q["scheduler — collect until N items or T ms"] --> B["one batch call — ~½ price"]
B --> O1["result"]
B --> O2["result"]
B --> O3["result"]
style R1 fill:#f7c47f,color:#000
style R2 fill:#f7c47f,color:#000
style R3 fill:#f7c47f,color:#000
style Q fill:#7fc4c4,color:#000
style B fill:#4f8cf7,color:#fff
style O1 fill:#7fc47f,color:#000
style O2 fill:#7fc47f,color:#000
style O3 fill:#7fc47f,color:#000
The same idea reaches past hosted APIs. On a self-hosted model, batching is how you actually use the GPU you're paying for: modern serving stacks (vLLM, TGI) do continuous batching out of the box, interleaving many requests through the forward pass instead of one at a time, but the throughput you realize still depends on tuning it (batch size, max tokens, how long you'll wait to fill a batch). Idle compute is the most expensive token of all.
Where it fits. Batching is the lever for high-volume, latency-tolerant work — exactly the shape of an async pipeline like the inbox scan, where a worker already drains a rate-limited queue (the scheduler is right there, waiting to be taught to batch). It's the one technique here that spends performance for cost instead of getting it free — so it earns its place only where latency is slack. Where it is, ~2× is on the table.
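The collect-until-N-or-T scheduler described in this section can be sketched as a small micro-batcher. Everything here is hypothetical: `MicroBatcher`, `SendBatch`, and the parameter names are made up for the example, and `sendBatch` stands in for whatever your provider's batch submission looks like. Hosted batch endpoints are usually asynchronous jobs that you poll for results, so a real `sendBatch` would wrap that submit-and-poll cycle.

```typescript
type SendBatch<I, O> = (items: I[]) => Promise<O[]>;
type Pending<I, O> = { item: I; resolve: (value: O) => void; reject: (reason: unknown) => void };

class MicroBatcher<I, O> {
  private pending: Pending<I, O>[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly sendBatch: SendBatch<I, O>;
  private readonly maxItems: number;
  private readonly maxWaitMs: number;

  constructor(sendBatch: SendBatch<I, O>, maxItems: number, maxWaitMs: number) {
    this.sendBatch = sendBatch;
    this.maxItems = maxItems;
    this.maxWaitMs = maxWaitMs;
  }

  // Queue one item; the promise settles when its batch comes back.
  submit(item: I): Promise<O> {
    return new Promise<O>((resolve, reject) => {
      this.pending.push({ item, resolve, reject });
      if (this.pending.length >= this.maxItems) {
        this.flush(); // N reached: send now
      } else if (this.timer === undefined) {
        this.timer = setTimeout(() => this.flush(), this.maxWaitMs); // or after T ms
      }
    });
  }

  // Send whatever is queued as one batch and fan the results back out in order.
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.sendBatch(batch.map(p => p.item)).then(
      results => {
        if (results.length !== batch.length) {
          batch.forEach(p => p.reject(new Error("batch result count mismatch")));
          return;
        }
        batch.forEach((p, i) => p.resolve(results[i] as O));
      },
      err => batch.forEach(p => p.reject(err)),
    );
  }
}
```

The two knobs encode the latency trade directly: a larger `maxItems` or `maxWaitMs` fills batches better and delays individual results longer.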
Putting it together: what to trade, what to refuse
Stack the three per-call techniques and the per-mailbox cost walked down like this — while throughput went up and accuracy never moved (batching, the fourth lever, we'd reach for next as volume grows):
step | change | $ / mailbox |
|---|---|---|
original | mistral-large | ~$0.66 (rate-capped) |
model | → deepseek-v4-pro | ~$0.52 |
cache | + static-prefix prompt | ~$0.47 |
output | + minimized output | ~$0.38 |
But the number isn't the lesson. The lesson is the posture you take on each corner — decided on purpose, proven with the test:
Corner | Our posture | The general rule |
|---|---|---|
🎯 Accuracy | inviolable | a cheaper pipeline that loses correctness isn't cheaper — it's broken |
⚡ Performance | kept — except latency | paid for concurrency to keep up; the one slice we'd spend is latency (batching), where it's slack |
🔍 Observability | spent — surgically | forfeited on the hot path, kept in the lab |
💰 Cost | the thing we pulled down | every multiple out is a multiple the user doesn't pay |
Your feature may answer differently. A medical or legal classifier might keep every reason token and eat the cost, because there the explanation is the product. The framework doesn't tell you what to value — it forces you to price each thing you value, so the trade is a decision you made, not a cost you absorbed by accident.
And there's a reason to pull cost down that has nothing to do with margin: inference cost is priced into what users pay. Every multiple you remove is a multiple off their bill — a cheaper product, a wider audience, a system more people can afford to run. Optimization, done right, isn't how you keep more. It's how you let more people in.
If you're the kind of engineer who read this far, there's a part of Guliel built for you: a fully typed REST API and an MCP server, so you can wire your own financial operations — issue documents, pull reports, reconcile and scan expenses — straight from your own code or an AI agent. The same discipline that keeps our inference honest is what keeps that surface clean enough to hand to you.
Explore the Guliel API & MCP →
— Sapir