How do I estimate the monthly cost of an LLM API?

Monthly cost = (monthly requests × input tokens per request × input price per million) + (monthly requests × output tokens per request × output price per million). The critical step most estimates skip is counting input and output tokens separately, because output is priced 3–5× higher than input on every major provider. Measure actual token distributions from at least 1,000 production requests (every API returns a usage object), then model three scenarios — current pricing, +50% price increase, and +200% — so a price shock does not break your unit economics.

Why do output tokens cost more than input tokens?

The transformer prefill phase processes the entire input in one parallel forward pass, which is computationally dense but fast per token. The decode phase generates output one token at a time in an autoregressive loop, running a full forward pass and reloading all model weights and the KV cache on every step. That memory-bandwidth bottleneck is why providers price output tokens roughly 3–5× higher than input — Claude 3.5 Sonnet is $3/M input versus $15/M output, a 5× ratio.

What is the single biggest lever for reducing LLM cost?

Model selection. The 2026 market has stratified into a budget tier ($0.10–$0.60/M input), a balanced tier ($1–$5/M), and a frontier/reasoning tier ($10–$60/M), with 10–50× cost differences between them. Using the cheapest model that reliably passes your task-specific evals is more impactful than any infrastructure optimization. A benchmark chatbot at 300K messages/month costs about $1,350/mo on Claude Sonnet versus about $112.50/mo on Claude Haiku — a 12× swing from tier choice alone.

When does it make sense to self-host an open-weight model?

As a rule of thumb, self-hosting begins to favor the economics when monthly API spend on a given tier exceeds roughly $5,000–$8,000 and you have a team that can run model-serving infrastructure (vLLM or TGI). Below $5,000/mo the operational overhead usually exceeds the savings; above $20,000/mo self-hosting is almost always justified if an open-weight model meets your quality bar. Frontier reasoning tasks are the exception — no open-weight equivalent to the top proprietary reasoning models exists at comparable quality yet.

How much can prompt caching save on LLM costs?

Caching discounts only the stable prefix of a prompt (a shared system prompt or fixed document context reused across calls), not the user-specific tokens that change every request. Anthropic charges about 10% of the standard rate on cache reads and about 125% of the standard rate (a 25% premium) on cache writes; OpenAI applies an automatic ~50% discount on cached prefixes of 1,024+ tokens. On a large, frequently reused system prompt at a high cache hit rate, that can cut the prefix portion of your bill by 50–90%. It can cost more than it saves for low-volume workloads where calls arrive less often than the cache TTL.

LLM API cost guide 2026: what every developer needs to know before they ship

Updated May 23, 2026 · By Byron Malone

TL;DR: A chatbot handling 10,000 messages/day (300K/month) at 500 input + 200 output tokens per message costs approximately $112.50/mo on Claude 3 Haiku, $1,350/mo on Claude 3.5 Sonnet, and $1,650/mo on GPT-4o at the rates listed in this guide. Model choice is your biggest single cost lever, often an 8–15× difference between tiers. Prompt caching and output compression are the next two levers. Consult your provider's current pricing page before any production cost planning.

Why LLM cost estimates are almost always wrong — and how to fix yours

Most developer cost estimates for LLM APIs fail in the same three ways: they conflate input and output token pricing, they rely on development token counts that don't match production, and they leave out parts of the API surface beyond the main model. The result is estimates that are off by 2–10× in either direction, usually underestimates in production.

The most common error is conflating input and output token pricing. Every major provider charges significantly more for output tokens than input tokens. Claude 3.5 Sonnet: input $3/M, output $15/M, a 5× ratio. GPT-4o: input $5/M, output $15/M, a 3× ratio. If your application generates substantial output (agents, summarization, code generation), an input-only estimate can understate your real cost several times over; the exact gap depends on your output-to-input token ratio.
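To make the size of that gap concrete, here is a minimal sketch that compares a full per-request cost with an input-only estimate. The function name is illustrative, and the 500 input / 200 output token profile is the chatbot profile used elsewhere in this guide; substitute your own measured counts.

```python
def underestimate_factor(input_tokens, output_tokens,
                         input_price_per_m, output_price_per_m):
    """Ratio of the true per-request cost to an input-only estimate."""
    input_cost = input_tokens * input_price_per_m
    output_cost = output_tokens * output_price_per_m
    return (input_cost + output_cost) / input_cost


# 500 input + 200 output tokens per request, priced at the rates quoted above.
sonnet_factor = underestimate_factor(500, 200, 3.00, 15.00)
gpt4o_factor = underestimate_factor(500, 200, 5.00, 15.00)

print(f"Claude 3.5 Sonnet: {sonnet_factor:.2f}x")  # 3.00x
print(f"GPT-4o: {gpt4o_factor:.2f}x")              # 2.20x
```

The factor grows with the share of output tokens, so output-heavy workloads such as agents or code generation sit well above these figures.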

The second error is using prompt token counts from development that don't match production. Developers typically test with short prompts and small outputs. Production adds: the full system prompt (often 500–2,000 tokens), conversation history accumulation across turns (growing context window), retrieved document chunks in RAG applications (1,000–10,000 tokens per query), and tool call schemas if using function calling. Martin Casado and Matt Bornstein (a16z, 2023, “Who Owns the Generative AI Platform?”) documented that production LLM applications routinely spend 60–80% of their inference budget on context that is the same across requests — an ideal target for caching, but only if you've measured it.

The third error is not counting the full API surface. Embeddings (for vector search / RAG), image processing (vision models), moderation calls, and fine-tuning runs all have separate pricing structures that are usually not on your main model's pricing page. This guide covers the main token-generation APIs; see your provider's full pricing page for the complete surface.

How to fix your estimates: (1) measure actual token distributions in production for at least 1,000 requests — both input and output separately; (2) log the token breakdown from your API responses (every provider returns this); (3) model three scenarios: current pricing × current volume, +50% AI price increase, +200% AI price increase. Per the Bedrocka AI Cost Resilience framework (AICR-006), deals should clear your hurdle rate at +50% to advance past planning.

The token math: input vs. output, context windows, and why output costs more

Tokens are the unit of billing for all major LLM APIs. A token is approximately 4 characters of English text — so 1,000 words ≈ 1,333 tokens, or 750 words ≈ 1,000 tokens. Non-English languages tokenize differently; some languages (Chinese, Japanese, Arabic) can be significantly denser in token count for the same semantic content, which affects cost estimates for multilingual applications.
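As a rough illustration of that rule of thumb, the sketch below converts character or word counts into token estimates. The helper names are made up for this example, and the ratios apply to English text only; treat the results as planning numbers, not billing numbers.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
TOKENS_PER_WORD = 4 / 3  # 750 words is roughly 1,000 tokens


def estimate_tokens_from_text(text: str) -> int:
    """Estimate tokens from character count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count."""
    return round(word_count * TOKENS_PER_WORD)


print(estimate_tokens_from_words(750))    # 1000
print(estimate_tokens_from_words(1_000))  # 1333
```

For non-English content, a real tokenizer or the usage data returned by the API is a safer basis than either ratio.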

Why output tokens cost more. The transformer attention mechanism has different computational demands for input (prefill phase) and output (decode phase). Prefill processes the entire input in a single parallel forward pass, which is computationally dense but fast per token. Decode generates one token at a time in an autoregressive loop, running a full forward pass per output token. The memory bandwidth requirements for decode (loading all model weights and the KV cache on every step) are the primary bottleneck. This cost structure is why output tokens are priced 3–5× higher: each output token ties up proportionally more GPU time than an input token. Per Chip Huyen's “AI Engineering” (O'Reilly, 2025), the decode phase is typically 3–5× more expensive in GPU memory bandwidth per token than the prefill phase.

Context window pricing tiers. Gemini 1.5 Pro and Gemini 1.5 Flash have a pricing tier at 128K tokens — inputs below 128K are priced at one rate, inputs above 128K at a higher rate. Gemini 1.5 Pro: $3.50/M input below 128K, $7/M above. Flash: $0.075/M below, $0.15/M above. This is a meaningful constraint for long-context applications (document analysis, long-session chat, RAG with large retrieved contexts). Claude and GPT-4o do not have a 128K pricing tier; context window billing is uniform up to the full context limit.
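A small sketch of how a 128K price break changes input cost. It assumes the whole prompt is billed at the higher rate once it exceeds the threshold, rather than only the tokens above it; that billing detail is an assumption here and should be checked against Google's pricing page. The function name is illustrative.

```python
def tiered_input_cost(prompt_tokens, rate_below, rate_above, threshold=128_000):
    """Input cost in dollars for one request under a two-tier price schedule.

    Assumption: a prompt longer than `threshold` is billed entirely at
    `rate_above` (rates are dollars per million tokens).
    """
    rate = rate_below if prompt_tokens <= threshold else rate_above
    return prompt_tokens * rate / 1_000_000


# Gemini 1.5 Pro rates quoted above: $3.50/M up to 128K, $7/M beyond.
short_prompt = tiered_input_cost(100_000, 3.50, 7.00)
long_prompt = tiered_input_cost(200_000, 3.50, 7.00)

print(f"100K-token prompt: ${short_prompt:.2f}")  # $0.35
print(f"200K-token prompt: ${long_prompt:.2f}")   # $1.40
```

Under this assumption, doubling the prompt from 100K to 200K tokens quadruples the input cost, which is why trimming retrieved context to stay under the threshold can matter.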

Token counting in code. OpenAI provides the tiktoken Python library for accurate token counting before API calls: tiktoken.get_encoding("cl100k_base") covers GPT-4 and GPT-3.5 Turbo, while GPT-4o and GPT-4o mini use the "o200k_base" encoding (tiktoken.encoding_for_model() picks the right encoding from a model name). Anthropic provides its own tokenizer. Google provides a countTokens API endpoint. For production cost estimation, always measure actual token counts from API responses (the usage object in the response), not pre-call estimates, because the model's actual tokenization may differ from your estimate.

Model pricing reference (May 2026 — verify at official pricing pages):

Claude 3.5 Sonnet:    $3.00/M input    $15.00/M output
Claude 3 Haiku:       $0.25/M input    $1.25/M output
GPT-4o:               $5.00/M input    $15.00/M output
GPT-4o mini:          $0.15/M input    $0.60/M output
Gemini 1.5 Pro:       $3.50/M input    $10.50/M output  (≤128K)
Gemini 1.5 Flash:     $0.075/M input   $0.30/M output   (≤128K)
Llama 3.1 70B (Groq): $0.59/M input    $0.79/M output
Mistral Large 2:      $3.00/M input    $9.00/M output

Sources: anthropic.com/api, platform.openai.com/pricing,
ai.google.dev/pricing, groq.com/pricing, mistral.ai/technology/#pricing

Model tiers in 2026: budget, balanced, frontier — and when each makes sense

As of May 2026, the LLM market has effectively stratified into three tiers with 10–50× cost differences between them. Understanding which tier is appropriate for each task type is the most powerful cost optimization decision you can make — more impactful than any infrastructure optimization.

Budget tier ($0.10–$0.60/M input): GPT-4o mini ($0.15/M), Claude 3 Haiku ($0.25/M), Gemini 1.5 Flash ($0.075/M), Llama 3.1 70B via Groq ($0.59/M). Appropriate for: classification, routing, extraction, simple Q&A, summarization of well-structured text, short-form generation where consistency is verified downstream. MMLU scores in the 75–82% range for this tier. Not appropriate for: complex multi-step reasoning, ambiguous instructions with edge cases, tasks requiring strong calibration, technical code generation in unfamiliar frameworks.
Balanced tier ($1–$5/M input): Claude 3.5 Sonnet ($3/M), GPT-4o ($5/M), Gemini 1.5 Pro ($3.50/M), Mistral Large 2 ($3/M). Appropriate for: customer-facing applications where quality is visible, code generation, agentic tasks with tool use, complex document analysis, multi-step reasoning chains. MMLU scores 85–90%. The best cost-to-quality tradeoff for most production AI features per Artificial Analysis composite benchmarks (artificialanalysis.ai, May 2026).
Frontier/reasoning tier ($10–$60/M input): o1, o3, claude-3-5-sonnet with extended thinking. Appropriate for: tasks with objectively verifiable correct answers where quality matters more than cost (math, formal verification, scientific reasoning, complex coding on novel problems). TTFT of 10,000–60,000ms makes these structurally unsuitable for synchronous UX. Cost at 10× the balanced tier is only justified if the reasoning premium translates to measurable task performance improvements for your specific task.

Chip Huyen documents in “AI Engineering” (O'Reilly, 2025) the pattern of tiered model selection for production AI systems: use the cheapest model that reliably solves the task, measure quality via task-specific evals, and only upgrade the tier when evals show the cheaper model is failing on production inputs. The common anti-pattern is starting with the best model (to avoid quality issues) and never measuring whether a cheaper model would have sufficed.
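The selection rule described above can be sketched as a filter followed by a minimum. The pass rates below are hypothetical placeholders, not benchmark results; in practice they would come from your own task-specific evals. Ranking by input price alone is also a simplification, and a blended input/output cost may be a better key for output-heavy tasks.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    input_price_per_m: float  # dollars per million input tokens
    eval_pass_rate: float     # hypothetical; measure this on your own evals


def cheapest_passing(candidates, min_pass_rate):
    """Return the cheapest candidate that meets the eval bar, or None."""
    passing = [c for c in candidates if c.eval_pass_rate >= min_pass_rate]
    if not passing:
        return None
    return min(passing, key=lambda c: c.input_price_per_m)


candidates = [
    Candidate("GPT-4o mini", 0.15, 0.91),        # pass rates are made up
    Candidate("Claude 3 Haiku", 0.25, 0.96),
    Candidate("Claude 3.5 Sonnet", 3.00, 0.99),
]

choice = cheapest_passing(candidates, min_pass_rate=0.95)
print(choice.name if choice else "no model passes")  # Claude 3 Haiku
```

Returning None when nothing passes keeps the "upgrade the tier" decision explicit instead of silently falling back to the most expensive model.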

The five cost optimization levers: caching, routing, context pruning, batch, and fine-tuning

After model selection, five levers reduce inference cost in approximate order of implementation effort vs. savings ratio:

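Prompt caching (2–4 hours, 50–90% savings on cached tokens). If your application sends the same system prompt or document context on every request, prompt caching is the highest-ROI optimization. Claude: explicit cache_control markers, cache read = $0.30/M vs. $3/M standard (90% off). OpenAI: automatic caching, cached prefix = $2.50/M vs. $5/M (50% off). Minimum cacheable prefix: 1,024 tokens on both. Break-even for Claude Sonnet caching: a cache write costs $3.75/M, a $0.75/M premium over the standard $3/M, and each cache read saves $2.70/M, so the premium is recovered after about 0.28 reads per write; in practice, one read within the cache TTL. (Counting the full $3.75/M write as overhead, rather than only the premium, gives the more conservative figure of 1.39 reads per write.) See the Prompt Caching Savings Calculator and the full caching guide on this site.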
Model routing (1–2 days, 40–80% cost reduction on routable volume). Route easy requests to budget-tier models and hard requests to frontier models. Requires a classification layer (which can itself be a cheap model or a rules-based router) and per-task evals to define “easy” vs. “hard.” Nathan Lambert (Hugging Face) and the Martian routing team have documented routing savings of 40–80% on mixed workloads where request difficulty is highly variable.
Context pruning (ongoing, 10–40% cost reduction). Audit your system prompt and conversation history for tokens that don't contribute to response quality: duplicate instructions, verbose formatting that could be compressed, full conversation history when a summary would suffice. In long-session applications, conversation history is the fastest-growing cost line — summarize every N turns to keep the context window bounded.
Batch API (same-day implementation, 50% discount on OpenAI). OpenAI Batch API processes requests at 50% of standard pricing with a 24-hour turnaround SLA. Anthropic Message Batches provides a similar 50% discount for asynchronous processing. For offline use cases (nightly data processing, document classification pipelines, bulk embedding generation), batch is a free money optimization — no quality change, half the cost.
Fine-tuning (1–2 weeks, use as a last resort). Fine-tuning smaller models to match the output quality of larger models on specific tasks can dramatically reduce per-call cost. The economics work when: the task is narrow and well-defined, you have 100+ labeled examples, and the volume justifies the fine-tuning investment. Fine-tuning adds operational complexity (model versioning, retraining on data drift) and is typically not the first optimization to reach for.
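Of the levers above, prompt caching is the easiest to model numerically. The sketch below compares the cost of a shared prefix with and without caching over one cache lifetime. It assumes Anthropic-style multipliers (writes at 1.25× the base input rate, reads at 0.10×) and assumes the cache write happens on a request you would have paid the standard rate for anyway, so only the write premium counts as overhead. Function names and the 2,000-token prefix are illustrative.

```python
def prefix_cost_per_window(prefix_tokens, reads, base_price_per_m,
                           write_multiplier=1.25, read_multiplier=0.10):
    """Dollar cost of a shared prefix over one cache lifetime.

    Models one cache write followed by `reads` cache hits.
    Returns (uncached_cost, cached_cost).
    """
    per_token = base_price_per_m / 1_000_000
    uncached = (1 + reads) * prefix_tokens * per_token
    cached = prefix_tokens * per_token * (write_multiplier + reads * read_multiplier)
    return uncached, cached


def break_even_reads(write_multiplier=1.25, read_multiplier=0.10):
    """Reads per write needed to recover the write premium."""
    return (write_multiplier - 1) / (1 - read_multiplier)


# A 2,000-token system prompt read 50 times per write at $3/M input.
uncached, cached = prefix_cost_per_window(2_000, 50, 3.00)
savings = 1 - cached / uncached

print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, savings {savings:.0%}")
print(f"break-even reads per write: {break_even_reads():.2f}")  # 0.28
```

With zero reads inside the TTL, the cached path costs 25% more than the uncached one, which is the low-volume failure mode to watch for.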

Estimating production costs before launch: a framework for realistic projections

Use this framework to build a production cost model before you have real traffic data:

Define your token budget per request. For each request type in your application: (a) system prompt token count (measure with your tokenizer); (b) average user input token count (estimate from your test inputs, add a buffer); (c) expected output token count (measure from development outputs, output tokens are the variable you have the least control over — add a larger buffer). Sum these for total tokens per request.
Model your request volume. Start with current traffic if available. For new applications, model three scenarios: P10 (pessimistic), P50 (base), P90 (optimistic). If you have user growth targets, model cost growth against them — LLM costs scale linearly with requests at fixed token-per-request count.
Apply the cost formula. Monthly cost = (monthly requests × input tokens per request × input price/M) + (monthly requests × output tokens per request × output price/M). For applications with caching: subtract the cost reduction from cached input tokens. Run the formula at the P10, P50, and P90 scenarios.
Stress-test against price change scenarios. Multiply your P50 monthly cost by 1.5 (50% price increase scenario) and by 3 (200% increase scenario). If the +50% scenario (1.5× your P50 cost) exceeds your cost tolerance, build in a model routing or caching optimization before launch, not after.
Benchmark case (chatbot). 10,000 messages/day (300K/month) × 500 input tokens × $3/M + 300K × 200 output tokens × $15/M = $450 + $900 = $1,350/mo on Claude Sonnet. On Claude Haiku: 300K × 500 × $0.25/M + 300K × 200 × $1.25/M = $37.50 + $75 = $112.50/mo. The 12× difference illustrates the tier selection decision.
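The formula and the price-shock step above can be sketched in a few lines. Function names are illustrative; the rates and the 300K requests/month chatbot profile come from the benchmark case, and the example assumes no caching or batch discount.

```python
def monthly_cost(monthly_requests, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Monthly cost in dollars; prices are dollars per million tokens."""
    input_cost = monthly_requests * input_tokens * input_price_per_m / 1_000_000
    output_cost = monthly_requests * output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


def price_shock_scenarios(base_cost):
    """Current pricing plus the +50% and +200% stress scenarios."""
    return {"current": base_cost, "+50%": base_cost * 1.5, "+200%": base_cost * 3.0}


sonnet = monthly_cost(300_000, 500, 200, 3.00, 15.00)
haiku = monthly_cost(300_000, 500, 200, 0.25, 1.25)

print(f"Claude 3.5 Sonnet: ${sonnet:,.2f}/mo")  # $1,350.00/mo
print(f"Claude 3 Haiku: ${haiku:,.2f}/mo")      # $112.50/mo
print(price_shock_scenarios(sonnet))
```

Running the same function over your P10, P50, and P90 request volumes gives the full scenario grid.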

Simon Willison (simonwillison.net) documents a useful production measurement practice: log the full usage object (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens) from every API response, compute daily and weekly cost totals, and alert when daily cost exceeds a threshold. Without this instrumentation, production cost surprises are almost inevitable.
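A minimal sketch of that instrumentation, assuming usage objects arrive as plain dictionaries with the four field names listed above. The class name, the threshold, and the cache write rate (taken as 1.25× the input rate) are assumptions for illustration; a real deployment would persist totals and send alerts rather than return a flag.

```python
from collections import defaultdict
from datetime import date

# Claude 3.5 Sonnet rates from this guide, in dollars per million tokens.
# The cache write rate is an assumption (1.25x the input rate).
PRICES_PER_M = {
    "input_tokens": 3.00,
    "output_tokens": 15.00,
    "cache_read_input_tokens": 0.30,
    "cache_creation_input_tokens": 3.75,
}


def request_cost(usage: dict) -> float:
    """Dollar cost of one response; missing or null fields count as zero."""
    total = sum((usage.get(field) or 0) * price
                for field, price in PRICES_PER_M.items())
    return total / 1_000_000


class DailyCostTracker:
    def __init__(self, daily_threshold: float):
        self.daily_threshold = daily_threshold
        self.totals = defaultdict(float)

    def record(self, usage: dict, day: date) -> bool:
        """Add one response's cost; return True once the day exceeds the threshold."""
        self.totals[day] += request_cost(usage)
        return self.totals[day] > self.daily_threshold


tracker = DailyCostTracker(daily_threshold=0.005)
today = date(2026, 5, 23)
usage = {"input_tokens": 500, "output_tokens": 200}

first_alert = tracker.record(usage, today)   # running total $0.0045
second_alert = tracker.record(usage, today)  # running total $0.0090
print(first_alert, second_alert)             # False True
```

Keying totals by date makes weekly rollups a simple sum over seven entries.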

When to switch from API to self-hosted: the real break-even math

Self-hosting open-weight LLMs (Llama, Mistral, Qwen) eliminates per-token API costs but introduces infrastructure costs, operational overhead, and quality tradeoffs. The decision is not purely economic — regulatory requirements (data sovereignty, PHI handling), latency requirements, and customization needs all factor in. But the economic break-even is the starting point.

Infrastructure cost estimate for self-hosting Llama 3.1 70B:

Minimum hardware: 2× NVIDIA A100 80GB ($5,000–$6,000/mo on AWS p4d.24xlarge on-demand; $2,000–$3,000/mo reserved). On-premises GPU servers at $30,000–$80,000 one-time cost per 2-GPU node, amortized over 3 years = ~$800–$2,200/mo.
Operational overhead: model serving infrastructure (vLLM, TGI, or custom), DevOps time for updates and maintenance, model version management. Estimate 0.25–0.5 FTE at loaded cost for a mid-size team.
Quality tradeoff: Llama 3.1 70B MMLU ≈ 82% vs. Claude Sonnet ≈ 88% (Artificial Analysis, May 2026). For tasks where this quality gap matters, self-hosting the open-weight model is not equivalent to paying for the API.

Break-even rule of thumb: if your monthly API spend on a given model tier exceeds $5,000–$8,000/mo and you have a team capable of running model serving infrastructure, the economics of self-hosting begin to favor it. Below $5,000/mo, the operational overhead typically exceeds the savings. Above $20,000/mo, self-hosting is almost always economically justified if quality requirements can be met with an open-weight model.
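The rule of thumb can be checked against your own numbers with a short sketch. The GPU and amortization figures come from the estimate above; the $15,000/month loaded FTE cost is a hypothetical placeholder, and the function names are illustrative.

```python
def amortized_monthly(hardware_cost, months=36):
    """Spread a one-time hardware purchase over its amortization period."""
    return hardware_cost / months


def self_host_monthly_cost(gpu_monthly, fte_fraction, loaded_fte_monthly):
    """Infrastructure plus the share of an engineer needed to run it."""
    return gpu_monthly + fte_fraction * loaded_fte_monthly


def monthly_savings(api_monthly_spend, self_host_cost):
    """Positive when self-hosting is cheaper than the API."""
    return api_monthly_spend - self_host_cost


# Reserved cloud GPUs at $2,500/mo plus 0.25 FTE at a hypothetical $15,000/mo.
self_host = self_host_monthly_cost(2_500, 0.25, 15_000)

print(f"self-hosting: ${self_host:,.0f}/mo")                              # $6,250/mo
print(f"at $8,000/mo API spend: {monthly_savings(8_000, self_host):+,.0f}")  # +1,750
print(f"at $4,000/mo API spend: {monthly_savings(4_000, self_host):+,.0f}")  # -2,250
print(f"on-prem node: ${amortized_monthly(30_000):,.0f} to ${amortized_monthly(80_000):,.0f}/mo")
```

Under these assumptions the crossover lands inside the $5,000–$8,000 band, but the result is sensitive to the FTE figure, and the comparison ignores the quality gap noted above.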

Andrej Karpathy (2023 NeurIPS talk and various tweets) has argued that frontier proprietary models will maintain a 1–2 generation lead over open-weight models for the foreseeable future, which means self-hosting is currently best suited for use cases where balanced-tier (not frontier) quality is sufficient. For tasks requiring frontier reasoning, the economic case for self-hosting is not yet strong because no open-weight equivalent to o1/o3 or Claude Sonnet exists at comparable performance.

The LLM API Cost Calculator on this site lets you model the break-even scenario directly: enter your expected monthly token volume, your current API cost, your estimated self-hosting infrastructure cost, and your quality-adjusted model tier — and compute the break-even month.

How I size a production LLM bill before launch — a worked example

When I model a new AI feature’s cost, I’ve found the estimate is almost always wrong in the same direction — too low — because teams price on input tokens and forget output dominates the bill. Worked example: a support chatbot doing 10,000 messages/day (300K/month) at 500 input and 200 output tokens per message. On Claude 3.5 Sonnet that is 300K × 500 × $3/M for input ($450) plus 300K × 200 × $15/M for output ($900), so about $1,350/mo — and the $900 output line is twice the input line. Run the identical workload on Claude Haiku ($0.25/M input, $1.25/M output) and it falls to roughly $112.50/mo. Same traffic, a 12× cost difference decided entirely by tier selection — which is why I always model the cheapest model that passes the task evals first, then stress-test the P50 number at +50% and +200% pricing before committing.

Assumptions: per-million token rates are dated examples (May 2026) and change frequently; verify each provider’s current pricing page before production planning. The example assumes uniform token counts per message, no prompt caching, and no batch discount; real distributions vary, so measure actual input and output tokens from your API usage objects across at least 1,000 requests. Figures are illustrative and are not financial advice.

The cost formula, the tier model, and the assumptions above are operationalized in the token-pricing methodology and the open-source calculator source on GitHub (packages/calc).

Frequently asked questions

Last reviewed by Byron Malone, 2026-05-23. Pricing figures sourced from official provider pricing pages. Model benchmark figures sourced from Artificial Analysis (artificialanalysis.ai). This article is not financial advice. LLM pricing changes frequently — verify at your provider's current pricing page before production cost planning.

By Byron Malone · Last verified May 2026 against Anthropic, OpenAI & Google AI provider pricing pages; Artificial Analysis benchmarks

Founder & Editor, Bedrocka Tools

Try the calculator

This article pairs with the LLM API Cost Calculator — model your own input/output token mix, request volume, and price-shock scenarios.