How to Reduce LLM API Costs by 80% in 2026
LLM APIs are powerful but expensive at scale. A single SaaS product that makes 1 million API calls per month can easily spend $3,000–$15,000 depending on which model you use and how you structure your prompts. The good news: most teams are leaving 60–80% savings on the table through avoidable inefficiencies. This guide covers ten techniques you can implement this week.
| Technique | Typical Savings | Effort |
|---|---|---|
| Prompt compression | 20–35% | Low |
| Semantic caching | 30–60% | Medium |
| Model routing | 40–70% | Medium |
| Context pruning | 15–40% | Medium |
| Batch requests | 50% (via batch pricing) | Low |
| Output length control | 10–30% | Low |
| Task-appropriate model selection | 50–80% | Low |
| Embeddings optimization | 20–40% | Medium |
| Streaming (UX, not cost) | 0–10% | Low |
| Retry logic with jitter | 5–15% | Low |
1. Prompt Compression
Verbose instructions don't make models smarter — they just cost more. Most system prompts can be trimmed by 20–40% without losing task accuracy. Remove filler phrases, redundant context, and prose where structured instructions work better.
# BEFORE — 87 tokens
You are a helpful customer support assistant. Please always be
polite and professional. When a user asks you a question, make
sure you provide a clear and detailed answer. If you don't know
the answer, say so honestly.
# AFTER — 28 tokens
Customer support assistant. Be concise and accurate.
If unknown, say so.
That's a 68% reduction for the system prompt alone. Multiply that across every API call and the savings compound fast.
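To see how a per-call trim adds up, here is a minimal sketch of the arithmetic. The call volume and the $2.50 per 1M input-token price are illustrative assumptions (the price is the GPT-4o figure quoted later in this guide), and `monthly_prompt_savings` is a hypothetical helper name.

```python
def monthly_prompt_savings(tokens_before: int, tokens_after: int,
                           calls_per_month: int, price_per_million: float) -> float:
    """Dollars saved per month by shortening a prompt that is sent on every call."""
    tokens_saved = (tokens_before - tokens_after) * calls_per_month
    return tokens_saved * price_per_million / 1_000_000


# Assumed workload: 1M calls/month at $2.50 per 1M input tokens
savings = monthly_prompt_savings(87, 28, 1_000_000, 2.50)
print(f"${savings:,.2f} saved per month")
```

The saving scales linearly with call volume and with the input price, so the same trim is worth more on expensive models.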
2. Semantic Caching
Semantic caching stores LLM responses and serves cached answers when a new query is semantically similar to a previous one — even if the wording differs. Tools like GPTCache or a simple Redis + embedding layer can hit cache rates of 30–60% on typical support chatbots.
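import numpy as np
from redis import Redis

r = Redis()

def semantic_cache_lookup(query_embedding, threshold=0.92):
    """Return a cached answer whose stored embedding is close enough, else None."""
    query = np.asarray(query_embedding, dtype=np.float32)
    # scan_iter walks the keyspace incrementally; KEYS blocks Redis on large datasets
    for key in r.scan_iter(match="cache:emb:*"):
        raw = r.get(key)
        if raw is None:  # key expired between the scan and the read
            continue
        stored = np.frombuffer(raw, dtype=np.float32)
        # Cosine similarity between the query and the stored embedding
        similarity = np.dot(query, stored) / (
            np.linalg.norm(query) * np.linalg.norm(stored)
        )
        if similarity >= threshold:
            answer_key = key.decode().replace("emb:", "ans:", 1)
            answer = r.get(answer_key)
            if answer is not None:
                return answer.decode()
    return None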
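The lookup above covers only the read path. As a rough sketch of both halves (store and lookup) with no external services, the toy cache below keeps entries in memory. The class name, the 3-dimensional vectors, and the sample answer are all hypothetical; a real deployment would use embeddings from a model and a vector index rather than a linear scan.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy cache: stores (embedding, answer) pairs and returns the best match."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((embedding, answer))

    def get(self, embedding: list[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only serve the cached answer if the closest entry clears the threshold
        return best_answer if best_score >= self.threshold else None


cache = InMemorySemanticCache()
cache.put([1.0, 0.0, 0.0], "Reset it from Settings > Security.")
hit = cache.get([0.98, 0.05, 0.0])   # nearly the same direction as the stored vector
miss = cache.get([0.0, 1.0, 0.0])    # orthogonal to the stored vector
```

This version returns the closest entry rather than the first one above the threshold, which avoids serving a weaker match when several cached questions are similar.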
3. Model Routing
Not every task needs GPT-4o. A simple classifier can route queries to cheap models for easy tasks and expensive models only for hard ones. At scale, this is the single highest-leverage optimization available.
def route_to_model(task_type: str) -> str:
    """Map a task type to the cheapest model expected to handle it."""
    routing = {
        "simple_qa": "gemini-2.5-flash",            # $0.15/1M input
        "classification": "gemini-2.5-flash",
        "summarization": "claude-3-5-haiku-latest", # ~$0.80/1M input
        "code_generation": "gpt-4o",                # $2.50/1M input
        "reasoning": "claude-sonnet-4-6",           # $3.00/1M input
        "analysis": "claude-sonnet-4-6",
    }
    # Unknown task types fall back to a capable general-purpose model
    return routing.get(task_type, "gpt-4o")
If 70% of your traffic is simple Q&A and you route it to Gemini Flash instead of GPT-4o, the input-token cost of those calls drops by 94% ($0.15 vs. $2.50 per 1M tokens).
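The per-call saving and the overall saving are different numbers. A minimal sketch of the blended input price for a 70/30 traffic split follows, assuming every call uses the same number of input tokens regardless of route; `blended_input_price` is a hypothetical helper, and the prices are the input figures quoted in this section.

```python
def blended_input_price(traffic_share: dict[str, float],
                        price_per_million: dict[str, float]) -> float:
    """Traffic-weighted average input price in dollars per 1M tokens."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price_per_million[model]
               for model, share in traffic_share.items())


prices = {"gemini-2.5-flash": 0.15, "gpt-4o": 2.50}   # input prices quoted above
shares = {"gemini-2.5-flash": 0.70, "gpt-4o": 0.30}   # assumed traffic split

blended = blended_input_price(shares, prices)
savings = 1 - blended / prices["gpt-4o"]
print(f"blended: ${blended:.3f}/1M, savings vs. all GPT-4o: {savings:.0%}")
```

Under these assumptions the blended price is about $0.855 per 1M input tokens, roughly a 66% cut in input spend compared with sending everything to GPT-4o.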
4. Context Pruning
Conversations get expensive because you send the full history on every turn. After 5–6 turns, consider replacing older messages with a rolling summary. A 20-turn conversation can have its context compressed from ~8,000 tokens to ~1,500 tokens without meaningful quality loss.
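To put rough numbers on that growth, here is a minimal sketch that totals the context tokens sent over one conversation, with and without a cap. It assumes every turn adds a flat 400 tokens (so turn 20 carries the ~8,000 tokens mentioned above) and it ignores the cost of the summarization calls themselves; `total_input_tokens` is a hypothetical helper.

```python
from typing import Optional


def total_input_tokens(turns: int, tokens_per_turn: int,
                       context_cap: Optional[int] = None) -> int:
    """Sum of context tokens sent across all turns of one conversation."""
    total = 0
    for turn in range(1, turns + 1):
        context = turn * tokens_per_turn          # full history grows every turn
        if context_cap is not None:
            context = min(context, context_cap)   # pruning keeps it bounded
        total += context
    return total


full = total_input_tokens(20, 400)                       # no pruning
pruned = total_input_tokens(20, 400, context_cap=1500)   # context capped at 1,500
print(f"{full} vs. {pruned} tokens, saved: {1 - pruned / full:.0%}")
```

Because the full history is resent each turn, the uncapped total grows quadratically with conversation length, while the capped total grows linearly. One way to implement the pruning step itself: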
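async def prune_conversation(messages: list, max_tokens: int = 2000):
    # count_tokens and llm_summarize are helpers you supply
    # (a tokenizer wrapper and a call to a cheap summarization model)
    if count_tokens(messages) <= max_tokens:
        return messages
    # Split off system prompts so they are never summarized or duplicated
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= 4:
        return messages  # nothing old enough to summarize
    # Keep the last 4 messages verbatim and summarize everything before them
    middle, recent = rest[:-4], rest[-4:]
    summary = await llm_summarize(middle, model="gemini-2.5-flash")
    return system + [{"role": "assistant",
                      "content": f"[Summary: {summary}]"}] + recent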
5. Batch Requests
OpenAI's Batch API and Anthropic's Message Batches API offer 50% discounts for asynchronous workloads with up to 24-hour turnaround. If you're running nightly enrichment jobs, document processing pipelines, or evaluation runs, there is no reason not to use batch mode.
# Anthropic Message Batches API: 50% off regular pricing
import anthropic

client = anthropic.Anthropic()

# `documents` is your list of input strings, defined elsewhere
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": doc}]
        }}
        for i, doc in enumerate(documents)
    ]
)
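Providers cap how many requests a single batch may contain, so large jobs usually need to be split before submission. The sketch below builds request entries in the same shape as the example above and slices them into groups. Both helper names are hypothetical, and the chunk size of 10 is only for illustration; check your provider's documented per-batch limits.

```python
from typing import Iterator


def build_batch_requests(documents: list[str], model: str,
                         max_tokens: int = 256) -> list[dict]:
    """One request entry per document, keyed by a unique custom_id."""
    return [
        {
            "custom_id": f"req-{i}",   # used later to match results to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = [f"document {n}" for n in range(25)]
requests = build_batch_requests(docs, model="claude-sonnet-4-6")
batches = list(chunked(requests, size=10))   # each group becomes one batch submission
```

Building the `custom_id` values before chunking keeps them unique across all batches, which matters because batch results are not guaranteed to come back in submission order.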
6. Output Length Control
Models tend to be verbose by default. Explicitly setting max_tokens and including instructions like "Reply in under 100 words" or "Return JSON only, no explanation" can cut output token costs by 20–40%.
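Constrained output only pays off if downstream code can rely on it. As a minimal sketch, the validator below checks a JSON-only reply against a three-field schema (category, priority, and a short action string), the same fields the prompt in this section asks for. The function name and the sample reply are hypothetical.

```python
import json

ALLOWED = {
    "category": {"billing", "technical", "general"},
    "priority": {"high", "medium", "low"},
}


def parse_classification(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the requested schema."""
    data = json.loads(raw)   # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            raise ValueError(f"unexpected {field}: {data.get(field)!r}")
    action = data.get("action")
    if not isinstance(action, str) or len(action.split()) > 15:
        raise ValueError("action must be a string of at most 15 words")
    return data


reply = '{"category": "billing", "priority": "high", "action": "Refund the duplicate charge"}'
result = parse_classification(reply)
```

Rejecting malformed replies early is cheaper than letting them through: a failed validation can trigger one retry, whereas an open-ended answer costs extra output tokens on every call.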
# Instead of open-ended output:
# "Analyze this customer complaint and tell me what to do"
# Use constrained output:
system = """Classify customer complaint. Return JSON only:
{"category": "billing|technical|general",
"priority": "high|medium|low",
"action": "string max 15 words"}"""
7. Task-Appropriate Model Selection
Use the cheapest model that reliably handles your task. Here's a practical hierarchy for 2026:
- Gemini 2.5 Flash ($0.15/1M in) — classification, extraction, simple Q&A, translation
- Claude Haiku 3.5 (~$0.80/1M in) — summarization, formatting, short-form generation
- GPT-4o ($2.50/1M in) — complex coding, nuanced writing, multi-step reasoning
- Claude Sonnet 4.6 ($3.00/1M in) — long-context analysis, coding, agentic tasks
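"Cheapest model that reliably handles your task" can be turned into a simple selection rule once you have accuracy numbers from your own evaluation set. The sketch below is one way to express it. The prices are the input figures from the list above, while the accuracy values are made-up placeholders, not benchmark results; `cheapest_passing_model` is a hypothetical helper.

```python
def cheapest_passing_model(candidates: list[dict], min_accuracy: float) -> str:
    """Return the lowest-priced model whose measured accuracy meets the bar."""
    passing = [c for c in candidates if c["accuracy"] >= min_accuracy]
    if not passing:
        raise ValueError("no candidate meets the accuracy bar")
    return min(passing, key=lambda c: c["input_price"])["model"]


# Accuracy values are placeholders standing in for your own evaluation results.
candidates = [
    {"model": "Gemini 2.5 Flash", "input_price": 0.15, "accuracy": 0.91},
    {"model": "Claude Haiku 3.5", "input_price": 0.80, "accuracy": 0.94},
    {"model": "GPT-4o", "input_price": 2.50, "accuracy": 0.97},
    {"model": "Claude Sonnet 4.6", "input_price": 3.00, "accuracy": 0.97},
]

choice_lenient = cheapest_passing_model(candidates, min_accuracy=0.90)
choice_strict = cheapest_passing_model(candidates, min_accuracy=0.95)
```

The accuracy bar is a product decision: raising it from 0.90 to 0.95 in this toy data moves the choice from the cheapest model to one that costs many times more per input token.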
8. Embeddings Optimization
If you're using embeddings for RAG (retrieval-augmented generation), the embedding model choice matters significantly. OpenAI's text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large at $0.13/1M tokens, with only marginal quality loss for most retrieval tasks. Also cache embeddings aggressively, since documents rarely change.
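# Cache embeddings in SQLite to avoid re-computing
import hashlib, json, sqlite3
from openai import OpenAI

# The database filename is an example; any SQLite path works
db = sqlite3.connect("emb_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS emb_cache (hash TEXT PRIMARY KEY, vec TEXT)")

def get_or_compute_embedding(text: str, client: OpenAI) -> list:
    # Key the cache on a hash of the exact input text
    h = hashlib.sha256(text.encode()).hexdigest()
    row = db.execute("SELECT vec FROM emb_cache WHERE hash=?", (h,)).fetchone()
    if row:
        return json.loads(row[0])
    resp = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # $0.02/1M, not $0.13/1M
    )
    vec = resp.data[0].embedding
    db.execute("INSERT OR REPLACE INTO emb_cache VALUES (?,?)", (h, json.dumps(vec)))
    db.commit()  # persist so the cache survives restarts
    return vec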
9. Streaming for UX, Timeouts for Cost
Streaming itself doesn't reduce token costs — you pay for the same tokens either way. But it does let you implement early stopping: if the user navigates away or the output becomes clearly wrong, you can abort the stream and avoid paying for unneeded output tokens. Set a hard max_tokens budget per request to prevent runaway outputs.
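The early-stopping idea can be sketched without any SDK. Below, `fake_stream` stands in for a provider's streaming response and `consume_stream` stops reading as soon as a stop rule fires or a size budget is hit. All names and the stop rule are hypothetical. With a real SDK you would also need to close the underlying stream or connection, and whether the provider stops billing immediately depends on the provider.

```python
from typing import Callable, Iterable


def consume_stream(chunks: Iterable[str], should_stop: Callable[[str], bool],
                   max_chars: int = 2000) -> str:
    """Collect streamed text, stopping early on a stop signal or a size budget."""
    collected = ""
    for chunk in chunks:
        collected += chunk
        if should_stop(collected) or len(collected) >= max_chars:
            break   # leaving the loop stops reading further chunks
    return collected


def fake_stream():
    """Stand-in for an SDK stream; a real one would yield text deltas."""
    for word in ["Sure, ", "here ", "is ", "a ", "very ", "long ", "answer"]:
        yield word


# Hypothetical stop rule: give up once 12 or more characters have arrived
text = consume_stream(fake_stream(), should_stop=lambda t: len(t) >= 12)
```

In practice the stop rule would check something meaningful, such as a cancelled request flag or a reply that fails to start with the expected JSON brace.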
10. Retry Logic with Exponential Backoff and Jitter
Naive retry loops can cause thundering herd problems where all retries hit simultaneously, burning tokens on requests that fail again. Proper backoff with jitter prevents this. Also, only retry on retriable errors (429 rate limit, 500 server error) — not on 400 validation errors, which you'll never recover from automatically.
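import time, random
# Assumes the OpenAI Python SDK v1+, where InvalidRequestError was renamed BadRequestError
from openai import RateLimitError, InternalServerError, BadRequestError

def call_with_retry(fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, InternalServerError):  # 429 and 5xx are retriable
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except BadRequestError:
            raise  # Don't retry client errors (400)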
Putting It All Together
Combining these techniques is multiplicative, not additive. A team using model routing (60% savings) + prompt compression (30% savings) + semantic caching (40% cache hit rate) achieves roughly:
Effective cost = base_cost × (1 - 0.60) × (1 - 0.30) × (1 - 0.40) ≈ 17% of original
That's roughly an 83% cost reduction, and the numbers above are conservative. Start with model routing for highest leverage, then add semantic caching, then compress prompts. Each step is measurable with a token counter like Tokenia.
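The multiplicative rule generalizes to any number of techniques. A minimal sketch, assuming each saving applies independently to whatever cost remains after the previous one; `remaining_cost_fraction` is a hypothetical helper.

```python
from math import prod


def remaining_cost_fraction(savings: list[float]) -> float:
    """Fraction of the original cost left after applying each saving in turn."""
    return prod(1 - s for s in savings)


# Routing 60%, prompt compression 30%, cache hit rate 40% (figures from above)
remaining = remaining_cost_fraction([0.60, 0.30, 0.40])
print(f"remaining cost: {remaining:.1%}, reduction: {1 - remaining:.1%}")
```

The product evaluates to about 16.8% of the original cost. Adding the three percentages instead would give 130%, which is why the savings cannot simply be summed.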
Pro tip: Before optimizing, measure first. Paste your actual prompts into Tokenia to see exactly how many tokens each prompt consumes and what it costs per model.