How to Reduce LLM API Costs by 80% in 2026

May 31, 2026 · Tokenia Team · 8 min read

LLM APIs are powerful but expensive at scale. A single SaaS product that makes 1 million API calls per month can easily spend $3,000–$15,000 depending on which model you use and how you structure your prompts. The good news: most teams are leaving 60–80% savings on the table through avoidable inefficiencies. This guide covers ten techniques you can implement this week.

Technique	Typical Savings	Effort
Prompt compression	20–35%	Low
Semantic caching	30–60%	Medium
Model routing	40–70%	Medium
Context pruning	15–40%	Medium
Batch requests	50% (via batch pricing)	Low
Output length control	10–30%	Low
Task-appropriate model selection	50–80%	Low
Embeddings optimization	20–40%	Medium
Streaming (UX, not cost)	0–10%	Low
Retry logic with jitter	5–15%	Low

1. Prompt Compression

Verbose instructions don't make models smarter — they just cost more. Most system prompts can be trimmed by 20–40% without losing task accuracy. Remove filler phrases, redundant context, and prose where structured instructions work better.

# BEFORE — 87 tokens
You are a helpful customer support assistant. Please always be
polite and professional. When a user asks you a question, make
sure you provide a clear and detailed answer. If you don't know
the answer, say so honestly.

# AFTER — 28 tokens
Customer support assistant. Be concise and accurate.
If unknown, say so.

That's a 68% reduction for the system prompt alone. Multiply that across every API call and the savings compound fast.
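
To see how that compounds, here is a rough sketch of the monthly arithmetic. The 87 and 28 token counts come from the example above; the call volume (1 million calls per month) and the $3.00 per 1M input tokens price are illustrative assumptions, so substitute your own figures.

```python
def monthly_prompt_savings(tokens_before: int, tokens_after: int,
                           calls_per_month: int,
                           usd_per_million_input: float) -> float:
    """Dollars saved per month by shrinking the system prompt."""
    saved_tokens = (tokens_before - tokens_after) * calls_per_month
    return saved_tokens / 1_000_000 * usd_per_million_input

# Assumed workload: 1M calls/month at $3.00 per 1M input tokens.
savings = monthly_prompt_savings(87, 28, 1_000_000, 3.00)
print(f"${savings:,.2f} saved per month")
```

The saving scales linearly with call volume and input price, so the same trim matters far more on a frontier model than on a small one.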

2. Semantic Caching

Semantic caching stores LLM responses and serves cached answers when a new query is semantically similar to a previous one — even if the wording differs. Tools like GPTCache or a simple Redis + embedding layer can hit cache rates of 30–60% on typical support chatbots.

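import numpy as np
from redis import Redis

r = Redis()

def semantic_cache_lookup(query_embedding, threshold=0.92):
    """Return a cached answer whose stored embedding is close enough, else None."""
    query = np.asarray(query_embedding, dtype=np.float32)
    # SCAN iterates incrementally; KEYS would block Redis on a large keyspace.
    # This is still a linear scan, so use a vector index once the cache grows.
    for key in r.scan_iter(match="cache:emb:*"):
        raw = r.get(key)
        if raw is None:  # key expired between the scan and the read
            continue
        stored = np.frombuffer(raw, dtype=np.float32)
        similarity = np.dot(query, stored) / (
            np.linalg.norm(query) * np.linalg.norm(stored)
        )
        if similarity >= threshold:
            answer_key = key.decode().replace("cache:emb:", "cache:ans:", 1)
            answer = r.get(answer_key)
            if answer is not None:
                return answer.decode()
    return None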
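
The lookup above only covers the read side. As a minimal sketch of both sides (store and lookup), here is an in-memory version that swaps Redis for a plain Python list. The class name `InMemorySemanticCache` and the three-dimensional toy vectors are hypothetical; real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math
from typing import Optional

def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemorySemanticCache:
    """Hypothetical stand-in for the Redis layer: a linear scan over stored vectors."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def store(self, embedding: list, answer: str) -> None:
        self.entries.append((embedding, answer))

    def lookup(self, embedding: list) -> Optional[str]:
        # Return the best match at or above the threshold, not just the first one.
        best_score, best_answer = -1.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score >= self.threshold and score > best_score:
                best_score, best_answer = score, answer
        return best_answer

cache = InMemorySemanticCache()
cache.store([1.0, 0.0, 0.0], "Reset it from Settings > Security.")
hit = cache.lookup([0.98, 0.05, 0.0])   # near-duplicate query vector
miss = cache.lookup([0.0, 1.0, 0.0])    # unrelated query vector
```

Picking the best match instead of the first one above the threshold costs nothing extra in a linear scan and avoids returning a merely adequate answer when a closer one exists.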

3. Model Routing

Not every task needs a frontier model like GPT-5.4. A simple classifier can route queries to cheap models for easy tasks and expensive models only for hard ones. At scale, this is the single highest-leverage optimization available.

def route_to_model(task_type: str) -> str:
    """Map a task label (from your classifier) to the cheapest suitable model."""
    routing = {
        "simple_qa":       "gemini-2.5-flash",    # $0.15/1M input
        "classification":  "gemini-2.5-flash",
        "summarization":   "claude-haiku-4-5",     # ~$0.80/1M input
        "code_generation": "gpt-4o",               # $2.50/1M input
        "reasoning":       "claude-sonnet-4-6",    # $3.00/1M input
        "analysis":        "claude-sonnet-4-6",
    }
    # Unknown task labels fall back to a general-purpose model
    return routing.get(task_type, "gpt-4o")

If 70% of your traffic is simple Q&A and you route it to a small model like GPT-5 nano ($0.05/1M) instead of a frontier model, you cut those call costs by 95%+.
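
The arithmetic behind that claim can be sketched as follows. The 70/30 split and both prices are the figures quoted above; the sketch assumes every call uses the same number of input tokens and ignores output tokens, so treat the result as a rough estimate.

```python
def blended_input_price(traffic_share: dict, usd_per_million: dict) -> float:
    """Average input price per 1M tokens, weighted by each model's share of traffic."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * usd_per_million[model]
               for model, share in traffic_share.items())

prices = {"small": 0.05, "frontier": 2.50}   # $/1M input tokens, from the text
all_frontier = blended_input_price({"frontier": 1.0}, prices)
routed = blended_input_price({"small": 0.70, "frontier": 0.30}, prices)
overall_saving = 1 - routed / all_frontier

print(f"routed: ${routed:.3f}/1M, overall saving: {overall_saving:.1%}")
```

The routed calls themselves get about 98% cheaper, but because 30% of traffic still goes to the frontier model, the overall input bill drops by roughly 69% under these assumptions.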

4. Context Pruning

Conversations get expensive because you send the full history on every turn. After 5–6 turns, consider replacing older messages with a rolling summary. A 20-turn conversation can have its context compressed from ~8,000 tokens to ~1,500 tokens without meaningful quality loss.

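async def prune_conversation(messages: list, max_tokens: int = 2000):
    # count_tokens and llm_summarize are helpers assumed to be defined elsewhere.
    if count_tokens(messages) <= max_tokens:
        return messages

    # Split by role so system prompts are kept wherever they appear
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    # Keep the last 4 messages verbatim; summarize everything older
    recent, older = dialogue[-4:], dialogue[:-4]
    if not older:
        return messages  # nothing left to summarize

    summary = await llm_summarize(older, model="gemini-2.5-flash")

    return system + [{"role": "assistant",
                      "content": f"[Summary: {summary}]"}] + recent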

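The pruning function relies on a `count_tokens` helper that the snippet does not define. A rough stand-in might look like the sketch below, assuming about four characters per token for English text plus a small per-message overhead. Both constants are rule-of-thumb assumptions, not tokenizer output, so use your provider's tokenizer when you need billing-accurate numbers.

```python
import math

CHARS_PER_TOKEN = 4        # assumed average for English text
PER_MESSAGE_OVERHEAD = 4   # assumed allowance for role and formatting tokens

def count_tokens(messages: list) -> int:
    """Rough token estimate for a list of {"role", "content"} messages."""
    return sum(
        math.ceil(len(m["content"]) / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD
        for m in messages
    )

history = [
    {"role": "system", "content": "Customer support assistant."},
    {"role": "user", "content": "My invoice shows the wrong amount."},
]
estimate = count_tokens(history)
```

An estimate like this is good enough for deciding when to prune, since the threshold itself is approximate.
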
5. Batch Requests

OpenAI's Batch API and Anthropic's Message Batches API offer 50% discounts for asynchronous workloads with up to 24-hour turnaround. If you're running nightly enrichment jobs, document processing pipelines, or evaluation runs, there is no reason not to use batch mode.

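# Anthropic Message Batches API: 50% off regular pricing
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# `documents` is assumed to be a list of strings loaded elsewhere.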

batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": doc}]
        }}
        for i, doc in enumerate(documents)
    ]
)
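
Large jobs usually need to be split across several batches, because providers cap how many requests one batch can hold. The sketch below builds request dicts in the same shape as the snippet above and slices them into chunks. The helper names and the chunk size are hypothetical; check your provider's current per-batch limits before choosing a real size.

```python
from typing import Iterator

def build_batch_requests(documents: list, model: str,
                         max_tokens: int = 256) -> list:
    """Build request dicts in the shape used by the snippet above."""
    return [
        {"custom_id": f"req-{i}",
         "params": {"model": model, "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": doc}]}}
        for i, doc in enumerate(documents)
    ]

def chunked(requests: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` requests."""
    for start in range(0, len(requests), size):
        yield requests[start:start + size]

docs = [f"document {n}" for n in range(5)]
reqs = build_batch_requests(docs, model="claude-sonnet-4-6")
batches = list(chunked(reqs, size=2))   # size=2 only to keep the demo small
```

Each chunk can then be submitted as its own batch, with `custom_id` used to match results back to the source documents.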

6. Output Length Control

Models tend to be verbose by default. Setting max_tokens caps the bill, but it truncates the reply when the limit is hit, so pair it with explicit instructions such as "Reply in under 100 words" or "Return JSON only, no explanation". Together these can cut output token costs by 10–30%.

# Instead of open-ended output:
# "Analyze this customer complaint and tell me what to do"

# Use constrained output:
system = """Classify customer complaint. Return JSON only:
{"category": "billing|technical|general",
 "priority": "high|medium|low",
 "action": "string max 15 words"}"""

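Constrained output only pays off if you check that the model actually followed the schema. A minimal validation sketch, assuming the reply is the raw JSON string returned for the prompt above; `parse_classification` is a hypothetical helper name.

```python
import json

ALLOWED = {
    "category": {"billing", "technical", "general"},
    "priority": {"high", "medium", "low"},
}

def parse_classification(raw: str) -> dict:
    """Parse the model's reply and reject anything outside the prompt's schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, allowed in ALLOWED.items():
        if data.get(field) not in allowed:
            raise ValueError(f"unexpected {field}: {data.get(field)!r}")
    action = data.get("action")
    if not isinstance(action, str) or len(action.split()) > 15:
        raise ValueError("action must be a string of at most 15 words")
    return data

reply = ('{"category": "billing", "priority": "high", '
         '"action": "Refund the duplicate charge"}')
result = parse_classification(reply)
```

Rejecting malformed replies early means a bad response can be retried or routed to a fallback before it reaches downstream code.
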
7. Task-Appropriate Model Selection

Use the cheapest model that reliably handles your task. Here's a practical hierarchy for 2026:

Gemini 2.5 Flash ($0.15/1M in) — classification, extraction, simple Q&A, translation
Claude Haiku 4.5 (~$0.80/1M in) — summarization, formatting, short-form generation
GPT-5.4 ($2.50/1M in) — complex coding, nuanced writing, multi-step reasoning
Claude Sonnet 4.6 ($3.00/1M in) — long-context analysis, coding, agentic tasks

8. Embeddings Optimization

If you're using embeddings for RAG (retrieval-augmented generation), the embedding model choice matters significantly. OpenAI's text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large at $0.13/1M tokens, with only marginal quality loss for most retrieval tasks. Also cache embeddings aggressively, since documents rarely change.

# Cache embeddings in SQLite to avoid re-computing
import hashlib, json, sqlite3
from openai import OpenAI

db = sqlite3.connect("emb_cache.db")  # file name is illustrative
db.execute("CREATE TABLE IF NOT EXISTS emb_cache (hash TEXT PRIMARY KEY, vec TEXT)")

def get_or_compute_embedding(text: str, client: OpenAI) -> list:
    # Key on a content hash so identical text is never embedded twice
    h = hashlib.sha256(text.encode()).hexdigest()
    row = db.execute("SELECT vec FROM emb_cache WHERE hash=?", (h,)).fetchone()
    if row:
        return json.loads(row[0])

    resp = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # $0.02/1M, not $0.13/1M
    )
    vec = resp.data[0].embedding
    db.execute("INSERT INTO emb_cache VALUES (?,?)", (h, json.dumps(vec)))
    db.commit()  # persist the new row
    return vec

9. Streaming for UX, Timeouts for Cost

Streaming itself doesn't reduce token costs: you pay for the same tokens either way. But it does let you stop early. If the user navigates away or the output is clearly going wrong, you can abort the stream, and where the provider halts generation on disconnect you avoid paying for output you no longer need. Also set a hard max_tokens budget per request to prevent runaway outputs.
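
A minimal sketch of that early-stop pattern, using a plain generator in place of a real SDK stream. `consume_stream`, `fake_stream`, and the guard condition are all hypothetical; whether aborting actually stops billing depends on the provider, so verify that against your usage reports.

```python
from typing import Callable, Iterable

def consume_stream(chunks: Iterable[str], should_stop: Callable[[str], bool],
                   max_chars: int = 2000) -> str:
    """Accumulate streamed text, stopping early when a guard trips."""
    text = ""
    for chunk in chunks:
        text += chunk
        if should_stop(text) or len(text) >= max_chars:
            break  # with a real SDK, also close the stream or connection here
    return text

def fake_stream():
    # Hypothetical stand-in for an SDK's streaming iterator.
    yield from ["Sure, ", "here is ", "the answer. ", "ERROR ", "more text"]

partial = consume_stream(fake_stream(), should_stop=lambda t: "ERROR" in t)
```

The guard runs on the accumulated text after every chunk, so the final chunk in this demo is never requested from the stream.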

10. Retry Logic with Exponential Backoff and Jitter

Naive retry loops can cause thundering herd problems where all retries hit simultaneously, burning tokens on requests that fail again. Proper backoff with jitter prevents this. Also, only retry on retriable errors (429 rate limit, 500 server error) — not on 400 validation errors, which you'll never recover from automatically.

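import random, time
# Exception names below assume the OpenAI Python SDK v1+; other SDKs differ.
from openai import BadRequestError, InternalServerError, RateLimitError

def call_with_retry(fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, InternalServerError):  # 429 and 5xx: retriable
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except BadRequestError:
            raise  # 400: the request itself is invalid, so retrying cannot help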

Putting It All Together

Combining these techniques is multiplicative, not additive. A team using model routing (60% savings) + prompt compression (30% savings) + semantic caching (40% cache hit rate) achieves roughly:

Effective cost = base_cost × (1 - 0.60) × (1 - 0.30) × (1 - 0.40) ≈ 17% of original

That's an 83% cost reduction, and the numbers above are conservative. Start with model routing for highest leverage, then add semantic caching, then compress prompts. Each step is measurable with a token counter like Tokenia.
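
The same calculation as a small helper, so you can plug in your own measured savings. It assumes each saving applies to the whole bill and that the techniques are independent, which is a simplification: prompt compression, for example, only affects input tokens.

```python
from functools import reduce

def remaining_cost_fraction(savings: list) -> float:
    """Fraction of the original bill left after applying each saving in turn."""
    return reduce(lambda left, s: left * (1 - s), savings, 1.0)

# Routing 60%, prompt compression 30%, cache hit rate 40% (figures from the text)
left = remaining_cost_fraction([0.60, 0.30, 0.40])
print(f"{left:.1%} of original cost, {1 - left:.1%} saved")
```

Because the factors multiply, each extra technique saves less in absolute terms than it would on its own, which is why the order of adoption matters.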

Pro tip: Before optimizing, measure first. Paste your actual prompts into Tokenia to see exactly how many tokens each prompt consumes and what it costs per model.
