10 Token-Saving Prompting Techniques for AI Developers

May 31, 2026 · Tokenia Team · 9 min read

Every token you send to an LLM API is a billable unit. At GPT-5.4 prices of $2.50 per million input tokens, 200 unnecessary tokens in a system prompt, repeated across 500,000 daily requests, add up to 100 million wasted tokens and $250 per day (about $7,500 per month) for tokens that do nothing useful. This guide gives you ten concrete techniques you can implement today, most with a before/after example you can verify in Tokenia.
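The arithmetic behind an estimate like this is easy to check in a few lines. The sketch below is illustrative only: `wasted_spend` is a hypothetical helper, and the 30-day month is an assumption.

```python
def wasted_spend(extra_tokens: int, requests_per_day: int,
                 usd_per_million: float, days: int = 30) -> tuple[float, float]:
    """Return (daily, per-period) cost in USD of tokens that add no value."""
    daily_tokens = extra_tokens * requests_per_day
    daily_cost = daily_tokens / 1_000_000 * usd_per_million
    return daily_cost, daily_cost * days

# 200 extra tokens, 500,000 requests per day, $2.50 per million input tokens
daily, monthly = wasted_spend(200, 500_000, 2.50)
print(f"${daily:,.2f} per day, ${monthly:,.2f} per 30 days")
# $250.00 per day, $7,500.00 per 30 days
```

Plugging in your own request volume and price gives a quick sense of whether a given prompt is worth optimizing.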

Remove Filler Words and Politeness Padding

Saves 15–30% on system prompts

Models don't need "please", "kindly", "as an AI language model", or "I'd like you to". These phrases consume tokens without affecting output quality. Cut them ruthlessly.

# BEFORE — 47 tokens
Please carefully analyze the following customer feedback and
kindly provide a detailed sentiment analysis. As a helpful
AI assistant, make sure to be thorough.

# AFTER — 16 tokens
Analyze customer feedback. Return: sentiment (pos/neg/neutral),
key themes, confidence 0-1.

Use Structured Formats Instead of Prose Instructions

Saves 20–40% on instructions

Prose instructions like "Please make sure to always format your response in JSON with the following fields..." are verbose. Showing the schema directly is more concise and often more effective.

# BEFORE — 52 tokens
Please respond to the user's question in JSON format.
The JSON should have an "answer" field containing your response,
a "confidence" field with a number between 0 and 1, and a
"sources" array if you have any relevant sources.

# AFTER — 18 tokens
Reply JSON: {"answer":"...","confidence":0.0-1.0,"sources":[]}
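If the schema already lives in your code, the compact instruction can be generated rather than hand-written. This is a minimal sketch: `schema_instruction` is a hypothetical helper, and a range hint such as 0.0-1.0 is not valid JSON, so it would have to be added as plain text.

```python
import json

def schema_instruction(example: dict) -> str:
    """Render an example object as a one-line 'Reply JSON' instruction."""
    # separators=(",", ":") drops the spaces json.dumps inserts by default
    return "Reply JSON: " + json.dumps(example, separators=(",", ":"))

instruction = schema_instruction({"answer": "...", "confidence": 0.0, "sources": []})
print(instruction)
# Reply JSON: {"answer":"...","confidence":0.0,"sources":[]}
```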

Use Abbreviations in System Prompts

Saves 10–20% on repeated system prompts

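In long system prompts, define abbreviations once and use them throughout. "UI" instead of "user interface", "req" instead of "requirement", "KB" instead of "knowledge base". Models handle common abbreviations well, but test domain-specific ones: an ambiguous abbreviation can cost more in wrong answers than it saves in tokens.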

# BEFORE — 38 tokens
When the user submits a support ticket, check the knowledge base
for relevant documentation. If the knowledge base contains an
answer, provide it with a link to the knowledge base article.

# AFTER — 23 tokens
On support ticket: check KB for relevant docs.
If KB match found: answer + link KB article.
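One way to apply abbreviations consistently is to substitute them programmatically and prepend a one-line glossary. The sketch below assumes a hypothetical `abbreviate` helper and a hand-maintained glossary. It only pays off when a term repeats often enough to cover the cost of the glossary line.

```python
import re

def abbreviate(prompt: str, glossary: dict[str, str]) -> str:
    """Replace each long term with its abbreviation and prepend a short glossary."""
    used = []
    for term, abbr in glossary.items():
        # Whole-phrase, case-insensitive match so "Knowledge Base" is also replaced
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        prompt, count = pattern.subn(abbr, prompt)
        if count:
            used.append(f"{abbr}={term}")
    header = ("Abbrev: " + "; ".join(used) + "\n") if used else ""
    return header + prompt

text = "Check the knowledge base. If the knowledge base has an answer, link the knowledge base article."
short = abbreviate(text, {"knowledge base": "KB"})
print(short)
# Abbrev: KB=knowledge base
# Check the KB. If the KB has an answer, link the KB article.
```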

Compress Context with Rolling Summaries

Saves 40–70% on long conversations

As conversations grow, you re-send the entire history on every turn. After 6–8 turns, replace older messages with a compact summary. Use a cheap model (Gemini Flash) for summarization to minimize the cost of compression itself.

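async def compress_history(messages, max_tokens=1500):
    # count_tokens and call_llm are helpers you supply for your provider.
    # Skip short histories, and histories with nothing older than the last 3 turns.
    if count_tokens(messages) < max_tokens or len(messages) <= 7:
        return messages

    # Summarize everything between the system prompt and the last 3 user/assistant pairs
    to_summarize = messages[1:-6]
    # Render as plain "role: content" lines instead of a Python list repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in to_summarize)
    summary_prompt = f"Summarize this conversation in ≤100 words:\n{transcript}"

    summary = await call_llm(summary_prompt, model="gemini-2.5-flash")

    return [
        messages[0],  # system prompt
        {"role": "assistant", "content": f"[Prior context: {summary}]"},
        *messages[-6:],  # last 3 turns
    ]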
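The function above relies on a `count_tokens` helper that the article does not define. The sketch below shows one possible stand-in, plus a separate helper that makes the slicing explicit. Both `estimate_tokens` and `split_history` are hypothetical names, and the 4-characters-per-token ratio is only a rough rule of thumb for English text, so use a real tokenizer when accuracy matters.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Very rough estimate: about 4 characters per token (assumption)."""
    return sum(len(m["content"]) for m in messages) // 4

def split_history(messages: list[dict], keep_last: int = 6):
    """Split into (system message, older messages to summarize, recent messages to keep)."""
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_last:
        return system, [], rest
    return system, rest[:-keep_last], rest[-keep_last:]

# Build a 5-turn conversation after the system prompt
history = [{"role": "system", "content": "Support bot."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

system, older, recent = split_history(history)
print(len(older), len(recent))  # 4 6
```

Keeping the split separate from the summarization call makes the boundary logic easy to test without touching an API.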

Use Few-Shot Examples Efficiently

Saves 30–50% vs verbose examples

Few-shot examples are powerful but expensive if they're bloated. Use the shortest examples that still demonstrate the pattern. Remove explanatory text around the examples — the model infers the pattern from the input/output pairs alone.

# BEFORE — 78 tokens
Here are some examples of how to classify customer messages:
Example 1: When a customer says "My order hasn't arrived yet"
this should be classified as a "delivery" issue.
Example 2: When a customer says "I was charged twice"
this should be classified as "billing".

# AFTER — 28 tokens
Classify message type.
"My order hasn't arrived" → delivery
"I was charged twice" → billing
"{message}" →
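A compact prompt in this shape can be assembled from a list of pairs, which keeps the examples in data rather than in prose. This is a minimal sketch, and `few_shot_prompt` is a hypothetical helper name.

```python
def few_shot_prompt(task: str, examples: list[tuple[str, str]], message: str) -> str:
    """Build a compact few-shot prompt: task line, input → label pairs, open slot."""
    lines = [task]
    lines += [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{message}" →')  # the model completes this last line
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify message type.",
    [("My order hasn't arrived", "delivery"), ("I was charged twice", "billing")],
    "Where is my refund?",
)
print(prompt)
```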

Split Complex Prompts into Smaller Calls

Reduces error cost, enables model routing

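A single complex prompt that asks for 5 things often gets worse results than 5 focused prompts. More importantly, you can route sub-tasks to cheaper models. Extract → summarize → classify → generate can each be handled by the cheapest capable model. Keep in mind that splitting re-sends shared input with every call, so the total token count can go up; the saving comes from paying a lower per-token rate on most of those calls.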

# Instead of one expensive call that does everything:
# "Extract entities, classify sentiment, generate response,
#  translate to Spanish, and check for policy violations"

# Break it into routed calls:
entities   = await call(extract_prompt, model="gemini-flash")  # cheap
sentiment  = await call(classify_prompt, model="gemini-flash")  # cheap
response   = await call(generate_prompt, model="gpt-4o")        # expensive
spanish    = await call(translate_prompt, model="gemini-flash")  # cheap
safe       = await call(safety_prompt, model="gemini-flash")     # cheap
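Sub-tasks that do not depend on each other can also run concurrently, which offsets the extra latency of making several calls. The sketch below uses a stubbed `call` function that returns a canned string instead of contacting any provider. The model names and the `handle` function are placeholders, not real identifiers.

```python
import asyncio

async def call(prompt: str, model: str) -> str:
    """Hypothetical stand-in for a real client call; returns a canned string."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{model}] {prompt}"

async def handle(message: str) -> dict:
    # These three sub-tasks depend only on the incoming message, so run them together
    entities, sentiment, safe = await asyncio.gather(
        call(f"Extract entities: {message}", model="small-model"),
        call(f"Classify sentiment: {message}", model="small-model"),
        call(f"Check policy violations: {message}", model="small-model"),
    )
    # Generation needs the earlier results, and translation needs the reply
    reply = await call(f"Reply using {entities} / {sentiment}", model="large-model")
    spanish = await call(f"Translate to Spanish: {reply}", model="small-model")
    return {"entities": entities, "sentiment": sentiment, "safe": safe,
            "reply": reply, "spanish": spanish}

result = asyncio.run(handle("I was charged twice"))
print(result["spanish"])
```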

Trim Whitespace and Blank Lines

Saves 2–8% with zero effort

Blank lines, trailing spaces, and excessive indentation all add tokens, although tokenizers often merge runs of whitespace, so the gain per prompt is small. Even so, it costs almost nothing to implement and adds up across millions of calls.

import re

def compress_whitespace(prompt: str) -> str:
    # Remove trailing spaces on each line
    prompt = re.sub(r'[ \t]+$', '', prompt, flags=re.MULTILINE)
    # Collapse 2+ blank lines into one
    prompt = re.sub(r'\n{3,}', '\n\n', prompt)
    # Strip leading/trailing whitespace
    return prompt.strip()

# Example savings:
before = "  You are a helpful   \n\n\n\n  assistant.  \n  "
after  = compress_whitespace(before)
# before: ~12 tokens  →  after: ~6 tokens

Use Model-Specific Optimizations

Saves 10–25% with model-native features

Each model has features designed to reduce token overhead. Using them correctly means you get the same quality for fewer tokens:

OpenAI: Use response_format: {"type": "json_object"} instead of long formatting instructions. JSON mode is designed to return syntactically valid JSON, but the prompt must still mention JSON and name the fields you want
Anthropic: Use XML tags (<instructions>, <context>) for clearer structure that requires fewer explanatory words
Gemini: Use generationConfig.responseMimeType: "application/json" for direct JSON output without prose wrapping
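The three options above can be sketched as request fragments. These are plain dictionaries and strings that make no network calls. The payload shapes follow the public API documentation as understood at the time of writing, so check the current reference before relying on them; the prompt text and tag contents are made up for illustration.

```python
import json

prompt = "List three colors as JSON."

# OpenAI Chat Completions: JSON mode via response_format
openai_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": prompt}],
    "response_format": {"type": "json_object"},
}

# Gemini generateContent (REST): JSON output via responseMimeType
gemini_body = {
    "contents": [{"parts": [{"text": prompt}]}],
    "generationConfig": {"responseMimeType": "application/json"},
}

# Anthropic: XML-style tags are plain text inside the prompt, not an API parameter
anthropic_system = (
    "<instructions>Answer from context only.</instructions>\n"
    "<context>{docs}</context>"
)

print(json.dumps(openai_body["response_format"]))
# {"type": "json_object"}
```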

Cache System Prompts with Anthropic Prompt Caching

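Saves up to 90% on repeated system prompt costs

Anthropic's prompt caching lets you mark part of your prompt as cacheable. On subsequent requests within the cache lifetime, cached portions are read at $0.30/1M tokens instead of $3.00/1M, a 90% reduction. The first request pays a premium to write the cache, and prompts below a model-specific minimum length are not cached, so check the current documentation for both. This is especially valuable for long system prompts or large knowledge base injections.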

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # e.g. 4000 tokens
            "cache_control": {"type": "ephemeral"}  # cache for 5 min
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# First call: writes cache (billed above the normal input rate)
# Calls 2-N within the cache lifetime: read cache ($0.30/1M vs $3.00/1M)
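To see how the savings play out over many calls, the cost of the cached prefix can be estimated directly. This is a rough sketch: `prompt_cost` is a hypothetical helper, the 1.25x cache-write multiplier is an assumption to verify against current pricing, and it assumes every call arrives before the cache expires.

```python
def prompt_cost(cached_tokens: int, calls: int, base: float = 3.00,
                read: float = 0.30, write_multiplier: float = 1.25) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) in USD for the cached prefix only."""
    per_million = cached_tokens / 1_000_000
    uncached = calls * per_million * base
    # One cache write, then (calls - 1) cache reads
    cached = per_million * base * write_multiplier + (calls - 1) * per_million * read
    return uncached, cached

uncached, cached = prompt_cost(cached_tokens=4000, calls=1000)
print(f"without cache: ${uncached:.2f}, with cache: ${cached:.2f}")
# without cache: $12.00, with cache: $1.21
```

With only a handful of calls per cache lifetime the write premium eats into the saving, so the benefit grows with request volume.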

Measure Before and After with a Token Counter

Essential for tracking actual impact

All the techniques above are only useful if you measure their effect. Different models tokenize text differently — a prompt that's 150 tokens for GPT-4o might be 180 tokens for Claude. Before deploying any prompt optimization, verify the actual token counts.

# Use tiktoken for OpenAI-compatible counting:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

before_tokens = len(enc.encode(before_prompt))
after_tokens  = len(enc.encode(after_prompt))

print(f"Before: {before_tokens} tokens (${before_tokens/1e6 * 2.50:.6f}/call)")
print(f"After:  {after_tokens} tokens (${after_tokens/1e6 * 2.50:.6f}/call)")
print(f"Savings: {(before_tokens - after_tokens)/before_tokens:.0%}")

Or paste your prompts directly into Tokenia for instant cross-model token counts without writing any code.

Summary: Expected Savings by Technique

Technique	Typical Savings	Implementation Effort
Remove filler words	15–30%	Minutes
Structured formats	20–40%	Minutes
Abbreviations	10–20%	Minutes
Rolling summaries	40–70%	1–2 hours
Efficient few-shot	30–50%	30 minutes
Split complex prompts	Variable + quality gains	2–4 hours
Trim whitespace	2–8%	5 minutes
Model-specific features	10–25%	1 hour
Prompt caching (Anthropic)	Up to 90% on system prompt	30 minutes
Measure with token counter	Validates all above	Ongoing

Measure your savings with Tokenia

Paste your before and after prompts into Tokenia to see the exact token count difference and cost savings across every major model — free, instant, no sign-up.
