10 Token-Saving Prompting Techniques for AI Developers

Every token you send to an LLM API is a billable unit. At GPT-4o prices of $2.50 per million input tokens, a bloated 200-token system prompt repeated across 500,000 daily requests adds 100 million tokens a day: an extra $250 per day, or roughly $7,500 per month, for tokens that do nothing useful. This guide gives you ten concrete techniques you can implement today, each with an example you can verify in Tokenia.
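The cost of padding scales linearly with prompt length and request volume. A minimal sketch of the arithmetic, using the price and volume quoted above; the function name and the 30-day month are assumptions made for illustration:

```python
def wasted_cost_usd(extra_tokens: int, requests_per_day: int,
                    price_per_million: float, days: int = 30) -> float:
    """Cost of sending `extra_tokens` unnecessary input tokens on every request."""
    tokens = extra_tokens * requests_per_day * days
    return tokens / 1_000_000 * price_per_million

daily = wasted_cost_usd(200, 500_000, 2.50, days=1)
monthly = wasted_cost_usd(200, 500_000, 2.50)
print(f"${daily:,.2f} per day, ${monthly:,.2f} per month")
# $250.00 per day, $7,500.00 per month
```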

01

Remove Filler Words and Politeness Padding

Saves 15–30% on system prompts

Models don't need "please", "kindly", "as an AI language model", or "I'd like you to". These phrases consume tokens without affecting output quality. Cut them ruthlessly.

# BEFORE — 47 tokens
Please carefully analyze the following customer feedback and
kindly provide a detailed sentiment analysis. As a helpful
AI assistant, make sure to be thorough.

# AFTER — 16 tokens
Analyze customer feedback. Return: sentiment (pos/neg/neutral),
key themes, confidence 0-1.
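
Part of this cleanup can be automated for prompts you maintain in bulk. The sketch below is one possible approach, not a complete solution: the phrase list is hypothetical and deliberately short, and a regex pass cannot restructure a prompt the way the manual rewrite above does. Apply it only to your own instruction text and review the result.

```python
import re

# Hypothetical filler list; extend it with phrases from your own prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bas an? (?:helpful )?ai (?:language model|assistant),?",
    r"\bi(?:'d| would) like you to\b",
]

def strip_filler(prompt: str) -> str:
    """Remove common politeness padding and tidy the leftover spacing."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    prompt = re.sub(r"[ \t]{2,}", " ", prompt)         # collapse doubled spaces
    prompt = re.sub(r"[ \t]+([.,;:])", r"\1", prompt)  # no space before punctuation
    return prompt.strip()

cleaned = strip_filler("Please summarize the report and kindly list three risks.")
print(cleaned)
# summarize the report and list three risks.
```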
02

Use Structured Formats Instead of Prose Instructions

Saves 20–40% on instructions

Prose instructions like "Please make sure to always format your response in JSON with the following fields..." are verbose. Showing the schema directly is more concise and often more effective.

# BEFORE — 52 tokens
Please respond to the user's question in JSON format.
The JSON should have an "answer" field containing your response,
a "confidence" field with a number between 0 and 1, and a
"sources" array if you have any relevant sources.

# AFTER — 18 tokens
Reply JSON: {"answer":"...","confidence":0.0-1.0,"sources":[]}
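
If the schema already exists as a Python object, the hint can be generated rather than typed by hand. A small sketch, assuming a hypothetical helper named schema_hint; it relies on the standard library's json.dumps with compact separators so that no spaces are spent after commas and colons:

```python
import json

def schema_hint(example: dict) -> str:
    """Serialize an example response with no whitespace between JSON tokens."""
    return "Reply JSON: " + json.dumps(example, separators=(",", ":"))

hint = schema_hint({"answer": "...", "confidence": 0.9, "sources": []})
print(hint)
# Reply JSON: {"answer":"...","confidence":0.9,"sources":[]}
```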
03

Use Abbreviations in System Prompts

Saves 10–20% on repeated system prompts

In long system prompts, define abbreviations once and use them throughout: "UI" instead of "user interface", "req" instead of "requirement", "KB" instead of "knowledge base". Models generally handle common abbreviations well; spell out any that are specific to your domain the first time they appear.

# BEFORE — 38 tokens
When the user submits a support ticket, check the knowledge base
for relevant documentation. If the knowledge base contains an
answer, provide it with a link to the knowledge base article.

# AFTER — 23 tokens
On support ticket: check KB for relevant docs.
If KB match found: answer + link KB article.
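
For long prompts, the substitution can be scripted. The following is a sketch under assumptions: abbreviate and its glossary argument are hypothetical names, and the one-line "Abbrev:" header is just one way to define terms once. The header itself costs tokens, so this only pays off when a term repeats several times.

```python
import re

def abbreviate(prompt: str, glossary: dict[str, str]) -> str:
    """Replace each long term with its abbreviation and prepend one definition line."""
    used = []
    for term, short in glossary.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        prompt, count = pattern.subn(short, prompt)
        if count:  # only define abbreviations that actually occur
            used.append(f"{short}={term}")
    header = f"Abbrev: {', '.join(used)}\n" if used else ""
    return header + prompt

text = "Check the knowledge base. If the knowledge base has an answer, link it."
short_text = abbreviate(text, {"knowledge base": "KB", "user interface": "UI"})
print(short_text)
# Abbrev: KB=knowledge base
# Check the KB. If the KB has an answer, link it.
```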
04

Compress Context with Rolling Summaries

Saves 40–70% on long conversations

As conversations grow, you re-send the entire history on every turn. After 6–8 turns, replace older messages with a compact summary. Use a cheap model (Gemini Flash) for summarization to minimize the cost of compression itself.

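async def compress_history(messages, max_tokens=1500, keep_last=6):
    # count_tokens and call_llm are your own helpers (tokenizer + API wrapper).
    # Skip compression if the history is small, or if there is nothing
    # between the system prompt and the last 3 turns to summarize.
    if count_tokens(messages) < max_tokens or len(messages) <= keep_last + 1:
        return messages

    # Summarize everything except the system prompt and the last 3 user/assistant pairs
    to_summarize = messages[1:-keep_last]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in to_summarize)
    summary_prompt = f"Summarize this conversation in ≤100 words:\n{transcript}"

    summary = await call_llm(summary_prompt, model="gemini-2.5-flash")

    return [
        messages[0],  # system prompt
        {"role": "assistant", "content": f"[Prior context: {summary}]"},
        *messages[-keep_last:],  # last 3 turns
    ]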
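
The function above relies on a count_tokens helper that this guide does not define. One rough way to fill it in is sketched below. It assumes about 4 characters per token for English text plus a few tokens of per-message overhead; both numbers are approximations chosen for illustration, so swap in a real tokenizer when you need billing-grade counts.

```python
import math

def count_tokens(messages: list[dict], chars_per_token: float = 4.0,
                 per_message_overhead: int = 4) -> int:
    """Rough token estimate for a chat history.

    Assumption: ~4 characters per token, plus a fixed overhead per message
    for role and formatting tokens. This is a heuristic, not a tokenizer.
    """
    total = 0
    for message in messages:
        total += math.ceil(len(message["content"]) / chars_per_token)
        total += per_message_overhead
    return total

history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My order hasn't arrived yet."},
]
estimate = count_tokens(history)
print(estimate)
```

An estimate like this is good enough to decide when to trigger compression, since the threshold itself is a judgment call.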
05

Use Few-Shot Examples Efficiently

Saves 30–50% vs verbose examples

Few-shot examples are powerful but expensive if they're bloated. Use the shortest examples that still demonstrate the pattern. Remove explanatory text around the examples — the model infers the pattern from the input/output pairs alone.
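
When examples live in code or a dataset, a small builder keeps every prompt in that compact input/output shape. This is a sketch only: few_shot_prompt is a hypothetical helper, and the arrow separator is one convention among several that work.

```python
def few_shot_prompt(task: str, examples: list[tuple[str, str]], message: str) -> str:
    """Build a compact few-shot prompt: one instruction line, one line per example."""
    lines = [task]
    lines += [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{message}" →')  # left open so the model completes the label
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify message type.",
    [("My order hasn't arrived", "delivery"), ("I was charged twice", "billing")],
    "Where is my refund?",
)
print(prompt)
```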

# BEFORE — 78 tokens
Here are some examples of how to classify customer messages:
Example 1: When a customer says "My order hasn't arrived yet"
this should be classified as a "delivery" issue.
Example 2: When a customer says "I was charged twice"
this should be classified as "billing".

# AFTER — 28 tokens
Classify message type.
"My order hasn't arrived" → delivery
"I was charged twice" → billing
"{message}" →
06

Split Complex Prompts into Smaller Calls

Reduces error cost, enables model routing

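A single complex prompt that asks for 5 things often gets worse results than 5 focused prompts. More importantly, you can route sub-tasks to cheaper models: extraction, summarization, classification, and generation can each be handled by the cheapest capable model. Splitting re-sends any shared input with each call, so the saving comes from cheaper per-token rates and fewer costly retries rather than from a lower total token count.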

# Instead of one expensive call that does everything:
# "Extract entities, classify sentiment, generate response,
#  translate to Spanish, and check for policy violations"

# Break it into routed calls:
entities   = await call(extract_prompt, model="gemini-flash")  # cheap
sentiment  = await call(classify_prompt, model="gemini-flash")  # cheap
response   = await call(generate_prompt, model="gpt-4o")        # expensive
spanish    = await call(translate_prompt, model="gemini-flash")  # cheap
safe       = await call(safety_prompt, model="gemini-flash")     # cheap
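
The calls above run one after another. Sub-tasks that do not depend on each other can run concurrently, which keeps latency close to that of a single call. A sketch using asyncio.gather; the call function here is a hypothetical stand-in that only echoes its inputs, and the model names are the placeholders used above:

```python
import asyncio

async def call(prompt: str, model: str) -> str:
    """Hypothetical stand-in for a real API wrapper; it only echoes its inputs."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{model}] {prompt}"

async def handle(message: str) -> dict:
    # Extraction and classification only need the raw message, so run them together.
    entities, sentiment = await asyncio.gather(
        call(f"Extract entities: {message}", model="gemini-flash"),
        call(f"Classify sentiment: {message}", model="gemini-flash"),
    )
    # Generation depends on both results, so it runs afterwards on the stronger model.
    response = await call(f"Reply using {entities} and {sentiment}", model="gpt-4o")
    # Translation and the policy check both depend only on the generated reply.
    spanish, safe = await asyncio.gather(
        call(f"Translate to Spanish: {response}", model="gemini-flash"),
        call(f"Check policy violations: {response}", model="gemini-flash"),
    )
    return {"response": response, "spanish": spanish, "safe": safe}

result = asyncio.run(handle("I was charged twice"))
print(result["response"])
```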
07

Trim Whitespace and Blank Lines

Saves 2–8% with zero effort

Blank lines, trailing spaces, and excessive indentation all consume tokens without carrying any meaning. The savings are small, but the fix costs almost nothing to implement and adds up across millions of calls.

import re

def compress_whitespace(prompt: str) -> str:
    # Remove trailing spaces on each line
    prompt = re.sub(r'[ \t]+$', '', prompt, flags=re.MULTILINE)
    # Collapse 2+ blank lines into one
    prompt = re.sub(r'\n{3,}', '\n\n', prompt)
    # Strip leading/trailing whitespace
    return prompt.strip()

# Example savings:
before = "  You are a helpful   \n\n\n\n  assistant.  \n  "
after  = compress_whitespace(before)
# before: ~12 tokens  →  after: ~6 tokens
08

Use Model-Specific Optimizations

Saves 10–25% with model-native features

Each model has features designed to reduce token overhead. Using them correctly means you get the same quality for fewer tokens:

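  • OpenAI: Set response_format: {"type": "json_object"} instead of describing the output format at length. JSON mode still requires the word "JSON" to appear somewhere in your messages, but a one-line schema is enough, and the output is constrained to valid JSON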
  • Anthropic: Use XML tags (<instructions>, <context>) for clearer structure that requires fewer explanatory words
  • Gemini: Use generationConfig.responseMimeType: "application/json" for direct JSON output without prose wrapping
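
To make the three options concrete, the sketch below builds the relevant request pieces as plain Python data, without calling any API. The function names are hypothetical; the field names follow the OpenAI Chat Completions and Gemini generateContent request bodies as I understand them, so check them against the current API references before relying on them.

```python
def openai_json_request(model: str, system: str, user: str) -> dict:
    """Chat Completions body with JSON mode enabled."""
    return {
        "model": model,
        "messages": [
            # JSON mode expects the word "JSON" somewhere in the messages.
            {"role": "system", "content": system + " Reply in JSON."},
            {"role": "user", "content": user},
        ],
        "response_format": {"type": "json_object"},
    }

def gemini_json_request(user: str) -> dict:
    """generateContent body asking for a JSON response."""
    return {
        "contents": [{"role": "user", "parts": [{"text": user}]}],
        "generationConfig": {"responseMimeType": "application/json"},
    }

def anthropic_tagged_prompt(instructions: str, context: str) -> str:
    """Wrap prompt sections in XML tags instead of describing them in prose."""
    return (f"<instructions>{instructions}</instructions>\n"
            f"<context>{context}</context>")

body = openai_json_request("gpt-4o", "Classify sentiment.", "Great product!")
tagged = anthropic_tagged_prompt("Answer from context only.", "Refunds take 5 days.")
print(body["response_format"], tagged, sep="\n")
```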
09

Cache System Prompts with Anthropic Prompt Caching

Saves 90% on repeated system prompt costs

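Anthropic's prompt caching lets you mark a prefix of your prompt as cacheable. On subsequent requests within the cache lifetime, the cached portion is read at $0.30/1M tokens instead of the $3.00/1M base input price for Claude Sonnet, a 90% reduction. The first request pays a premium to write the cache (1.25× the base input price for the default 5-minute cache), so caching only pays off when the same prefix is reused. This is especially valuable for long system prompts or large knowledge base injections.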

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # e.g. 4000 tokens
            "cache_control": {"type": "ephemeral"}  # cache for 5 min
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# First call: writes cache (1.25x the base input price)
# Calls 2-N: reads cache ($0.30/1M vs $3.00/1M)
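
To see how much caching saves for a given volume, the arithmetic can be sketched as follows. The function name is hypothetical, and the multipliers are assumptions to check against current pricing: a write costs 1.25× the base input price and a read 0.10×. The sketch also assumes every call after the first arrives while the cache is still alive.

```python
def prompt_cost_usd(prompt_tokens: int, calls: int, base_price: float = 3.00,
                    write_multiplier: float = 1.25, read_multiplier: float = 0.10,
                    cached: bool = True) -> float:
    """Input cost of sending the same prompt prefix `calls` times.

    Prices are USD per million tokens. Assumes one cache write followed by
    cache hits on every later call.
    """
    if not cached or calls == 0:
        return prompt_tokens * calls * base_price / 1e6
    write = prompt_tokens * base_price * write_multiplier / 1e6
    reads = prompt_tokens * (calls - 1) * base_price * read_multiplier / 1e6
    return write + reads

uncached = prompt_cost_usd(4000, 1000, cached=False)
cached = prompt_cost_usd(4000, 1000)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saved {1 - cached / uncached:.0%}")
# uncached $12.00, cached $1.21, saved 90%
```

With a single call the cached version is more expensive than the uncached one, which is why caching a prompt that is never reused costs money rather than saving it.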
10

Measure Before and After with a Token Counter

Essential for tracking actual impact

All the techniques above are only useful if you measure their effect. Different models tokenize text differently — a prompt that's 150 tokens for GPT-4o might be 180 tokens for Claude. Before deploying any prompt optimization, verify the actual token counts.

# Use tiktoken for OpenAI-compatible counting:
import tiktoken

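enc = tiktoken.encoding_for_model("gpt-4o")

# Substitute your own prompts here
before_prompt = "Please carefully analyze the following customer feedback."
after_prompt = "Analyze customer feedback."

before_tokens = len(enc.encode(before_prompt))
after_tokens  = len(enc.encode(after_prompt))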

print(f"Before: {before_tokens} tokens (${before_tokens/1e6 * 2.50:.6f}/call)")
print(f"After:  {after_tokens} tokens (${after_tokens/1e6 * 2.50:.6f}/call)")
print(f"Savings: {(before_tokens - after_tokens)/before_tokens:.0%}")

Or paste your prompts directly into Tokenia for instant cross-model token counts without writing any code.

Summary: Expected Savings by Technique

Technique                  | Typical Savings            | Implementation Effort
---------------------------|----------------------------|----------------------
Remove filler words        | 15–30%                     | Minutes
Structured formats         | 20–40%                     | Minutes
Abbreviations              | 10–20%                     | Minutes
Rolling summaries          | 40–70%                     | 1–2 hours
Efficient few-shot         | 30–50%                     | 30 minutes
Split complex prompts      | Variable + quality gains   | 2–4 hours
Trim whitespace            | 2–8%                       | 5 minutes
Model-specific features    | 10–25%                     | 1 hour
Prompt caching (Anthropic) | Up to 90% on system prompt | 30 minutes
Measure with token counter | Validates all above        | Ongoing
