10 Token-Saving Prompting Techniques for AI Developers
Every token you send to an LLM API is a billable unit. At GPT-4o prices of $2.50 per million input tokens, a bloated 200-token system prompt repeated across 500,000 daily requests adds 100 million tokens a day. That costs an extra $250 per day (about $7,500 per month) for tokens that do nothing useful. This guide gives you ten concrete techniques you can implement today, each with a before/after or code example you can verify in Tokenia.
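A quick way to sanity-check this kind of estimate is to compute it directly. The sketch below uses the price and volumes quoted above; the function name wasted_cost_usd and the 30-day month are assumptions made for illustration.

```python
def wasted_cost_usd(extra_tokens: int, requests_per_day: int,
                    price_per_million: float, days: int = 1) -> float:
    """Cost of tokens that add nothing, over the given number of days."""
    total_tokens = extra_tokens * requests_per_day * days
    return total_tokens / 1_000_000 * price_per_million

daily = wasted_cost_usd(200, 500_000, 2.50)
monthly = wasted_cost_usd(200, 500_000, 2.50, days=30)
print(f"${daily:,.2f} per day, ${monthly:,.2f} per 30-day month")
```

Plugging in your own request volume and model price shows whether a given prompt trim is worth the engineering time.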
Remove Filler Words and Politeness Padding
Models don't need "please", "kindly", "as an AI language model", or "I'd like you to". These phrases consume tokens without affecting output quality. Cut them ruthlessly.
# BEFORE — 47 tokens
Please carefully analyze the following customer feedback and
kindly provide a detailed sentiment analysis. As a helpful
AI assistant, make sure to be thorough.
# AFTER — 16 tokens
Analyze customer feedback. Return: sentiment (pos/neg/neutral),
key themes, confidence 0-1.
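If you maintain many prompt templates, a small filter can catch the most common padding automatically. This is a minimal sketch: the phrase list is hypothetical and should be tuned to your own templates, and it is meant for prompts you author, not for user-supplied text.

```python
import re

# Hypothetical filler list; adjust it to the phrasing found in your prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bcarefully\b",
    r"\bI'd like you to\b",
    r"\bas a helpful AI assistant,?",
]

def strip_filler(prompt: str) -> str:
    """Remove filler phrases, then tidy the leftover spacing."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    prompt = re.sub(r"[ \t]{2,}", " ", prompt)      # collapse doubled spaces
    prompt = re.sub(r"\s+([,.;:])", r"\1", prompt)  # no space before punctuation
    return prompt.strip()

print(strip_filler("Please kindly analyze the feedback."))
```

Because a regex cannot judge meaning, it is safer to review the output (or use the function as a lint that flags candidates) than to rewrite production prompts blindly.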
Use Structured Formats Instead of Prose Instructions
Prose instructions like "Please make sure to always format your response in JSON with the following fields..." are verbose. Showing the schema directly is more concise and often more effective.
# BEFORE — 52 tokens
Please respond to the user's question in JSON format.
The JSON should have a "answer" field containing your response,
a "confidence" field with a number between 0 and 1, and a
"sources" array if you have any relevant sources.
# AFTER — 18 tokens
Reply JSON: {"answer":"...","confidence":0.0-1.0,"sources":[]}
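One way to keep such schema hints compact is to generate them from an example object with whitespace-free JSON serialization. The helper name schema_instruction is an assumption for illustration; json.dumps with custom separators is standard library behavior.

```python
import json

def schema_instruction(example: dict) -> str:
    """Render an example object as a one-line, whitespace-free JSON hint."""
    compact = json.dumps(example, separators=(",", ":"), ensure_ascii=False)
    return f"Reply JSON: {compact}"

print(schema_instruction({"answer": "...", "confidence": 0.0, "sources": []}))
```

Generating the hint from a dict also keeps the prompt in sync with whatever structure your parsing code expects.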
Use Abbreviations in System Prompts
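In long system prompts, define abbreviations once and use them throughout: "UI" instead of "user interface", "req" instead of "requirement", "KB" instead of "knowledge base". Models generally handle common abbreviations well. For project-specific ones, spend a few tokens on a one-time definition so the meaning is unambiguous.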
# BEFORE — 38 tokens
When the user submits a support ticket, check the knowledge base
for relevant documentation. If the knowledge base contains an
answer, provide it with a link to the knowledge base article.
# AFTER — 23 tokens
On support ticket: check KB for relevant docs.
If KB match found: answer + link KB article.
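The substitution itself can be applied mechanically to an existing prompt. The sketch below assumes a hand-written abbreviation map (hypothetical entries) and replaces whole phrases only, longest first, so that overlapping phrases do not clobber each other.

```python
import re

# Hypothetical abbreviation map; phrases are matched case-insensitively.
ABBREVIATIONS = {
    "knowledge base": "KB",
    "user interface": "UI",
    "documentation": "docs",
}

def abbreviate(prompt: str, mapping: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace each long phrase with its abbreviation, longest phrase first."""
    for phrase in sorted(mapping, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        prompt = re.sub(pattern, mapping[phrase], prompt, flags=re.IGNORECASE)
    return prompt

print(abbreviate("Check the knowledge base for relevant documentation."))
```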
Compress Context with Rolling Summaries
As conversations grow, you re-send the entire history on every turn. After 6–8 turns, replace older messages with a compact summary. Use a cheap model (Gemini Flash) for summarization to minimize the cost of compression itself.
async def compress_history(messages, max_tokens=1500):
    # count_tokens and call_llm are app-specific helpers, not defined here
    # Nothing to compress if under budget or too short to split safely
    if count_tokens(messages) < max_tokens or len(messages) <= 7:
        return messages
    # Summarize all but last 3 user/assistant pairs
    to_summarize = messages[1:-6]  # skip system prompt, keep last 3 turns
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in to_summarize)
    summary_prompt = f"Summarize this conversation in ≤100 words:\n{transcript}"
    summary = await call_llm(summary_prompt, model="gemini-2.5-flash")
    return [
        messages[0],  # system prompt
        {"role": "assistant", "content": f"[Prior context: {summary}]"},
        *messages[-6:]  # last 3 turns
    ]
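compress_history relies on a count_tokens helper that the snippet above does not define. The following is one possible stand-in, assuming roughly 4 characters per token and a small fixed overhead per message. Both constants are assumptions, so use a real tokenizer when you need billing-grade numbers.

```python
import math

def count_tokens(messages: list[dict]) -> int:
    """Rough estimate of the tokens in a chat history.

    Assumes about 4 characters per token plus a fixed per-message overhead.
    """
    chars_per_token = 4        # assumed average for English text
    overhead_per_message = 4   # assumed allowance for role and delimiters
    return sum(
        math.ceil(len(m["content"]) / chars_per_token) + overhead_per_message
        for m in messages
    )

history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My order hasn't arrived yet."},
]
print(count_tokens(history))
```

An estimate like this is good enough to decide when to trigger compression, since the threshold itself is a soft budget.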
Use Few-Shot Examples Efficiently
Few-shot examples are powerful but expensive if they're bloated. Use the shortest examples that still demonstrate the pattern. Remove explanatory text around the examples — the model infers the pattern from the input/output pairs alone.
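The pattern described above, bare input/output pairs with no explanatory text around them, can be produced by a small builder. This is a sketch; the function name build_few_shot and the sample message are assumptions for illustration.

```python
def build_few_shot(task: str, examples: list[tuple[str, str]], message: str) -> str:
    """Build a compact few-shot prompt: one task line, then bare input → label pairs."""
    lines = [task]
    lines += [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{message}" →')  # open slot for the model to complete
    return "\n".join(lines)

prompt = build_few_shot(
    "Classify message type.",
    [("My order hasn't arrived", "delivery"), ("I was charged twice", "billing")],
    "Where is my refund?",
)
print(prompt)
```

Keeping the examples in a list also makes it easy to measure how many you actually need: drop one, re-run your evaluation, and keep the shorter prompt if quality holds.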
# BEFORE — 78 tokens
Here are some examples of how to classify customer messages:
Example 1: When a customer says "My order hasn't arrived yet"
this should be classified as a "delivery" issue.
Example 2: When a customer says "I was charged twice"
this should be classified as "billing".
# AFTER — 28 tokens
Classify message type.
"My order hasn't arrived" → delivery
"I was charged twice" → billing
"{message}" →
Split Complex Prompts into Smaller Calls
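A single complex prompt that asks for 5 things often gets worse results than 5 focused prompts. More importantly, you can route sub-tasks to cheaper models. Splitting can raise the total token count because shared context is sent more than once, so the savings here come from a lower price per token rather than fewer tokens. Extract → summarize → classify → generate can each be handled by the cheapest capable model.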
# Instead of one expensive call that does everything:
# "Extract entities, classify sentiment, generate response,
# translate to Spanish, and check for policy violations"
# Break it into routed calls:
entities = await call(extract_prompt, model="gemini-flash") # cheap
sentiment = await call(classify_prompt, model="gemini-flash") # cheap
response = await call(generate_prompt, model="gpt-4o") # expensive
spanish = await call(translate_prompt, model="gemini-flash") # cheap
safe = await call(safety_prompt, model="gemini-flash") # cheap
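When sub-tasks do not depend on each other, they can also run concurrently so the extra calls do not add up in latency. The sketch below uses asyncio.gather with a stand-in call function (hypothetical, it only echoes its input) and assumes the reply needs the extraction results while translation and the policy check need only the reply.

```python
import asyncio

async def call(prompt: str, model: str) -> str:
    """Stand-in for a real API client (hypothetical); returns a labeled echo."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{model}] {prompt}"

async def handle(text: str) -> dict:
    # Independent cheap steps run concurrently
    entities, sentiment = await asyncio.gather(
        call(f"Extract entities: {text}", model="gemini-flash"),
        call(f"Classify sentiment: {text}", model="gemini-flash"),
    )
    # The expensive step runs once, with the earlier results as input
    response = await call(
        f"Write reply. Entities: {entities}. Sentiment: {sentiment}",
        model="gpt-4o",
    )
    # Both follow-up steps only need the generated reply
    spanish, safe = await asyncio.gather(
        call(f"Translate to Spanish: {response}", model="gemini-flash"),
        call(f"Check policy: {response}", model="gemini-flash"),
    )
    return {"response": response, "spanish": spanish, "safe": safe}

result = asyncio.run(handle("My order arrived broken."))
print(result["response"])
```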
Trim Whitespace and Blank Lines
Blank lines, trailing spaces, and excessive indentation all consume tokens. The savings are small, but the cleanup costs almost nothing to implement and adds up across millions of calls.
import re
def compress_whitespace(prompt: str) -> str:
    # Remove trailing spaces on each line
    prompt = re.sub(r'[ \t]+$', '', prompt, flags=re.MULTILINE)
    # Collapse 2+ blank lines into one
    prompt = re.sub(r'\n{3,}', '\n\n', prompt)
    # Strip leading/trailing whitespace
    return prompt.strip()
# Example savings:
before = " You are a helpful \n\n\n\n assistant. \n "
after = compress_whitespace(before)
# before: ~12 tokens → after: ~6 tokens
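The function above removes trailing spaces and extra blank lines but leaves leading indentation in place. A common source of that indentation is a triple-quoted prompt written inside an indented function body. A minimal sketch of handling it with the standard library's textwrap.dedent follows; the helper names are assumptions for illustration.

```python
import textwrap

def dedent_prompt(prompt: str) -> str:
    """Strip the indentation a triple-quoted string inherits from surrounding code."""
    return textwrap.dedent(prompt).strip()

def build_system_prompt() -> str:
    prompt = """
        You are a support agent.
        Rules:
          - Be concise.
    """
    return dedent_prompt(prompt)

print(build_system_prompt())
```

dedent removes only the indentation common to every line, so the relative indent of the nested bullet survives while the eight leading spaces on each line are dropped.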
Use Model-Specific Optimizations
Each model has features designed to reduce token overhead. Using them correctly means you get the same quality for fewer tokens:
- OpenAI: Use response_format: {"type": "json_object"} instead of long formatting instructions. The API still expects the word "JSON" to appear somewhere in the messages, but the output is far more likely to be compact, valid JSON
- Anthropic: Use XML tags (<instructions>, <context>) for clearer structure that requires fewer explanatory words
- Gemini: Use generationConfig.responseMimeType: "application/json" for direct JSON output without prose wrapping
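To make the three options above concrete, here is a sketch of what each looks like as plain request data, with no network calls. The field names reflect the public API documentation as I understand it, and the helper names and the short system message are assumptions for illustration. Check the current provider docs before relying on them.

```python
import json

def openai_json_request(user_text: str, model: str = "gpt-4o") -> dict:
    """Chat Completions body with JSON mode on; the prompt still mentions JSON."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Reply in JSON."},
            {"role": "user", "content": user_text},
        ],
        "response_format": {"type": "json_object"},
    }

def gemini_json_request(user_text: str) -> dict:
    """generateContent REST body that asks for a JSON response."""
    return {
        "contents": [{"role": "user", "parts": [{"text": user_text}]}],
        "generationConfig": {"responseMimeType": "application/json"},
    }

def anthropic_tagged_prompt(instructions: str, context: str) -> str:
    """Wrap prompt sections in XML tags instead of describing them in prose."""
    return f"<instructions>{instructions}</instructions>\n<context>{context}</context>"

print(json.dumps(openai_json_request("List three colors."), indent=2))
```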
Cache System Prompts with Anthropic Prompt Caching
Anthropic's prompt caching marks part of your prompt as cacheable. On subsequent requests, cached portions are read at $0.30/1M tokens instead of $3.00/1M — a 90% reduction. This is especially valuable for long system prompts or large knowledge base injections.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # e.g. 4000 tokens
            "cache_control": {"type": "ephemeral"}  # cache for 5 min
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# First call: writes cache (billed at a premium over base input, 1.25x for the 5-minute cache)
# Calls 2-N: reads cache ($0.30/1M vs $3.00/1M)
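Caching pays off only once the prompt is reused, because the first request writes the cache. The sketch below works through the break-even using the read and base prices quoted above. It assumes a cache-write price of $3.75/1M (a 25% surcharge on the base price) and that every call arrives within the cache lifetime.

```python
BASE = 3.00         # $/1M input tokens, from the text above
CACHE_READ = 0.30   # $/1M cached tokens, from the text above
CACHE_WRITE = 3.75  # $/1M, assumed 25% surcharge for the 5-minute cache

def system_prompt_cost(tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same system prompt `calls` times."""
    if calls <= 0:
        return 0.0
    if not cached:
        return tokens * calls * BASE / 1_000_000
    # One cache write, then a cache read for every later call
    return tokens * (CACHE_WRITE + (calls - 1) * CACHE_READ) / 1_000_000

for n in (1, 2, 100):
    plain = system_prompt_cost(4000, n, cached=False)
    cached = system_prompt_cost(4000, n, cached=True)
    print(f"{n:>3} calls: ${plain:.4f} uncached vs ${cached:.4f} cached")
```

Under these assumptions a single call is slightly more expensive with caching, the second call already puts you ahead, and the saving approaches 90% as the number of cache hits grows.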
Measure Before and After with a Token Counter
All the techniques above are only useful if you measure their effect. Different models tokenize text differently — a prompt that's 150 tokens for GPT-4o might be 180 tokens for Claude. Before deploying any prompt optimization, verify the actual token counts.
# Use tiktoken for OpenAI-compatible counting:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
before_tokens = len(enc.encode(before_prompt))
after_tokens = len(enc.encode(after_prompt))
print(f"Before: {before_tokens} tokens (${before_tokens/1e6 * 2.50:.6f}/call)")
print(f"After: {after_tokens} tokens (${after_tokens/1e6 * 2.50:.6f}/call)")
print(f"Savings: {(before_tokens - after_tokens)/before_tokens:.0%}")
Or paste your prompts directly into Tokenia for instant cross-model token counts without writing any code.
Summary: Expected Savings by Technique
| Technique | Typical Savings | Implementation Effort |
|---|---|---|
| Remove filler words | 15–30% | Minutes |
| Structured formats | 20–40% | Minutes |
| Abbreviations | 10–20% | Minutes |
| Rolling summaries | 40–70% | 1–2 hours |
| Efficient few-shot | 30–50% | 30 minutes |
| Split complex prompts | Variable + quality gains | 2–4 hours |
| Trim whitespace | 2–8% | 5 minutes |
| Model-specific features | 10–25% | 1 hour |
| Prompt caching (Anthropic) | Up to 90% on system prompt | 30 minutes |
| Measure with token counter | Validates all above | Ongoing |