Token Usage Patterns and Optimization Techniques

Tokens are the currency of LLM systems—understanding how they work and optimizing their usage can dramatically reduce costs and improve performance.

level: intermediate
topics: cost-optimization, performance, tokens
tags: tokens, optimization, cost, efficiency

Every LLM API charges by the token. Input tokens, output tokens, cached tokens—understanding token mechanics is fundamental to building cost-effective and performant systems.

But tokens aren’t just about money. They’re about context limits, latency, and system design. Optimize token usage, and you optimize everything downstream.

What Tokens Actually Are

Tokens aren’t words. They’re sub-word units that the model uses internally.

Common words = single tokens: “the”, “is”, “hello” are each one token.

Uncommon words = multiple tokens: “extraordinary” might be 3-4 tokens depending on the tokenizer.

Whitespace and punctuation count: Spaces, newlines, commas—everything counts as tokens.

Different languages have different token counts: English is relatively token-efficient with most tokenizers. Languages with non-Latin scripts (Chinese, Arabic) often require more tokens for the same content.

Special characters are expensive: JSON, code, or formatted text often requires more tokens than plain prose.

You can’t optimize what you don’t measure. Use tokenizer tools (like OpenAI’s tiktoken) to count tokens before sending prompts.
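Counting takes only a couple of lines. A minimal sketch using tiktoken; the model name and prompt text are placeholders:

```python
# Minimal token-counting sketch with OpenAI's tiktoken.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # pick the encoding your model uses

prompt = "Summarize the following support ticket in two sentences."
print(len(encoding.encode(prompt)), "tokens")  # count before you send, not after the bill
```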

Where Tokens Come From

Every LLM request has token costs:

System prompt: The instructions you give the model. Repeated on every request if not cached.

User input: The query or message from the user.

Conversation history: Previous messages in a chat interface.

Retrieved context: Documents, search results, or database records you include.

Few-shot examples: Example inputs and outputs you provide for in-context learning.

Output tokens: The response the model generates.

Each of these can be optimized independently.
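One way to see where a request's budget goes is to count each component separately. A rough sketch, with placeholder strings standing in for your real prompt pieces:

```python
# Per-component token accounting; the component contents are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

components = {
    "system_prompt": "You are a support assistant. Answer concisely.",
    "conversation_history": "User: hi\nAssistant: Hello! How can I help?",
    "retrieved_context": "Doc 1: Refunds are processed within 5 business days.",
    "few_shot_examples": "Q: Where is my order?\nA: Check the tracking link in your email.",
    "user_input": "How long do refunds take?",
}

for name, text in components.items():
    print(f"{name}: {len(enc.encode(text))} tokens")
```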

Optimization 1: Compress System Prompts

System prompts are often verbose and repeated on every request.

Cut unnecessary words: “Please assist the user in finding information about…” → “Help users find information about…”

Use abbreviations: If the model understands them, “Q:” and “A:” work just as well as “Question:” and “Answer:”.

Remove examples that don’t help: If you included 10 examples but 3 would suffice, remove the extras.

Test aggressively: Remove sections, test quality. If quality holds, keep it shorter.

Across thousands of requests, a 500-token system prompt costs 2.5x as much as a 200-token one doing the same job.
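A quick way to verify the savings is to measure the prompt before and after compression. A sketch with made-up prompt text:

```python
# Before/after token comparison for a compressed system prompt.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "Please assist the user in finding information about our products. "
    "Always be polite, helpful, and make sure to provide accurate answers "
    "to any questions the user might have about the product catalog."
)
compressed = "Help users find accurate product information. Be polite and concise."

print(len(enc.encode(verbose)), "->", len(enc.encode(compressed)), "tokens")
```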

Optimization 2: Intelligent Context Inclusion

Don’t include everything—include what’s relevant.

Retrieval ranking: When doing RAG, retrieve 20 documents but only include the top 3 most relevant ones.

Summarization: If retrieved documents are long, summarize them before including them. The model doesn’t need every sentence.

Progressive context: Start with minimal context. If the model needs more, add it in follow-up queries.

Sliding windows for conversation: Keep the last N messages, not the entire conversation history. Older messages contribute less value and waste tokens.
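In practice, “retrieve broadly, include narrowly” can be as simple as the sketch below. The retriever interface (a callable returning (score, text) pairs) is an assumption standing in for your vector store client.

```python
# Sketch: retrieve many candidates, include only the top-k in the prompt.
def build_context(query, retriever, fetch_k=20, include_k=3):
    """Retrieve fetch_k candidates but keep only the include_k most relevant."""
    candidates = retriever(query, fetch_k)               # [(score, text), ...]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return "\n\n".join(text for _, text in candidates[:include_k])
```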

Optimization 3: Smart Conversation History Management

Multi-turn conversations accumulate history quickly.

Truncation strategies: Keep the most recent N messages. Drop older ones.

Summarization: Summarize earlier parts of the conversation. “Previously, the user asked about X and you explained Y.”

Selective retention: Keep critical messages (user intents, key decisions) and drop filler (greetings, confirmations).

Reset on topic changes: If the conversation shifts topics, clear history and start fresh.

A 20-turn conversation can easily exceed 10,000 tokens if you keep full history. Truncating or summarizing keeps it manageable.
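A minimal sketch of the truncation-plus-summary approach, assuming messages are dicts with role and content keys and that you supply the summarizer (typically another LLM call):

```python
# Sliding-window history management with an optional summarization hook.
def trim_history(messages, keep_last=8, summarize=None):
    """Keep the last `keep_last` messages; optionally summarize the rest."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    if summarize is None:
        return recent  # simple truncation
    summary = summarize(older)  # e.g. "Previously, the user asked about X and you explained Y."
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```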

Optimization 4: Few-Shot Example Selection

Examples help the model learn tasks, but they’re expensive.

Use fewer examples: Test if 2 examples work as well as 10. Often they do.

Shorter examples: Make examples concise. Long examples waste tokens without adding much signal.

Dynamic examples: Pick examples most relevant to the current query. Don’t always send the same 5 examples.

Zero-shot when possible: Modern models are good at zero-shot tasks. Test if examples are even necessary.
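A rough sketch of dynamic example selection. The word-overlap scorer is a deliberately simple stand-in (embedding similarity is the usual upgrade), and the example-pool shape is an assumption:

```python
# Pick the k few-shot examples most relevant to the current query.
def pick_examples(query, example_pool, k=2):
    """example_pool: list of dicts like {"input": ..., "output": ...}."""
    query_words = set(query.lower().split())

    def overlap(example):
        return len(query_words & set(example["input"].lower().split()))

    return sorted(example_pool, key=overlap, reverse=True)[:k]
```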

Optimization 5: Output Length Control

You can’t perfectly control output length, but you can influence it.

Max tokens parameter: Set strict limits. If you need a summary, cap at 150 tokens. Keep in mind the cap truncates generation rather than making the model write concisely, so pair it with explicit instructions.

Explicit instructions: “Respond in 3 sentences” or “Keep it under 100 words.”

Stop sequences: Define stop tokens so the model stops early when appropriate.

Penalty parameters: Some APIs expose frequency and presence penalties. They target repetition rather than length directly, but tuning them can rein in rambling output.

Shorter outputs save money and reduce latency.
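Putting the first three levers together might look like the following sketch (OpenAI Python SDK; the model name, prompt, and stop sequence are placeholders, and parameter names vary slightly across providers):

```python
# Hedged sketch: cap output length, instruct brevity, and set a stop sequence.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize the user's text in at most 3 sentences."},
        {"role": "user", "content": "Paste the text to summarize here."},
    ],
    max_tokens=150,   # hard cap: generation is cut off if it runs long
    stop=["\n\n"],    # optional: stop early at a blank line
)
print(response.usage.total_tokens, "tokens used")
```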

Tokenization Gotchas

Some text formats are token-inefficient.

JSON: Lots of punctuation and special characters. JSON responses often use 30-50% more tokens than plain text for the same information.

Code: Indentation, syntax, special characters—code is token-heavy.

Markdown: Formatting characters (**, ##, -) add tokens.

Solution: If you don’t need formatting, use plain text. If you need structure, consider more token-efficient formats (YAML is slightly better than JSON in some cases).
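To see the overhead for yourself, compare token counts for the same information in JSON and plain prose. A small sketch with made-up data:

```python
# Compare token cost of JSON vs plain text carrying the same facts.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

as_json = json.dumps({"name": "Ada Lovelace", "role": "engineer", "location": "London"})
as_text = "Ada Lovelace is an engineer based in London."

print("JSON:", len(enc.encode(as_json)), "tokens")
print("Text:", len(enc.encode(as_text)), "tokens")
```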

Token Counting and Monitoring

Track token usage to identify waste.

Log tokens per request: Input tokens, output tokens, total tokens. Aggregate to understand patterns.

Identify high-token features: Which parts of your app use the most tokens? Optimize those first.

Detect anomalies: Sudden spikes in token usage might indicate bugs (infinite loops, runaway generation, prompt injection attacks).

Cost attribution: Break down costs by feature, user, or query type. This informs prioritization.
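A minimal per-request logging sketch; the usage object mirrors the prompt/completion/total fields most APIs return, and the feature/user labels are placeholders for your own cost-attribution keys:

```python
# Log token usage per request so waste and anomalies show up in aggregates.
import logging

logger = logging.getLogger("token_usage")

def log_usage(feature, user_id, usage):
    logger.info(
        "feature=%s user=%s prompt_tokens=%d completion_tokens=%d total=%d",
        feature, user_id, usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
    )
```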

Chunking Strategies for Long Documents

When you must process long documents, chunk intelligently.

Semantic chunking: Split at natural boundaries (paragraphs, sections) rather than arbitrary token counts.

Overlapping chunks: Include some overlap so context isn’t lost at chunk boundaries.

Hierarchical processing: Summarize each chunk, then summarize the summaries. This uses more total tokens but fits within context limits.

Map-reduce patterns: Process chunks in parallel, combine results. Faster but more expensive.
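A sketch of paragraph-based chunking with a one-paragraph overlap and a token budget enforced via tiktoken; the 800-token budget is an arbitrary placeholder:

```python
# Split on paragraph boundaries, pack paragraphs under a token budget,
# and carry the last paragraph forward so chunk boundaries keep context.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_document(text, max_tokens=800):
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(enc.encode(candidate)) > max_tokens:
            chunks.append("\n\n".join(current))
            current = [current[-1], para]   # overlap: repeat the previous paragraph
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```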

Provider-Specific Optimizations

Different providers have different token economics.

Prompt caching: Some providers cache long prefixes. Structure prompts so the expensive parts (system prompt, long context) are in the prefix and the variable parts (user query) are in the suffix.

Batching: Batch APIs charge less per token but have higher latency. Use for non-real-time workloads.

Model-specific tokenizers: GPT-4 and GPT-3.5 use the same tokenizer, but Claude, Gemini, and open-source models use different ones. Token counts vary across providers for the same text.
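For prefix caching, the key is request structure: keep the stable, expensive parts identical and up front, and let only the user query vary. A generic sketch (the prompt contents are placeholders, and exact caching mechanics differ by provider):

```python
# Cache-friendly structure: stable prefix first, variable suffix last.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp."       # stable
STATIC_KNOWLEDGE = "Refund policy: 5 business days.\nShipping: 2-4 days."  # stable, long

def build_messages(user_query):
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT + "\n\n" + STATIC_KNOWLEDGE},
        {"role": "user", "content": user_query},   # the only part that changes per request
    ]
```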

The Context Window Trade-off

Larger context windows allow more tokens but don’t always improve quality.

Dilution effect: Models sometimes struggle with extremely long contexts. Key information gets “lost in the middle.”

Cost-context curve: Doubling the context you actually send doubles input cost, but rarely doubles answer quality.

Smart truncation: Keep the most relevant tokens. Drop low-value content.

Alternative architectures: For very long documents, consider multiple passes with smaller contexts rather than one pass with a huge context.

Token Budgeting

Set token budgets for different use cases.

Interactive chat: Budget 1,500 input tokens + 500 output tokens per request.

RAG summarization: Budget 3,000 input tokens + 300 output tokens.

Report generation: Budget 2,000 input tokens + 2,000 output tokens.

When you exceed budgets, decide: compress input, truncate output, or accept higher cost for this specific use case.
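A sketch of encoding those budgets and enforcing the input side before a request goes out; the numbers mirror the examples above:

```python
# Per-use-case token budgets with a simple pre-flight check on the input side.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

BUDGETS = {
    "interactive_chat":  {"input": 1500, "output": 500},
    "rag_summarization": {"input": 3000, "output": 300},
    "report_generation": {"input": 2000, "output": 2000},
}

def check_input_budget(use_case, prompt):
    used = len(enc.encode(prompt))
    allowed = BUDGETS[use_case]["input"]
    if used > allowed:
        raise ValueError(f"{use_case}: {used} input tokens exceeds budget of {allowed}")
    return used
```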

Testing Token Optimizations

Don’t guess—measure.

Before/after comparisons: Measure token usage before optimization, implement changes, measure after. Quantify savings.

Quality checks: Ensure optimizations don’t degrade output quality. Use eval sets or A/B tests.

Cost modeling: Project token savings across your request volume. A 20% reduction in tokens translates to roughly a 20% lower bill, assuming the savings are spread across input and output alike.
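A back-of-the-envelope projection sketch; the prices, volumes, and token counts are made-up placeholders, not any provider’s real rates:

```python
# Project monthly cost before and after an optimization.
def monthly_cost(requests, input_tokens, output_tokens,
                 input_price_per_1k=0.005, output_price_per_1k=0.015):
    per_request = (input_tokens / 1000) * input_price_per_1k \
                + (output_tokens / 1000) * output_price_per_1k
    return requests * per_request

before = monthly_cost(requests=1_000_000, input_tokens=1800, output_tokens=400)
after  = monthly_cost(requests=1_000_000, input_tokens=1400, output_tokens=300)
print(f"before=${before:,.0f}  after=${after:,.0f}  saved={1 - after / before:.0%}")
```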

What Good Looks Like

A token-optimized system:

  • Counts tokens explicitly (doesn’t estimate or guess)
  • Compresses system prompts ruthlessly
  • Includes only relevant context (not everything)
  • Manages conversation history intelligently (truncation or summarization)
  • Sets output length limits
  • Monitors token usage and identifies high-usage patterns
  • Tests optimizations to ensure quality holds

Tokens are the atomic unit of LLM economics. Every token saved is money saved and latency reduced.

Measure token usage, optimize systematically, and always test that quality holds. You’ll be shocked by how much waste you can eliminate without users noticing.