
The complete guide to Claude Code cost optimization (2026)

April 7, 2026 · 18 min read

Claude Code is the most capable coding AI available right now. It is also, if you are not careful, one of the most expensive. Whether you are on a $200/month Max subscription hitting rate limits daily, or paying per token on the API and watching the bill climb, there are concrete steps you can take to cut costs without losing output quality.

This guide covers twelve strategies I have tested across hundreds of sessions running a production business entirely on Claude Code. Most of them take less than five minutes to implement. Combined, they can reduce your effective cost by 30-60%.

Use the free CostPilot analyzer to measure your current baseline before applying these changes. Upload your session JSONL files and see exactly where tokens go. Then come back after a week and measure again.

Part 1: Context management (the biggest lever)

Context loading is where most people waste the most tokens without realizing it. Every session starts by loading your project files, and every message carries the full conversation history. Small changes here compound across hundreds of sessions.

Strategy 1: Keep your CLAUDE.md under 500 words

Your CLAUDE.md file loads on every single session. If it is 2,000 words, that is roughly 3,000 tokens consumed before you type anything. Multiply by 20 sessions per day and you are burning 60,000 tokens daily just on instructions.

The fix: treat CLAUDE.md like a routing table, not a manual. State the rules in short, direct sentences. Move detailed reference material to separate files that Claude loads on demand. A 500-word CLAUDE.md that points to a context/ directory with detailed docs loads roughly a quarter of the tokens per session compared with a monolithic file.
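As a sketch of the routing-table style (the rules and file names here are illustrative, not prescriptive):

```markdown
# CLAUDE.md

## Rules
- TypeScript strict mode; no `any`.
- Run `npm test` before declaring a task done.
- Never commit directly to `main`.

## Where to look (load only when relevant)
- Coding standards: context/coding-standards.md
- API reference: context/api-docs.md
- Deployment steps: context/deploy.md
```

Every rule is one line, and everything longer than a sentence lives behind a pointer.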

Before: A 2,500-word CLAUDE.md with coding standards, project history, API docs, and deployment instructions all inline. ~3,750 tokens per session load.

After: A 400-word CLAUDE.md with rules and pointers. context/coding-standards.md, context/api-docs.md, context/deploy.md loaded only when relevant. ~600 tokens per session load, plus the specific file when needed.

Savings: ~63,000 tokens/day at 20 sessions. That is real money on API billing and real quota on subscription plans.

Strategy 2: Use tiered context loading

Not all context is relevant to every task. If Claude is fixing a CSS bug, it does not need your database migration docs. Tiered loading means organizing your context into levels:

  • Tier 1 (always loaded): CLAUDE.md with core rules and a compact index of available docs
  • Tier 2 (loaded by task type): Frontend docs for UI work, backend docs for API work, deployment docs for infra work
  • Tier 3 (loaded on demand): Detailed API references, historical decisions, meeting notes

Add a context/INDEX.md file that lists what is available and when to load each file. Claude is smart enough to follow this routing. The result: each session loads only the 20-30% of context that is actually relevant.
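A minimal INDEX.md might look like this (file names and load conditions are hypothetical; adapt them to your project):

```markdown
# context/INDEX.md

| File         | Load when...                          |
| ------------ | ------------------------------------- |
| frontend.md  | Task touches UI, CSS, or components   |
| backend.md   | Task touches API routes or the DB     |
| deploy.md    | Task involves CI/CD or infrastructure |
| decisions.md | You need historical rationale         |
```

The table is the whole trick: Claude reads one short file and knows which Tier 2 or Tier 3 doc to pull in.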

Strategy 3: Start new sessions for new tasks

Every message in a conversation includes the full history. By message 15, each new exchange resends the previous 14 messages as input tokens. By message 30, you are spending more on history than on the actual work.
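To see why long sessions get expensive, here is a back-of-the-envelope sketch. The 500-token average per message is an assumption for illustration:

```python
def cumulative_input_tokens(n_messages: int, avg_tokens_per_message: int = 500) -> int:
    """Total input tokens sent over a session where every message
    resends the full history: message k carries k messages' worth."""
    return sum(k * avg_tokens_per_message for k in range(1, n_messages + 1))

# History grows quadratically with session length:
short = cumulative_input_tokens(10)  # 27,500 tokens
long = cumulative_input_tokens(30)   # 232,500 tokens
```

Three 10-message sessions cost about 82,500 input tokens; one 30-message session costs 232,500. Same number of exchanges, nearly 3x the spend.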

The rule: one task, one session. Finished debugging that API endpoint? Start a fresh session for the next task. The context loading cost of a new session is almost always less than carrying 20 messages of irrelevant history.

Exception: multi-step tasks where each step builds on the last (like a complex refactor) benefit from staying in the same session because the context is genuinely relevant.

Part 2: Model routing (pay only for what you need)

Strategy 4: Use Sonnet for mechanical tasks

Opus is 5x more expensive than Sonnet per token. For many tasks, Sonnet produces identical results. The key is knowing which tasks need Opus-level reasoning and which ones are mechanical.

Good for Sonnet: File renames, simple refactors, boilerplate generation, test writing from existing patterns, documentation updates, code formatting, straightforward bug fixes with clear error messages.

Keep on Opus: Complex architectural decisions, debugging subtle race conditions, implementing novel algorithms, security reviews, planning multi-file changes, tasks requiring deep codebase understanding.

If you are using Claude Code skills or subagents, add model: sonnet to the frontmatter of mechanical skills. This routes them to the cheaper model automatically.
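A sketch of what that frontmatter might look like for a skill file (exact field support depends on your Claude Code version; check the official docs before relying on it):

```markdown
---
name: format-fixer
description: Mechanical formatting and lint fixes
model: sonnet
---

Apply the project's formatting rules to the files the user names.
Do not change logic; only whitespace, imports, and style.
```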

Strategy 5: Use Haiku for validation and formatting

Haiku costs roughly 1/25th of what Opus does per token. For validation tasks (checking JSON syntax, verifying file paths exist, formatting output), it is more than capable.

Real example: a content review workflow that uses Opus for the initial draft, Sonnet for editing, and Haiku for final spell-check and format validation. Total cost per piece drops by roughly 40% compared to running everything on Opus.

Strategy 6: Route subagents to cheaper models

If you use Claude Code subagents (via the Agent tool), each subagent can run on a different model. The parent agent on Opus orchestrates the work, while child agents on Sonnet or Haiku handle the execution.

Pattern: Opus plans the work and breaks it into tasks. Sonnet subagents execute each task independently. Opus reviews the combined results. This "orchestrator-worker" pattern can cut the total token cost of complex workflows by 50-70%.
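To put rough numbers on the pattern, here is a sketch. The rates and the 50k/150k token split are illustrative assumptions; check current Anthropic pricing before budgeting:

```python
# Per-million-token input rates in USD (illustrative; verify current pricing).
RATES = {"opus": 15.00, "sonnet": 3.00}

def cost(tokens: int, model: str) -> float:
    return tokens / 1_000_000 * RATES[model]

# A hypothetical workflow: 50k tokens of planning and review,
# 150k tokens of execution work.
all_opus = cost(50_000 + 150_000, "opus")                    # $3.00
orchestrated = cost(50_000, "opus") + cost(150_000, "sonnet")  # $1.20

savings = 1 - orchestrated / all_opus                         # 60%
```

With this split, routing execution to Sonnet cuts the bill by 60%, squarely in the 50-70% range; the exact figure depends on how much of the work is planning versus execution.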

Part 3: Caching (free performance)

Strategy 7: Maximize cache hit rates

Anthropic caches prompt prefixes. If the beginning of your prompt is identical across messages (which it is, because CLAUDE.md and rules files load first), those tokens are served from cache at a 90% discount. This happens automatically, but you can help it.

Keys to high cache rates:

  • Keep static content (rules, context) at the beginning of the prompt
  • Put dynamic content (user messages, file reads) at the end
  • Avoid randomizing the order of loaded files between sessions
  • Do not modify CLAUDE.md frequently. Every change invalidates the cache for that prefix

A well-structured project with stable CLAUDE.md and consistent context loading can hit 70-85% cache rates. A messy project with frequently changing instructions might see 20-30%. The difference is substantial: at Opus rates, 80% cache vs 30% cache on a 50,000-token context saves roughly $0.50 per session.
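The arithmetic behind that comparison, as a sketch. The Opus input rate and the 90% cache discount are assumptions based on published pricing at the time of writing:

```python
INPUT_RATE = 15.00 / 1_000_000        # $/token, Opus input (illustrative)
CACHE_READ_RATE = INPUT_RATE * 0.10   # cached tokens at a 90% discount

def context_cost(context_tokens: int, cache_hit_rate: float) -> float:
    """Dollar cost of loading one context at a given cache hit rate."""
    cached = context_tokens * cache_hit_rate
    fresh = context_tokens - cached
    return cached * CACHE_READ_RATE + fresh * INPUT_RATE

well_cached = context_cost(50_000, 0.80)    # ~$0.21 per message
poorly_cached = context_cost(50_000, 0.30)  # ~$0.55 per message
saved = poorly_cached - well_cached         # ~$0.34 per message
```

That is about $0.34 saved per message; since the context reloads on every exchange, a session of even a few messages clears the $0.50 mark quickly.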

Check your cache hit rate with the free CostPilot analyzer. Upload a few session files and look at the cache breakdown. If you are below 60%, your context structure needs work.
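If you would rather compute the rate yourself, a rough sketch follows. The field names (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens` inside a per-message `usage` block) are an assumption about the session log format; verify them against your own files before trusting the numbers:

```python
import json
from pathlib import Path

def cache_hit_rate(jsonl_path: Path) -> float:
    """Share of input tokens served from cache in one session file."""
    fresh = cached = 0
    for line in jsonl_path.read_text().splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        usage = record.get("message", {}).get("usage", {})
        fresh += usage.get("input_tokens", 0)
        fresh += usage.get("cache_creation_input_tokens", 0)
        cached += usage.get("cache_read_input_tokens", 0)
    total = fresh + cached
    return cached / total if total else 0.0
```

Run it over a few files from ~/.claude/projects/ and average the results for a baseline.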

Strategy 8: Batch similar tasks in sequence

Cache prefixes persist for a short time between sessions. If you run three frontend tasks back to back, the context from the first session is likely still cached when the second starts. If you alternate between frontend and backend tasks, each switch potentially invalidates the cache.

Practical approach: group your tasks by domain. Do all the frontend work in one block, then switch to backend. Do all your documentation updates together. This natural batching improves cache hit rates without any configuration changes.

Part 4: Session hygiene (small habits, big savings)

Strategy 9: Be specific in your prompts

Vague prompts cost more. "Fix the bug" makes Claude explore multiple files, read code it does not need, and generate exploratory output. "Fix the null pointer in src/auth/login.ts line 42" sends Claude directly to the problem.

Specific prompts reduce:

  • Tool call tokens (fewer file reads and searches)
  • Output tokens (focused response instead of exploratory analysis)
  • Retry tokens (less likely to go in the wrong direction)

This is not about prompt engineering. It is about giving Claude the information it needs to avoid wasting work. File paths, line numbers, function names, expected behavior, actual behavior. The more context in your prompt, the fewer tokens Claude spends discovering that context.

Strategy 10: Use glob and grep before asking Claude to search

When Claude searches your codebase, every file it reads counts as input tokens. A broad search across a large project can consume tens of thousands of tokens just in file reads. If you already know roughly where the relevant code lives, tell Claude directly.

Instead of: "Find all the places where we handle authentication"

Try: "Look at src/auth/ and src/middleware/auth.ts for the authentication handling code"

The targeted version might consume 2,000 tokens in file reads. The broad version might consume 20,000+ while Claude opens and scans multiple directories.
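If you do not know the paths offhand, a few seconds of local search is far cheaper than a Claude exploration. A minimal sketch (the paths, pattern, and search term are hypothetical):

```python
from pathlib import Path

def find_candidates(root: str, needle: str, pattern: str = "*.ts") -> list[str]:
    """Find files mentioning `needle` so you can hand Claude exact paths
    instead of paying for a broad exploratory search."""
    return sorted(
        str(p)
        for p in Path(root).rglob(pattern)
        if p.is_file() and needle in p.read_text(errors="ignore")
    )

# find_candidates("src", "authenticate") -> a short list to paste into the prompt
```

Plain `grep -rn` does the same job; the point is to do the discovery locally for free, then give Claude the answer.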

Strategy 11: Avoid re-reading files Claude already has in context

If Claude just read a file two messages ago, it is still in the conversation history. Asking it to "read that file again" or triggering another read of the same file adds duplicate tokens. This happens more than you think, especially when using tools that automatically re-read files before editing.

If you notice Claude reading the same files repeatedly, restructure your request so it works with what is already in context. Reference specific line numbers or function names from the earlier read instead of asking for a fresh read.

Strategy 12: Clean up stale conversations

Old conversations that you revisit carry their full history. A conversation from last week with 50 messages costs you 50 messages worth of tokens every time you continue it. If you need to return to an old topic, it is usually cheaper to start fresh and paste in the specific context you need.

Putting it all together: a cost-optimized workflow

Here is what a cost-optimized Claude Code setup looks like in practice:

  1. CLAUDE.md is under 500 words. Detailed docs live in context/ with an index file.
  2. Each task gets its own session. Sessions are closed after completing the task.
  3. Tasks are grouped by domain (frontend block, backend block, docs block).
  4. Mechanical skills and subagents route to Sonnet or Haiku.
  5. Prompts include file paths, line numbers, and specific instructions.
  6. Cache hit rates stay above 60% (checked monthly with CostPilot).

Teams and individuals who follow these practices consistently report 30-60% lower effective costs compared to using Claude Code with default settings and habits.

Measure what you manage

None of these strategies matter if you cannot see their impact. The biggest reason people overspend on Claude Code is not that optimization is hard. It is that they have no visibility into where tokens go.

The CostPilot free analyzer gives you that visibility in under two minutes. Upload your session JSONL files from ~/.claude/projects/ and get an instant breakdown of token usage, cache rates, model costs, and waste patterns. No account needed, everything runs in your browser.

If you want ongoing monitoring with historical trends, budget alerts, and automated optimization suggestions, join the CostPilot Pro waitlist. We are building the cost intelligence layer that Claude Code is missing.

Quick reference: optimization cheat sheet

Strategy                        Effort        Savings
Trim CLAUDE.md to 500 words     15 min        10-20%
Tiered context loading          30 min        15-25%
One task per session            Habit         10-30%
Route to Sonnet/Haiku           5 min/skill   30-60%
Maximize cache hits             20 min        10-15%
Batch similar tasks             Habit         5-10%
Specific prompts                Habit         10-20%
Direct file references          Habit         5-15%

Savings are not additive (you cannot save 200% total), but combining the top four strategies typically yields 30-50% reduction in effective cost. Start with trimming CLAUDE.md and routing models. Those give the biggest return for the least effort.

Want to build your own AI OS?

The AI OS Blueprint gives you the complete system: 53-page playbook, working skills, and a clonable repo. Starting at $47.

30-day money-back guarantee. No subscription.