Claude Code rate limits: why you keep hitting them and what to do about it
You are in the middle of a session. Claude Code is reading files, making edits, running tests. Then it stops. "Rate limit exceeded." No countdown timer, no indication of when your quota will reset. You wait, retry, wait some more.
If this happens to you regularly, you are not alone. Rate limits are the single most common complaint from Claude Code power users. The frustrating part is that Anthropic does not publish exact quota numbers, so you are flying blind.
Here is what we know from running Claude Code in production for over 30 days straight, hitting every limit there is to hit, and figuring out what actually helps.
How Claude Code rate limits work
Claude Code uses token-based rate limiting. Every message you send and every response Claude generates consumes tokens. Your available quota depends on your plan tier:
- Claude Pro ($20/month): The most restrictive. Heavy sessions can burn through the daily allocation in 2-3 hours of active coding.
- Claude Max ($100/month): Significantly more headroom. In our experience, production-level workloads are possible, but sustained heavy use across multiple projects still triggers limits.
- Claude Max ($200/month): The highest tier. In our experience, rate limits exist but individual users rarely hit them during normal work.
- API billing: No hard rate limits in the traditional sense. You pay per token, so the limit is your wallet. There are requests-per-minute caps that matter for automated workloads.
The key detail most people miss: rate limits are not just about total tokens per day. They also apply per minute and per hour. A single session that reads twenty large files in rapid succession can trigger a per-minute cap even if your daily usage is well within bounds.
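For automated workloads on API billing, the practical answer to per-minute caps is retry with exponential backoff. A minimal sketch, assuming your client surfaces 429s as an exception; RateLimitError and with_backoff here are illustrative names, not part of any SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your API client raises."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable on RateLimitError, doubling the
    wait each attempt and adding jitter so parallel workers desynchronize."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The jitter matters for multi-agent setups: without it, every worker retries at the same instant and hits the per-minute cap again together.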
What actually consumes your quota
Most people assume output tokens (Claude writing code) are the expensive part. In practice, input tokens dominate, typically outweighing output by 5-10x over a session. Here is where they go:
1. Context loading at session start
Every time you start a new Claude Code session, it loads your CLAUDE.md, project rules, relevant files, and any memory files. If your project has a large CLAUDE.md (2,000+ words) or multiple rule files, that is thousands of tokens consumed before you even ask a question.
Multiply that across 20 sessions per day and context loading alone accounts for 20-40% of your daily quota on some plans.
2. File reads
When Claude Code reads a file to understand it or make edits, the entire file content becomes input tokens. A 500-line Python file is roughly 3,000-5,000 tokens. Read ten files in a session and that is 30,000-50,000 tokens just for reading, before any actual work happens.
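You can ballpark these numbers without running a tokenizer: roughly 4 characters per token is a common rule of thumb for English text and code. That is an approximation, not Claude's exact tokenizer, but it is close enough for quota planning:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~4 characters/token rule of thumb.
    Real tokenizers vary by content; use this only for ballparking."""
    return max(1, len(text) // 4)

# A 500-line file averaging ~35 characters per line:
print(estimate_tokens("x" * (500 * 35)))  # 4375, inside the 3,000-5,000 range
```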
3. Long conversations
Claude Code maintains your full conversation in context. By message 15 in a session, every new message includes the entire conversation history as input. This is where token usage compounds. The 20th message in a session can cost 10x more tokens than the first message.
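The compounding is easy to quantify: if the full history is resent with every message, total input cost grows quadratically with session length. A toy model, assuming a fixed tokens-per-message figure that real sessions will not have:

```python
def conversation_input_cost(n_messages: int, tokens_per_message: int = 500) -> int:
    """Total input tokens when each new message resends the full history.
    Message k carries k * tokens_per_message of input, so the total is
    the triangular number n*(n+1)/2 times tokens_per_message."""
    return sum(k * tokens_per_message for k in range(1, n_messages + 1))

print(conversation_input_cost(5))   # 7,500 tokens total
print(conversation_input_cost(20))  # 105,000 tokens: 14x the 5-message session
```

This quadratic curve is why short sessions and /compact (covered below) are the highest-leverage fixes.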
4. Tool use overhead
Every tool call (file reads, grep searches, bash commands) adds tokens for the tool definition, the request, and the response. A session with heavy tool use can burn through tokens faster than a conversation-only session, even if the conversation is longer.
Practical changes that reduce rate limit hits
Start new sessions instead of continuing long ones
The single most effective change. After 10-15 messages, start a fresh session. You lose conversation context but gain a clean token slate. For production workloads, we found that sessions longer than 20 messages hit rate limits 3x more often than shorter ones.
Keep CLAUDE.md lean
Every token in your CLAUDE.md is loaded on every session. If it has grown to 3,000+ words, trim it. Move detailed reference material to separate files and only load them when needed. Use tiered loading: a compact index file that points to detailed docs, so Claude only reads what the current task requires.
Route simple tasks to cheaper models
Not every task needs Opus. If you use Claude Code with model routing (via the model: frontmatter in skills or the /model command), sending simple lookups and formatting tasks to Haiku or Sonnet keeps your Opus quota available for complex reasoning.
Batch file reads
Instead of asking Claude to read files one at a time, give it a clear task with all relevant file paths upfront. Claude Code can read multiple files in parallel when it knows what it needs. Sequential reads mean sequential token charges and more round trips against the rate limiter.
Use compact mode when available
Claude Code's /compact command summarizes the conversation so far into a shorter form. This reduces the context size for subsequent messages, effectively resetting your per-message token cost without losing all context.
How to see where your tokens actually go
The core problem with rate limits is visibility. You cannot fix what you cannot see. Claude Code does not show you a live token counter or a breakdown of usage by category.
Your options:
- API usage dashboard: If you are on API billing, Anthropic's console shows token usage per request. Useful but not organized by project or session.
- JSONL session logs: Claude Code writes detailed logs of every session. These files contain the raw data, but parsing 50MB of JSONL by hand is not realistic.
- CostPilot free analyzer: We built a free tool that parses your Claude Code JSONL logs and shows you exactly where tokens go. Per-session breakdown, model split, cache analysis, waste detection. Runs entirely in your browser, nothing leaves your machine.
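If you do want to poke at the raw JSONL yourself, a minimal per-session summarizer looks something like this. The log schema is not officially documented, so the field names below (a usage object with input_tokens, output_tokens, and cache counters) are assumptions; unknown lines are skipped rather than trusted:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_session(jsonl_path: str) -> Counter:
    """Sum token counts from one Claude Code session log.

    Assumes lines are JSON objects that may carry a usage dict, either
    nested under "message" or at the top level. Anything else is ignored."""
    totals = Counter()
    for line in Path(jsonl_path).read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("message")
        usage = msg.get("usage") if isinstance(msg, dict) else None
        usage = usage or entry.get("usage") or {}
        for field in ("input_tokens", "output_tokens",
                      "cache_read_input_tokens", "cache_creation_input_tokens"):
            totals[field] += usage.get(field, 0)
    return totals
```

Run it over each file in your logs directory and compare totals across projects to see where your quota is actually going.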
Knowing where your tokens go is the first step to staying under rate limits. Once you can see that 40% of your daily usage is context loading, you know to trim your CLAUDE.md. Once you see that one project generates 3x the tokens of another, you know where to focus optimization.
When rate limits mean you need a different plan
Sometimes the answer is not optimization but capacity. Here is a rough guide:
- Hitting limits once or twice a week: Optimize first. The tips above will probably solve it.
- Hitting limits daily: You are likely on the wrong plan tier. If you are on Pro, Max $100 is worth the upgrade. If you are on Max $100 and hitting limits daily, the $200 tier or API billing might make more sense.
- Running automated workloads: API billing is almost certainly the right choice. Scheduled tasks, heartbeat systems, and multi-agent setups blow through subscription limits fast.
The math depends on your usage pattern. Our free analyzer can help you estimate whether API or subscription is cheaper for your workload.
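A first-order break-even check only needs your monthly token totals and per-token prices. The default rates below are illustrative placeholders, not quoted prices; check Anthropic's current pricing page, since rates change and differ by model:

```python
def monthly_api_cost(input_mtok: float, output_mtok: float,
                     price_in: float = 15.0, price_out: float = 75.0) -> float:
    """Estimated monthly API cost in dollars.

    input_mtok / output_mtok: millions of tokens per month.
    price_in / price_out: illustrative per-million-token rates; verify
    against current published pricing before deciding."""
    return input_mtok * price_in + output_mtok * price_out

# Example: 10M input + 1M output tokens per month at the placeholder rates
print(f"${monthly_api_cost(10, 1):.0f}/month vs a flat $200 subscription")  # $225/month
```

Plug in your real totals from your session logs; if the API number lands well below your subscription price, switch, and if it lands well above, the flat tier is the bargain.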
The bottom line
Rate limits are a token budget problem, not a speed problem. The solution is knowing where your tokens go and making deliberate choices about what gets loaded into context and when. Short sessions, lean context files, model routing, and usage visibility are the four levers that matter.
Start by analyzing your actual usage. Everything else follows from that data.