Rate Limiting and Quota Management for LLM Systems

LLM APIs have strict rate limits and token quotas. Hit them unexpectedly, and your application breaks. Here's how to stay within limits while serving users reliably.

level: intermediate
topics: llmops, reliability, infrastructure
tags: rate-limiting, quotas, reliability, infrastructure

Your LLM-powered application is working great. Traffic is growing. Then suddenly, every request fails with 429 errors: “Rate limit exceeded.”

You’ve hit the LLM provider’s rate limit. Your application is down until the limit resets. Users are angry. You didn’t even know you were approaching the limit.

This happens all the time, and it’s avoidable. LLM APIs have rate limits and token quotas that you must respect. Managing them properly is the difference between a reliable system and one that fails unpredictably.

Understanding LLM Provider Limits

Most LLM APIs have multiple types of limits:

Requests per minute (RPM): Maximum number of API calls you can make in a 60-second window. Exceeding this gives you 429 errors.

Tokens per minute (TPM): Maximum tokens (input + output) you can process in 60 seconds. A single long request can exhaust your quota.

Tokens per day (TPD): Some providers also have daily limits. You might stay under RPM and TPM but still hit TPD if you have sustained high usage.

Concurrent requests: Some APIs limit how many requests can be in flight simultaneously, regardless of total throughput.

These limits vary by pricing tier, model, and provider. Free tiers have low limits; paid tiers have higher limits but they’re still finite.
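As a concrete illustration, you might capture a provider's limits in a small config object. This is only a sketch: the field names and numbers below are made up, and the real values depend on your provider, model, and tier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ProviderLimits:
    """Illustrative structure only; actual values depend on provider, model, and tier."""
    requests_per_minute: int                # RPM
    tokens_per_minute: int                  # TPM (input + output)
    tokens_per_day: Optional[int]           # TPD, if the provider enforces one
    max_concurrent_requests: Optional[int]  # in-flight request cap, if any

# Hypothetical numbers for a mid-tier plan -- check your provider's docs.
limits = ProviderLimits(
    requests_per_minute=500,
    tokens_per_minute=200_000,
    tokens_per_day=5_000_000,
    max_concurrent_requests=50,
)
```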

Measuring Your Current Usage

Before optimizing, understand your baseline.

Track RPM and TPM: Log every API call with timestamp and token count. Aggregate by minute to see peak usage.

Identify traffic patterns: Does usage spike at certain times of day? Are there patterns (weekdays vs. weekends, business hours vs. off-hours)?

Measure per-request token usage: What’s the average token count per request? What’s the p95? What’s the max?

This data tells you how close you are to limits and where spikes come from.
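A minimal sketch of that kind of logging and per-minute aggregation, assuming your application already knows the token counts returned by the provider (the names `log_call` and `peak_usage` are invented for this example):

```python
import time
from collections import defaultdict
from typing import Optional

# minute bucket -> {"requests": count, "tokens": count}
usage_by_minute = defaultdict(lambda: {"requests": 0, "tokens": 0})

def log_call(input_tokens: int, output_tokens: int, ts: Optional[float] = None) -> None:
    """Record one API call in its 60-second bucket."""
    bucket = int((ts if ts is not None else time.time()) // 60)
    usage_by_minute[bucket]["requests"] += 1
    usage_by_minute[bucket]["tokens"] += input_tokens + output_tokens

def peak_usage() -> dict:
    """Return the busiest minute observed so far (peak RPM and peak TPM)."""
    if not usage_by_minute:
        return {"rpm": 0, "tpm": 0}
    return {
        "rpm": max(v["requests"] for v in usage_by_minute.values()),
        "tpm": max(v["tokens"] for v in usage_by_minute.values()),
    }
```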

Request-Level Rate Limiting

Protect your system from exhausting quotas.

Client-side rate limiting: Before sending a request to the LLM API, check if you’re within your quota. If not, queue the request or reject it.

Token budgeting: Track tokens consumed in the current window. Before making a call, check if it would exceed your TPM. If so, delay the request.
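A rough sketch of client-side RPM and TPM budgeting over a sliding 60-second window. The `estimated_tokens` value would come from your own tokenizer or a heuristic; both the class and its limits are assumptions for illustration.

```python
import time
from collections import deque

class RateBudget:
    """Tracks requests and tokens in the last 60 seconds (client-side only)."""

    def __init__(self, rpm_limit: int, tpm_limit: int):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) per completed call

    def _trim(self, now: float) -> None:
        # Drop events older than the 60-second window.
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()

    def can_send(self, estimated_tokens: int) -> bool:
        """Check whether one more call would stay within RPM and TPM."""
        now = time.time()
        self._trim(now)
        requests = len(self.events)
        tokens = sum(t for _, t in self.events)
        return (requests + 1 <= self.rpm_limit
                and tokens + estimated_tokens <= self.tpm_limit)

    def record(self, actual_tokens: int) -> None:
        """Record a completed call with its real token usage."""
        self.events.append((time.time(), actual_tokens))
```

If `can_send` returns False, you can queue the request or reject it, as described above.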

Backoff and retry: When you get a 429 error, don’t immediately retry. Use exponential backoff: wait 1s, then 2s, then 4s. Respect the provider’s rate limit reset time.
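A sketch of exponential backoff on 429s. The `Retry-After` header is standard HTTP, but whether your provider sets it is an assumption; `call_llm` and `RateLimitedError` are placeholders for your actual client and its rate-limit exception.

```python
import random
import time
from typing import Optional

class RateLimitedError(Exception):
    """Placeholder for your client's 429 error; carries the provider's reset hint if any."""
    def __init__(self, retry_after: Optional[float] = None):
        self.retry_after = retry_after

def call_with_backoff(call_llm, *args, max_retries: int = 5, **kwargs):
    """Retry rate-limited calls with exponential backoff and a little jitter."""
    for attempt in range(max_retries):
        try:
            return call_llm(*args, **kwargs)
        except RateLimitedError as err:
            if err.retry_after is not None:
                delay = err.retry_after            # respect the provider's reset time
            else:
                delay = 2 ** attempt + random.uniform(0, 0.5)  # 1s, 2s, 4s, ... plus jitter
            time.sleep(delay)
    raise RuntimeError("Gave up after repeated 429 responses")
```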

Circuit breakers: If you get repeated 429s, temporarily stop sending requests. This prevents cascading failures and gives the quota time to reset.
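And a minimal circuit breaker around the same idea: after a few consecutive 429s, stop sending for a cool-down period. The threshold and cool-down values here are arbitrary examples.

```python
import time
from typing import Optional

class CircuitBreaker:
    """Opens after repeated 429s, blocking calls until a cool-down elapses."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_seconds:
            # Half-open: let a probe request through after the cool-down.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_rate_limit(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()
```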

User-Level Quotas

Prevent individual users from monopolizing your LLM quota.

Per-user rate limits: Each user gets a max number of requests per minute. This prevents one user from exhausting your global quota.

Per-user token limits: Limit not just request count but token usage. A user who sends extremely long prompts can consume disproportionate quota.

Throttling high-volume users: If a user consistently hits their quota, slow them down or prompt them to upgrade to a paid tier.
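A sketch of per-user tracking that covers both request counts and token usage over a sliding window; the default limits shown are illustrative, not recommendations.

```python
import time
from collections import defaultdict, deque

class UserQuota:
    """Per-user request and token limits over a sliding 60-second window."""

    def __init__(self, max_requests_per_min: int = 20, max_tokens_per_min: int = 10_000):
        self.max_requests = max_requests_per_min
        self.max_tokens = max_tokens_per_min
        self.history = defaultdict(deque)  # user_id -> deque of (timestamp, tokens)

    def check_and_record(self, user_id: str, tokens: int) -> bool:
        """Return True and record usage if the user is within quota, else False."""
        now = time.time()
        window = self.history[user_id]
        while window and now - window[0][0] > 60:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        if sum(t for _, t in window) + tokens > self.max_tokens:
            return False
        window.append((now, tokens))
        return True
```

A user who fails the check can be throttled, asked to wait, or prompted to upgrade.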

Prioritizing Traffic

When you’re near quota limits, not all requests are equal.

Priority queues: Critical requests (paid users, high-value features) get processed first. Lower-priority requests (free users, experimental features) wait or get dropped.
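A sketch of such a queue, where higher-priority tiers are drained first; the tier names are assumptions for illustration.

```python
import heapq
import itertools

# Lower number = higher priority; tiers are illustrative.
PRIORITY = {"paid": 0, "free": 1, "experimental": 2}

class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a tier

    def enqueue(self, tier: str, request) -> None:
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), request))

    def dequeue(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request
```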

Graceful degradation: When quota is tight, disable non-essential features. Maybe you skip expensive summarization and return raw search results instead.

Elastic quotas: If you have headroom, allow bursts above normal limits. If you’re near capacity, enforce stricter limits.

Quota Monitoring and Alerting

You need real-time visibility into quota usage.

Dashboard showing current usage: Display RPM, TPM, and remaining quota. Make it easy to see if you’re approaching limits.

Alerts at threshold levels: Don’t wait until you hit the limit. Alert when you reach 70% of quota, then 85%, then 95%. This gives you time to respond.
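A sketch of that threshold logic; `send_alert` stands in for whatever paging or chat integration you actually use.

```python
ALERT_THRESHOLDS = (0.70, 0.85, 0.95)  # alert at 70%, 85%, 95% of quota

def check_quota_alerts(used_tpm: int, tpm_limit: int, already_alerted: set,
                       send_alert=print) -> None:
    """Fire one alert per threshold crossed in the current window."""
    utilization = used_tpm / tpm_limit
    for threshold in ALERT_THRESHOLDS:
        if utilization >= threshold and threshold not in already_alerted:
            send_alert(f"TPM at {utilization:.0%} of quota (threshold {threshold:.0%})")
            already_alerted.add(threshold)
```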

Historical usage tracking: Store usage data over time. Identify trends (is usage growing steadily? are there seasonal spikes?) and plan capacity accordingly.

Requesting Quota Increases

When your usage legitimately exceeds available quotas, request increases from your provider.

Provide usage data: Show historical usage and projected growth. Providers are more likely to approve increases if you demonstrate need.

Explain your use case: Some use cases (production applications, enterprise customers) get higher priority than others (personal projects, experimentation).

Plan ahead: Quota increase requests can take days to approve. Don’t wait until you’re hitting limits—request increases before you need them.

Load Shedding Strategies

When you can’t serve all traffic, decide what to drop.

Reject low-priority requests: Return errors for non-critical features before failing critical ones.

Degrade quality: Use cheaper, faster models for some requests. GPT-3.5 instead of GPT-4, smaller context windows, fewer retries.

Queue and delay: Instead of rejecting requests, queue them and process when quota is available. Let users know there’s a delay.

Communicate clearly: Don’t let requests silently fail. Tell users “High demand—your request is queued” or “Service temporarily degraded—responses may be slower.”
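Pulling these together, here is a sketch of a load-shedding decision function: under light pressure everything is served normally, under heavier pressure low-priority requests fall back to a cheaper model or get queued. The thresholds, model names, and messages are placeholders.

```python
def route_request(priority: str, quota_utilization: float) -> dict:
    """Decide how to handle a request given current quota pressure (illustrative)."""
    if quota_utilization < 0.8:
        return {"action": "serve", "model": "large-model"}
    if quota_utilization < 0.95:
        if priority == "high":
            return {"action": "serve", "model": "large-model"}
        # Degrade: cheaper model, and tell the user responses may be slower or simpler.
        return {"action": "serve", "model": "small-model",
                "notice": "Service temporarily degraded"}
    # Near the hard limit: only high-priority traffic gets through, the rest is queued.
    if priority == "high":
        return {"action": "serve", "model": "small-model"}
    return {"action": "queue", "notice": "High demand: your request is queued"}
```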

Caching to Reduce Quota Usage

The best way to stay within quotas: don’t make unnecessary API calls.

Cache common queries: If many users ask similar questions, cache responses and reuse them. This saves quota and improves latency.

Prompt caching: Use provider-level prompt caching to avoid re-processing long system prompts on every request.

Result reuse: For deterministic queries (like “What’s the capital of France?”), cache aggressively. No need to call the API every time.
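A minimal in-memory cache keyed on a normalized prompt, with a TTL. In production you would likely use Redis or similar, and the whitespace-and-lowercase normalization shown is a simplistic assumption.

```python
import hashlib
import time

class ResponseCache:
    """Tiny in-memory cache for LLM responses, keyed by normalized prompt."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, response)

    @staticmethod
    def _key(prompt: str) -> str:
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        """Return a cached response if present and not expired, else None."""
        entry = self.store.get(self._key(prompt))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, prompt: str, response: str) -> None:
        self.store[self._key(prompt)] = (time.time(), response)
```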

Batching Requests

Some LLM APIs support batching multiple prompts into a single request.

How it works: Send 10 prompts in one API call instead of 10 separate calls. This reduces RPM usage (1 request instead of 10) but not TPM (you still process the same tokens).

When it helps: If you’re RPM-limited but not TPM-limited, batching improves throughput.

When it doesn’t help: If you’re TPM-limited, batching doesn’t reduce token usage. And it increases latency (you wait for all responses, not just the first).
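A sketch of grouping prompts and sending one call per group. `call_llm_batch` is a placeholder: some providers expose a batch endpoint, others require you to issue concurrent requests yourself, so check your provider's API before assuming this shape.

```python
def process_in_batches(prompts: list, call_llm_batch, batch_size: int = 10) -> list:
    """Group prompts into batches of `batch_size` and send one call per batch.

    Cuts RPM usage roughly by a factor of `batch_size`, but total token usage
    (TPM) is unchanged, and callers wait for the whole batch to finish.
    """
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        results.extend(call_llm_batch(batch))  # placeholder for your provider's batch API
    return results
```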

Multi-Provider Strategies

Don’t rely on a single LLM provider.

Fallback providers: If OpenAI is rate-limited, fall back to Anthropic or a self-hosted model. This improves reliability but adds complexity (you need to support multiple APIs and models).
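A sketch of that fallback path; `primary_call` and `fallback_call` are whatever client wrappers you have for each provider, and the broad exception handling is deliberately generic for illustration.

```python
def generate_with_fallback(prompt: str, primary_call, fallback_call):
    """Try the primary provider; fall back on rate-limit or transient failure."""
    try:
        return primary_call(prompt)
    except Exception as err:  # in practice, catch the provider's rate-limit error type
        # Log the failover so you can see how often the primary is saturated.
        print(f"Primary provider failed ({err!r}); falling back")
        return fallback_call(prompt)
```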

Load balancing: Distribute traffic across multiple providers or accounts. This effectively multiplies your quota.

Provider-specific routing: Route different use cases to different providers. Use OpenAI for complex reasoning, Anthropic for long-context tasks, self-hosted models for simple queries.

Self-Hosting to Avoid External Limits

If external API quotas are a bottleneck, consider self-hosting models.

Pros: No rate limits (beyond your hardware). Predictable costs. Full control.

Cons: Upfront infrastructure cost. Operational complexity. Need GPU expertise.

When it makes sense: High-volume applications where LLM API costs are a significant line item, or where rate limits are a frequent problem.

Agent and Loop Protections

Agents that iterate can burn through quotas quickly.

Max iterations per agent session: Cap how many LLM calls an agent can make. This prevents runaway loops.

Token budgets per session: Allocate a fixed token budget per user session. When exhausted, terminate gracefully.
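A sketch of per-session guards on agent loops; the caps are illustrative, and `run_step` is a stand-in for one agent iteration that returns its result and the tokens it consumed.

```python
class AgentBudget:
    """Caps LLM calls and tokens for a single agent session (illustrative caps)."""

    def __init__(self, max_calls: int = 15, max_tokens: int = 50_000):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0

    def allow(self, estimated_tokens: int) -> bool:
        return (self.calls < self.max_calls
                and self.tokens + estimated_tokens <= self.max_tokens)

    def record(self, actual_tokens: int) -> None:
        self.calls += 1
        self.tokens += actual_tokens

def run_agent(run_step, budget: AgentBudget):
    """Run agent iterations until the task finishes or the session budget is exhausted."""
    while True:
        if not budget.allow(estimated_tokens=1_000):  # rough per-step estimate
            return {"status": "budget_exhausted"}     # terminate gracefully
        result, tokens_used = run_step()              # one LLM call / tool step
        budget.record(tokens_used)
        if result.get("done"):
            return {"status": "complete", "result": result}
```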

Anomaly detection: If a single session is consuming 10x normal quota, flag it and investigate. Could be a bug, a malicious user, or an edge case.

What Good Looks Like

A well-managed quota system:

  • Tracks RPM, TPM, and other limits in real-time
  • Alerts before hitting limits, not after
  • Implements client-side rate limiting to stay within quotas
  • Prioritizes critical traffic when quota is tight
  • Uses caching aggressively to reduce unnecessary API calls
  • Has fallback strategies (degradation, queuing, multi-provider)
  • Monitors per-user usage to prevent abuse
  • Plans ahead for quota increases based on usage trends

Rate limiting isn’t glamorous, but it’s essential. Without it, your application fails unpredictably when traffic spikes or a few users consume excessive quota.

Build quota management into your system from day one. Monitor usage, set alerts, implement protections, and plan for growth. Your future self (and your users) will thank you.