Every API that's reachable from the internet needs rate limiting. Without it, a single misbehaving client — whether a buggy script or an intentional attacker — can degrade service for everyone else. Rate limiting is not complicated to implement, but the algorithm you choose and how you communicate limits to clients make a real difference in practice.
Why Rate Limiting Exists
The reasons stack. At the lowest level, your server has finite resources — CPU, memory, database connections, third-party API quotas. A client hammering 1,000 requests per second can exhaust those resources and starve every other client.
Beyond raw capacity, rate limiting protects against specific abuse patterns: credential stuffing (automated login attempts), scraping (bulk data extraction), DDoS amplification (using your API as an attack multiplier), and runaway bugs in client code that loops without backoff.
It also enables fair usage — giving all clients a reasonable slice of capacity rather than letting aggressive ones crowd out the rest.
Fixed Window
The simplest algorithm. Divide time into fixed intervals (say, 1 minute), count requests per window, reject once the count exceeds the limit.
Window: 00:00 - 01:00 → 100 requests allowed
Window: 01:00 - 02:00 → 100 requests allowed, counter resets
Implementation is trivial: one counter per client, reset on the minute boundary. Redis makes this easy:
// Assumes a connected node-redis v4 client named `redis`.
async function isRateLimited(clientId) {
  // One counter per client per minute number; incrementing creates the key.
  const key = `ratelimit:${clientId}:${Math.floor(Date.now() / 60000)}`;
  const count = await redis.incr(key);
  // Set the TTL on first increment so stale window keys expire on their own.
  if (count === 1) await redis.expire(key, 60);
  return count > 100;
}
The weakness is the boundary burst: a client can send 100 requests at 00:59 and 100 more at 01:01, effectively getting 200 requests in 2 seconds while technically staying within limits. For most APIs that's fine; for sensitive endpoints it isn't.
Sliding Window
Sliding window tracks requests within the last N seconds relative to the current moment, not relative to a fixed clock boundary. The burst problem disappears.
A practical approximation uses two fixed-window counters (current and previous window) and weights them by how far into the current window you are:
effective_count = previous_count × (1 - elapsed_fraction) + current_count
This is how many Redis-based rate limiters work in practice — it's much cheaper than storing a timestamp for every request and gives a close enough approximation of a true sliding window.
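A minimal sketch of that approximation, reusing the Redis client from the fixed-window example (the 60-second window and limit of 100 are assumptions):
// Two-counter sliding-window approximation. Assumes the same connected
// node-redis v4 client named `redis` as in the fixed-window example.
async function isRateLimitedSliding(clientId, limit = 100, windowMs = 60000) {
  const now = Date.now();
  const windowNo = Math.floor(now / windowMs);
  const elapsedFraction = (now % windowMs) / windowMs;
  const current = await redis.incr(`ratelimit:${clientId}:${windowNo}`);
  // Keep each counter around for two windows so the previous one stays readable.
  if (current === 1) {
    await redis.expire(`ratelimit:${clientId}:${windowNo}`, Math.ceil((windowMs * 2) / 1000));
  }
  const previous = Number(await redis.get(`ratelimit:${clientId}:${windowNo - 1}`)) || 0;
  // Weight the previous window by how much of it the sliding window still covers.
  const effectiveCount = previous * (1 - elapsedFraction) + current;
  return effectiveCount > limit;
}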
Token Bucket
Token bucket is the most common algorithm for production rate limiters. Think of it as a bucket that holds tokens. Each request consumes a token. Tokens refill at a steady rate up to the bucket's capacity.
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Client sends 100 requests instantly → bucket drains to 0
Client must wait 10 seconds to have 100 tokens again
Or client can send 10 requests/second indefinitely without waiting
Token bucket allows bursting — a client can use accumulated tokens quickly when needed, then fall back to the sustained rate. That matches how real usage behaves: a user might trigger 20 API calls by loading a dashboard, then be quiet for 30 seconds.
State you need per client: current token count and last refill timestamp. No fixed windows, no clock boundaries.
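A minimal in-memory sketch of exactly that state (single process only; a production limiter would keep this in a shared store, and the capacity and refill rate here are illustrative):
// In-memory token bucket: one { tokens, lastRefill } record per client.
const buckets = new Map();

function allowRequest(clientId, capacity = 100, refillPerSec = 10) {
  const now = Date.now();
  const bucket = buckets.get(clientId) ?? { tokens: capacity, lastRefill: now };
  // Refill lazily based on elapsed time, capped at the bucket's capacity.
  const elapsedSec = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSec * refillPerSec);
  bucket.lastRefill = now;
  const allowed = bucket.tokens >= 1;
  if (allowed) bucket.tokens -= 1; // each request consumes one token
  buckets.set(clientId, bucket);
  return allowed;
}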
Leaky Bucket
Leaky bucket works the other way: requests enter a queue at any rate, but they're drained and processed at a constant rate. Overflow is dropped or rejected.
Queue depth: 10 requests
Process rate: 5 requests/second
Burst of 10 → all queued, processed over 2 seconds
Burst of 20 → 10 queued, 10 rejected
Leaky bucket enforces a smooth outbound rate even when inbound traffic is bursty. It's the right choice when protecting a downstream dependency that can't handle bursts — a payment processor or third-party API with its own rate limit. For protecting your own API from clients, token bucket usually fits better because it's more forgiving about short bursts.
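A sketch of the queue-and-drain mechanic; the depth, rate, and processing callback are all illustrative:
// Leaky bucket: callers enqueue at any rate; a timer drains the queue at
// a constant rate. Returns false when the queue (the bucket) overflows.
function createLeakyBucket(processFn, maxDepth = 10, ratePerSec = 5) {
  const queue = [];
  setInterval(() => {
    const next = queue.shift();
    if (next !== undefined) processFn(next); // steady outbound rate
  }, 1000 / ratePerSec);
  return (request) => {
    if (queue.length >= maxDepth) return false; // overflow rejected
    queue.push(request);
    return true;
  };
}

// Usage sketch: const submit = createLeakyBucket((req) => callDownstream(req));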
What a 429 Response Should Look Like
When you reject a request due to rate limiting, the correct HTTP status is 429 Too Many Requests. The response should include a Retry-After header so clients know when they can try again.
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Retry after 30 seconds.",
  "retry_after": 30
}
Retry-After accepts either a number of seconds or an HTTP date. Seconds is simpler and less error-prone. The 429 status itself was introduced by RFC 6585. See HTTP Status Codes Guide for context on the 4xx range.
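As a sketch, producing that response shape in Express might look like the following; sendRateLimited is a hypothetical helper, and secondsUntilReset would come from your limiter's state:
// Hypothetical helper: send the 429 body and Retry-After header shown above.
function sendRateLimited(res, secondsUntilReset) {
  res.set('Retry-After', String(secondsUntilReset));
  res.status(429).json({
    error: 'rate_limit_exceeded',
    message: `Too many requests. Retry after ${secondsUntilReset} seconds.`,
    retry_after: secondsUntilReset,
  });
}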
Rate Limit Headers
Beyond 429 responses, proactive headers let well-behaved clients throttle themselves before hitting the limit. The emerging standard (from the IETF RateLimit Headers draft) uses:
RateLimit-Limit: 100
RateLimit-Remaining: 73
RateLimit-Reset: 1715200800
Many APIs use X- prefixed variants — they predate the standardization effort:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1715200800
Reset is typically a Unix timestamp (when the window or bucket refills). Some APIs include Retry-After on non-429 responses too, as a hint when the client is getting close to the limit.
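If you're rolling your own limiter rather than using middleware that sets these for you, attaching the headers is a few lines. A sketch, assuming a hypothetical limiter.check() that returns { limit, remaining, resetAt }:
// Hypothetical middleware: limiter.check() is assumed to return the
// client's current { limit, remaining, resetAt } state.
app.use(async (req, res, next) => {
  const { limit, remaining, resetAt } = await limiter.check(req.ip);
  res.set('RateLimit-Limit', String(limit));
  res.set('RateLimit-Remaining', String(remaining));
  res.set('RateLimit-Reset', String(Math.ceil(resetAt / 1000))); // Unix seconds
  next();
});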
GitHub, Stripe, and Cloudflare all use variations of this pattern. Whatever you pick, be consistent and document it.
Client-Side Strategies
If you're writing a client that calls a rate-limited API, the most important behavior to implement is exponential backoff with jitter.
When you get a 429 (or a 503), don't immediately retry. Wait, then retry. If it fails again, wait longer:
async function fetchWithRetry(url, options, attempt = 0) {
  const response = await fetch(url, options);
  if (response.status === 429 || response.status === 503) {
    if (attempt >= 5) throw new Error('Max retries exceeded');
    // Prefer the server's Retry-After hint (seconds). Note it may also be
    // an HTTP date, which parseInt won't handle, hence the NaN fallback.
    const retryAfter = Number.parseInt(response.headers.get('Retry-After') ?? '', 10);
    const baseDelay = Number.isFinite(retryAfter) ? retryAfter * 1000 : 1000 * 2 ** attempt;
    // Add jitter: randomize within ±25% of base delay
    const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
    const delay = Math.round(baseDelay + jitter);
    await new Promise(resolve => setTimeout(resolve, delay));
    return fetchWithRetry(url, options, attempt + 1);
  }
  return response;
}
Why jitter? Without it, a burst of 1,000 clients all hitting a rate limit simultaneously will all retry after the same delay — producing another synchronized burst. Jitter spreads the retries over time and prevents the thundering herd. AWS provides a useful breakdown of this pattern in their exponential backoff and jitter guide.
Rate Limiting by IP vs API Key vs User
The unit of rate limiting matters.
By IP address is the baseline — no auth required, easy to implement. Weakness: shared IP addresses (NAT, corporate proxies, VPNs) mean many legitimate users share a limit. Also relatively easy to circumvent with proxies.
By API key is the standard for developer APIs. Keys are cheap to issue and revoke, and you can set different limits per tier (free/paid/enterprise). Extract the key from the Authorization header before applying the limit:
const apiKey = req.headers['authorization']?.replace(/^Bearer\s+/i, '');
const limitKey = apiKey || req.ip; // fall back to IP if no key
By authenticated user ID works well for logged-in products. One user's usage doesn't affect others sharing a NAT. Can be combined with IP limiting for unauthenticated endpoints.
By endpoint — applying different limits to different operations is common and reasonable. A search endpoint that hits the database should have a tighter limit than a health check. Mutation endpoints (POST/PUT/DELETE) often warrant a lower limit than read endpoints.
Implementation Notes
For a Node.js/Express backend, the express-rate-limit package covers most common use cases. Pair it with rate-limit-redis if you're running multiple server instances — otherwise each instance has its own counter and the limits are effectively multiplied by your instance count.
const rateLimit = require('express-rate-limit');

const apiLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100,
  standardHeaders: true, // Return RateLimit-* headers
  legacyHeaders: false,
  message: { error: 'rate_limit_exceeded', retry_after: 60 },
});

app.use('/api/', apiLimiter);
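Per-endpoint and per-key variants are a few more lines. This sketch uses the package's keyGenerator option to limit by API key when one is present; the routes and values are illustrative:
// Tighter limit on auth, per-API-key limit on search (illustrative values).
const loginLimiter = rateLimit({ windowMs: 60 * 1000, max: 10 });
const searchLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 60,
  // Limit by API key when present, falling back to IP.
  keyGenerator: (req) => req.headers['authorization']?.replace(/^Bearer\s+/i, '') || req.ip,
});

app.use('/api/login', loginLimiter);
app.use('/api/search', searchLimiter);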
Set trust proxy correctly if you're behind a load balancer or reverse proxy. Without it, req.ip returns the proxy's IP and you'll rate-limit your entire user base as a single client.
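For example, behind a single load balancer (the right hop count depends on your topology):
// Trust exactly one proxy hop so req.ip reflects the forwarded client
// address rather than the load balancer's.
app.set('trust proxy', 1);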
Formatting API Responses
When debugging rate limit behavior, inspecting the JSON error responses and headers is the fastest way to understand what's happening. The JSON Formatter helps when working with nested rate limit error bodies from third-party APIs. If you're crafting query strings that include rate limit parameters or debugging callback URLs, the URL Encoder saves time on encoding edge cases.
For a broader view of RESTful API design including error response conventions, see REST API Design Best Practices.
FAQ
Should I rate-limit by IP, API key, or user ID?
By API key when available — it's stable, cheap to revoke, and lets you tier limits per customer. Fall back to IP for unauthenticated endpoints. User ID is good for logged-in apps where users don't have API keys. The pattern most production systems use is layered: IP limit on auth endpoints + API key/user limit on authenticated endpoints.
Token bucket or fixed window?
Token bucket for most production APIs — it allows reasonable bursts (which match real user behavior) while enforcing a sustained rate. Fixed window is simpler but has the boundary-burst problem (200 requests crossing the minute boundary). Sliding window is in between. Pick token bucket unless you have a specific reason not to; it's the approach Stripe describes using, and most large API providers do something similar.
What's the right rate limit value?
Depends entirely on your endpoint cost. Read-heavy endpoints (cached responses): 1000+/min per user. Write endpoints touching the database: 60-300/min. Login/auth: 5-10/min per IP to slow brute force. Expensive operations (image processing, exports): 10-30/hour. Start conservative and tune up; tightening a limit after release angers users.
Why does my rate limiter let through more requests than the limit suggests?
Almost always one of three things: (1) you're running multiple server instances and each has its own counter (use Redis-backed limiters), (2) the boundary-burst issue with fixed windows, (3) trust proxy is wrong so you're rate-limiting on the load balancer's IP rather than the client's. Check req.ip — if it's always the same value, fix the proxy config.
How should I handle rate limits across multiple servers?
Use a shared store — Redis is the standard. Each server reads/writes counters from the same Redis cluster, so the limit applies to all traffic regardless of which server it hits. The rate-limit-redis package for express-rate-limit and similar plugins for other frameworks handle this in a few lines.
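A sketch of that wiring, assuming a connected node-redis v4 client named redis and rate-limit-redis's RedisStore export:
// Shared counters across all instances via Redis (assumed connected client).
const { RedisStore } = require('rate-limit-redis');

const sharedLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 100,
  store: new RedisStore({
    sendCommand: (...args) => redis.sendCommand(args), // node-redis v4 adapter
  }),
});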
What headers should I return for rate limit info?
The IETF draft standardizes RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset. The legacy X-RateLimit-* variants are still common (GitHub, Stripe). Pick one set and document it. Always include Retry-After on 429 responses (in seconds, not date format — simpler and less error-prone).
Should I return 429 or 503 when rate-limited?
429 Too Many Requests for client-specific limits ("you've hit your quota"). 503 Service Unavailable for system-wide protection ("the whole API is overwhelmed"). They mean different things to clients: 429 says "back off, retry"; 503 says "the service is down for everyone." Don't conflate them.
How do I prevent thundering herd retries?
Add jitter to exponential backoff. Without jitter, 1000 rate-limited clients all retry at exactly the same delay and synchronize into another burst. With jitter (randomize within ±25% of base delay), retries spread out over time. AWS's "exponential backoff with jitter" pattern is the canonical reference; implement it client-side or document it for API consumers.