Rate Limiting Explained

API Security · June 27, 2026 · 18 min read · By Keyur Patel

Every production API needs rate limiting. Without it, a single misbehaving client — or a deliberate attacker — can overwhelm your server with thousands of requests per second, crashing your application and affecting all users. Rate limiting controls how many requests each client can make within a time window, protecting your infrastructure while ensuring fair access for everyone.

This guide covers what rate limiting is, the four major algorithms (and when to use each), HTTP 429 responses, rate limit headers, a practical Node.js + Redis implementation, client-side handling, and real-world examples from GitHub, Stripe, OpenAI, and Twitter/X.

What Is Rate Limiting?

Rate limiting is a technique that restricts the number of requests a client can make to an API within a specific time period. When a client exceeds the limit, the server rejects further requests with an HTTP 429 "Too Many Requests" response until the time window resets.

Real-world analogy: Think of a theme park with limited capacity. Only 100 visitors can enter per hour. The first 100 get in immediately. Visitor 101 is told to wait until the next hour. This prevents overcrowding (server overload) and ensures everyone inside has a good experience (acceptable response times).

Why Rate Limiting Matters

Threat	How rate limiting helps
DDoS attacks	Limits the damage from request floods by capping each IP, though large distributed attacks need additional defenses
Brute force login	Limits password attempts: at 5 attempts/min, an attacker can try 5 passwords per minute instead of thousands
API abuse (scraping)	Prevents one client from consuming all server resources and leaving nothing for others
Accidental infinite loops	A client bug that retries endlessly gets stopped at the rate limiter
Cost control	Every request costs compute, bandwidth, and database queries; limits cap those costs
Fair usage	Ensures paid tiers get more access while free tiers are capped appropriately

Rate Limiting Algorithms

There are four main algorithms for implementing rate limiting. Each has different trade-offs in accuracy, memory usage, burst handling, and implementation complexity.

1. Fixed Window Counter

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute blocks). Count requests in the current window. If the count exceeds the limit, reject. When the window ends, reset to zero.

Problem: with a limit of 100 requests per minute, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01, which is 200 requests within 2 seconds, twice the per-minute limit. This "boundary burst" is Fixed Window's weakness.

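// Fixed Window in Redis (ioredis client)
const Redis = require("ioredis")
const redis = new Redis()

async function fixedWindowLimit(clientId, limit, windowSec) {
  // One key per client per window; the window index changes every windowSec seconds
  const windowIndex = Math.floor(Date.now() / 1000 / windowSec)
  const key = `ratelimit:${clientId}:${windowIndex}`
  // MULTI runs INCR + EXPIRE together, so a key is never left without a TTL
  const [[, count]] = await redis.multi().incr(key).expire(key, windowSec).exec()
  return count <= limit
}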
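To make the boundary burst concrete, here is a minimal in-memory sketch (no Redis) that replays the 11:59:59 / 12:00:01 scenario with explicit timestamps. `FixedWindowLimiter` and its `nowMs` argument are illustrative names, not part of any library.

```javascript
// In-memory fixed window limiter with an explicit clock (illustrative sketch)
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit
    this.windowMs = windowMs
    // "clientId:windowIndex" -> count. Old windows are never evicted here;
    // the Redis version relies on EXPIRE for that.
    this.counts = new Map()
  }

  allow(clientId, nowMs) {
    const windowIndex = Math.floor(nowMs / this.windowMs)
    const key = `${clientId}:${windowIndex}`
    const count = (this.counts.get(key) || 0) + 1
    this.counts.set(key, count)
    return count <= this.limit
  }
}

// Replay the boundary burst with a limit of 100 requests per minute
const limiter = new FixedWindowLimiter(100, 60_000)
const justBefore = Date.UTC(2026, 5, 27, 11, 59, 59) // 11:59:59 UTC
const justAfter = Date.UTC(2026, 5, 27, 12, 0, 1)    // 12:00:01 UTC

let allowed = 0
for (let i = 0; i < 100; i++) if (limiter.allow("client-a", justBefore)) allowed++
for (let i = 0; i < 100; i++) if (limiter.allow("client-a", justAfter)) allowed++
console.log(`Allowed ${allowed} requests in 2 seconds`) // prints "Allowed 200 requests in 2 seconds"
```

Every request is accepted because the two timestamps fall into different one-minute windows, even though they are only 2 seconds apart.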

2. Sliding Window

Solves the boundary burst problem by considering a "sliding" time window that moves with the current time. Instead of fixed blocks, it looks at the last N seconds from right now. More accurate than Fixed Window but uses more memory (needs to track individual request timestamps).
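A minimal in-memory sketch of the timestamp-tracking approach (often called a sliding window log), assuming a single process; `SlidingWindowLog` is an illustrative name. It stores one timestamp per accepted request, which is where the extra memory goes.

```javascript
// Sliding window log: keep timestamps of accepted requests per client (in-memory sketch)
class SlidingWindowLog {
  constructor(limit, windowMs) {
    this.limit = limit
    this.windowMs = windowMs
    this.logs = new Map() // clientId -> ascending timestamps of accepted requests
  }

  allow(clientId, nowMs = Date.now()) {
    if (!this.logs.has(clientId)) this.logs.set(clientId, [])
    const log = this.logs.get(clientId)
    // Drop timestamps that have slid out of the window (nowMs - windowMs, nowMs]
    while (log.length > 0 && log[0] <= nowMs - this.windowMs) log.shift()
    if (log.length >= this.limit) return false
    log.push(nowMs)
    return true
  }
}

// The 11:59:59 / 12:00:01 style burst is now blocked: both halves share one window
const slidingLog = new SlidingWindowLog(100, 60_000)
let accepted = 0
for (let i = 0; i < 100; i++) if (slidingLog.allow("client-a", 59_000)) accepted++
for (let i = 0; i < 100; i++) if (slidingLog.allow("client-a", 61_000)) accepted++
console.log(accepted) // prints 100
```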

The Sliding Window Counter variant approximates this by weighting the previous window's count. Example: if you're 30% through the current window, count = (current_count) + (previous_count × 0.7). This gives good accuracy with minimal memory.
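The weighting formula fits in a small function. This sketch covers only the estimate: `slidingWindowEstimate` and `allowSlidingWindow` are illustrative names, windows are assumed to be aligned to multiples of `windowMs`, and storing the two counters (for example, as two fixed-window Redis keys) is left out.

```javascript
// Sliding window counter: estimate requests in the last windowMs from two fixed-window counts
function slidingWindowEstimate(currentCount, previousCount, nowMs, windowMs) {
  // Fraction of the current fixed window that has already elapsed (0 <= elapsed < 1)
  const elapsed = (nowMs % windowMs) / windowMs
  // The previous window still overlaps the sliding window by (1 - elapsed)
  return currentCount + previousCount * (1 - elapsed)
}

function allowSlidingWindow(currentCount, previousCount, nowMs, windowMs, limit) {
  // +1 accounts for the request being evaluated right now
  return slidingWindowEstimate(currentCount, previousCount, nowMs, windowMs) + 1 <= limit
}

// 30% through a 60 s window: 20 requests so far, 80 in the previous window
const estimate = slidingWindowEstimate(20, 80, 18_000, 60_000)
console.log(estimate) // 20 + 80 × 0.7 ≈ 76
```

Only two counters per client are needed, which is why this variant is much cheaper than keeping a timestamp per request.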

3. Token Bucket

Imagine a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected. If the bucket is full, extra tokens are discarded (max capacity = burst limit).

Token Bucket is one of the most widely used algorithms because it naturally supports bursts. If a client hasn't made requests for a while, its bucket is full, so it can send a burst of requests immediately. This matches real user behavior (idle periods followed by activity spikes).

// Token Bucket conceptual implementation
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity       // Max tokens (burst limit)
    this.tokens = capacity         // Start full
    this.refillRate = refillRate   // Tokens per second
    this.lastRefill = Date.now()
  }

  consume() {
    this.refill()
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true  // Request allowed
    }
    return false   // Rate limited (429)
  }

  refill() {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate)
    this.lastRefill = now
  }
}
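Because the class above reads `Date.now()`, its burst-then-refill behavior is hard to observe in a quick test. The sketch below is the same algorithm with the clock passed in explicitly (`createTokenBucket` and `nowMs` are illustrative names), so an idle period followed by a burst can be replayed deterministically.

```javascript
// Token bucket with an explicit clock, so behavior is deterministic (sketch)
function createTokenBucket(capacity, refillRate, startMs = 0) {
  let tokens = capacity   // Start full
  let lastRefill = startMs

  return function tryConsume(nowMs) {
    // Refill for the time elapsed since the last call, capped at capacity
    const elapsedSec = Math.max(0, nowMs - lastRefill) / 1000
    tokens = Math.min(capacity, tokens + elapsedSec * refillRate)
    lastRefill = nowMs
    if (tokens >= 1) {
      tokens -= 1
      return true  // Request allowed
    }
    return false   // Rate limited (429)
  }
}

// Capacity 5, refill 1 token/second; seven requests arrive at the same instant
const tryConsume = createTokenBucket(5, 1)
const burst = [0, 0, 0, 0, 0, 0, 0].map(() => tryConsume(1000))
console.log(burst)            // first 5 allowed, last 2 rejected
console.log(tryConsume(3000)) // 2 s later, 2 tokens have refilled: allowed
```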

4. Leaky Bucket

Imagine a bucket with a hole at the bottom. Requests flow in at any rate (filling the bucket), but they flow out (are processed) at a constant rate through the hole. If the bucket overflows (too many pending requests), new ones are rejected.

Leaky Bucket smooths traffic — regardless of how bursty the input is, output is always at a constant rate. This provides predictable server load but doesn't allow legitimate bursts. It's ideal for systems that need consistent throughput (video streaming, network queues).
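A minimal in-memory sketch of the queue form described above, with an explicit clock. `LeakyBucket`, `leakRatePerSec`, and `offer` are hypothetical names; instead of running workers, `offer` returns the time at which the request would be processed, which makes the constant outflow visible.

```javascript
// Leaky bucket as a queue: accept while there is room, release at a constant rate (sketch)
class LeakyBucket {
  constructor(capacity, leakRatePerSec) {
    this.capacity = capacity                // Max requests waiting in the bucket
    this.intervalMs = 1000 / leakRatePerSec // Gap between processed requests
    this.pending = []                       // Scheduled processing times, ascending
    this.nextFreeMs = -Infinity             // When the "hole" can release the next request
  }

  // Returns the processing time for the request, or null if the bucket overflows
  offer(nowMs) {
    // Requests whose processing time has arrived have leaked out of the bucket
    while (this.pending.length > 0 && this.pending[0] <= nowMs) this.pending.shift()
    if (this.pending.length >= this.capacity) return null // Overflow: reject (429)
    const startAt = Math.max(nowMs, this.nextFreeMs)
    this.nextFreeMs = startAt + this.intervalMs
    this.pending.push(startAt)
    return startAt
  }
}

// Room for 3 waiting requests, draining 2 per second (one every 500 ms)
const bucket = new LeakyBucket(3, 2)
const schedule = [0, 0, 0, 0, 0, 0].map(() => bucket.offer(0))
console.log(schedule) // processing times spaced 500 ms apart; overflowing requests get null
```

However bursty the arrivals, accepted requests leave the bucket evenly spaced, which is the smoothing property described above.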

Algorithm Comparison

Feature	Fixed Window	Sliding Window	Token Bucket	Leaky Bucket
Accuracy	Low (boundary bursts)	High	High	High
Burst support	Unintended at boundaries	No	Yes (built-in)	No (smooths traffic)
Memory usage	Very low (1 counter)	High for log (timestamps); low for counter variant	Low (2 values)	Low (queue size)
Implementation	Very simple	Medium	Medium	Medium
Distributed (Redis)	Easy	Moderate	Easy	Moderate
Best for	Simple internal APIs	Accurate public APIs	Most REST APIs	Streaming/network

Rate Limiting HTTP Headers

Well-designed APIs communicate rate limit status in response headers so clients can manage their request rate proactively:

// Rate limit response headers (RateLimit-* comes from an IETF draft; X-RateLimit-* is the older de facto convention)
HTTP/1.1 200 OK
RateLimit-Limit: 100              → Maximum requests per window
RateLimit-Remaining: 73           → Requests remaining in current window
RateLimit-Reset: 45               → Seconds until the window resets
X-RateLimit-Limit: 100            → (legacy non-standard version)
X-RateLimit-Remaining: 73         → (legacy non-standard version)
X-RateLimit-Reset: 1719216060     → (legacy; often a Unix timestamp, as in GitHub's API)

// When rate limited (429 response)
HTTP/1.1 429 Too Many Requests
Retry-After: 45                   → Seconds until client can retry
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 45
Content-Type: application/json

{"error": "Rate limit exceeded", "retryAfter": 45}

Always include these headers — they let well-behaved clients back off before hitting the limit, reducing unnecessary 429 responses and improving overall API throughput.
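A small helper can keep these headers consistent between allowed and rejected responses. This is a sketch under assumptions: the limiter state passed in (`limit`, `remaining`, `resetInSec`, optional `retryAfterSec`) is hypothetical, `RateLimit-Reset` is written as delta seconds as in the IETF draft, and `Retry-After` is added only once the quota is exhausted.

```javascript
// Build rate limit headers from limiter state (sketch; the state shape is an assumption)
function rateLimitHeaders({ limit, remaining, resetInSec, retryAfterSec = resetInSec }) {
  const headers = {
    "RateLimit-Limit": String(limit),
    "RateLimit-Remaining": String(Math.max(0, Math.floor(remaining))),
    // Delta seconds until the quota resets, rounded up so clients never retry early
    "RateLimit-Reset": String(Math.max(0, Math.ceil(resetInSec))),
  }
  if (remaining < 1) {
    // Only rejected (429) responses need Retry-After. For a fixed window it equals
    // the reset delay; a token bucket can pass the (shorter) time to the next token.
    headers["Retry-After"] = String(Math.max(0, Math.ceil(retryAfterSec)))
  }
  return headers
}

console.log(rateLimitHeaders({ limit: 100, remaining: 73, resetInSec: 44.2 }))
console.log(rateLimitHeaders({ limit: 100, remaining: 0, resetInSec: 45 }))
```

In Express, the returned object can be passed directly to `res.set()`, which accepts an object of header fields.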

Implementing Rate Limiting (Node.js + Redis)

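// Express.js rate limiter using Redis (Token Bucket)
const express = require("express")
const Redis = require("ioredis")

const app = express()
const redis = new Redis()

// The whole read-refill-consume-write cycle runs as one Lua script, which Redis
// executes atomically: two concurrent requests can't both spend the same token
const TOKEN_BUCKET_LUA = `
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local state = redis.call("HMGET", KEYS[1], "tokens", "lastRefill")
local tokens = tonumber(state[1]) or capacity  -- a stored 0 stays 0 (0 is truthy in Lua)
local lastRefill = tonumber(state[2]) or now
local elapsed = math.max(0, now - lastRefill) / 1000
tokens = math.min(capacity, tokens + elapsed * refillRate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call("HSET", KEYS[1], "tokens", tostring(tokens), "lastRefill", ARGV[3])
redis.call("EXPIRE", KEYS[1], ARGV[4])
return { allowed, tostring(tokens) }
`

// Factory: returns middleware configured with its own limits
function rateLimit({ capacity = 100, refillRate = 10, name = "default" } = {}) {
  // Keep state until an empty bucket would have refilled completely
  const ttlSec = Math.ceil(capacity / refillRate) + 1

  return async function rateLimitMiddleware(req, res, next) {
    try {
      const clientId = req.user?.id ?? req.ip // Prefer user ID for authenticated users
      const key = `ratelimit:${name}:${clientId}`
      // Date.now() comes from the app server, so server clocks should be in sync
      const [allowed, tokenStr] = await redis.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refillRate, Date.now(), ttlSec
      )
      const tokens = parseFloat(tokenStr)
      res.set("RateLimit-Limit", capacity)
      res.set("RateLimit-Remaining", Math.floor(tokens))

      if (allowed !== 1) {
        // Seconds until one whole token has been refilled
        const retryAfter = Math.ceil((1 - tokens) / refillRate)
        res.set("Retry-After", retryAfter)
        return res.status(429).json({ error: "Too many requests", retryAfter })
      }
      next()
    } catch (err) {
      next(err) // Or call next() to fail open if a Redis outage shouldn't block traffic
    }
  }
}

// Placeholder handler (hypothetical) so the example runs
const loginHandler = (req, res) => res.json({ ok: true })

// Apply to all routes
app.use(rateLimit())

// Or per-endpoint with different limits: burst of 5, refilling 5 tokens per minute
app.post("/auth/login", rateLimit({ capacity: 5, refillRate: 5 / 60, name: "login" }), loginHandler)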

Client-Side: Handling 429 Responses

Good clients respect rate limits. Here is how to implement exponential backoff with jitter:

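// Exponential backoff with jitter
async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options)

    if (response.status !== 429) return response
    if (attempt === maxRetries) break // Out of retries: don't sleep again

    // Rate limited: prefer the server's Retry-After, which may be seconds or an HTTP date
    const retryAfter = response.headers.get("Retry-After")
    let waitTime
    if (retryAfter && /^\d+$/.test(retryAfter.trim())) {
      waitTime = parseInt(retryAfter, 10) * 1000
    } else if (retryAfter && !Number.isNaN(Date.parse(retryAfter))) {
      waitTime = Math.max(0, Date.parse(retryAfter) - Date.now())
    } else {
      // No usable header: 1s, 2s, 4s, ... capped at 30s
      waitTime = Math.min(1000 * 2 ** attempt, 30000)
    }
    // Up to 1s of random jitter so many clients don't retry in lockstep
    waitTime += Math.random() * 1000

    console.log(`Rate limited. Retrying in ${Math.round(waitTime)}ms (retry ${attempt + 1} of ${maxRetries})`)
    await new Promise(resolve => setTimeout(resolve, waitTime))
  }

  throw new Error("Max retries exceeded: still rate limited")
}

// Usage (inside an async function or an ES module)
const data = await fetchWithRetry("/api/users", {
  headers: { "Authorization": "Bearer token..." }
})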

Rate Limiting vs Throttling

Aspect	Rate Limiting	Throttling
Action taken	Rejects excess requests (429)	Delays/queues excess requests
User experience	Immediate failure feedback	Slower responses (waiting in queue)
Server load	Rejected requests cost very little to process	Queued requests still consume memory
Best for	APIs, microservices, public endpoints	Internal queues, background jobs
Implementation	Counter/token check → accept or reject	Queue with controlled drain rate
Client awareness	Client knows immediately (429)	Client may not know it's being throttled
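To contrast with the reject-immediately limiters above, here is a minimal throttling sketch: instead of answering 429, it computes how long each call must wait so that calls run at most once per `minIntervalMs`. `createThrottle` and its explicit clock argument are illustrative assumptions; a real server would also cap the number of waiting requests, since each one holds memory.

```javascript
// Throttle: delay excess calls instead of rejecting them (sketch, explicit clock)
function createThrottle(minIntervalMs) {
  let nextSlotMs = -Infinity // Earliest time the next call may run

  // Returns how many milliseconds the caller should wait before running
  return function delayFor(nowMs) {
    const runAt = Math.max(nowMs, nextSlotMs)
    nextSlotMs = runAt + minIntervalMs
    return runAt - nowMs
  }
}

// Allow one call every 200 ms; four calls arrive at once
const delayFor = createThrottle(200)
const delays = [0, 0, 0, 0].map(() => delayFor(0))
console.log(delays) // [0, 200, 400, 600]

// In async code, the delay would be applied before doing the work, e.g.:
// await new Promise(resolve => setTimeout(resolve, delayFor(Date.now())))
```

Every call eventually succeeds, but later ones wait longer, which is the "wait" rather than "no" behavior described in the table. It is essentially the leaky bucket's constant drain without the overflow rejection.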

Common Rate Limiting Mistakes

⚠️ No rate limiting at all

Many APIs launch without any rate limiting. One aggressive scraper or accidental infinite loop in a client can bring down the entire service.

✅ Fix: Add basic rate limiting from day one. Even a generous limit (1000/min) protects against catastrophic abuse.

⚠️ Per-server counters without shared storage

If you have 4 servers, a client can send 4x the intended limit by hitting each server separately. Each server's local counter only sees 1/4 of the traffic.

✅ Fix: Use Redis or a centralized store for rate limit counters. All servers share the same count per client.

⚠️ Not returning Retry-After header

Without Retry-After, clients don't know when to retry. They either hammer the API immediately (worsening load) or wait too long (poor UX).

✅ Fix: Always include Retry-After in 429 responses. Clients can programmatically respect it.

⚠️ Same limits for all endpoints

Login endpoints need much stricter limits (5/min) than read endpoints (1000/min). Applying the same limit everywhere either leaves sensitive endpoints exposed or makes the API unusable.

✅ Fix: Set per-endpoint limits: strict for auth/payment, generous for reads, moderate for writes.
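One way to express per-endpoint limits is a small policy table keyed by method and path prefix, with a default for everything else. The login (5/min) and read (1000/min) numbers come from this article; the payment and write numbers, the route prefixes, and the `policyFor` helper are hypothetical.

```javascript
// Per-endpoint rate limit policies (illustrative routes and numbers)
const POLICIES = [
  { name: "auth",     method: "POST", prefix: "/auth/",     limit: 5,    windowSec: 60 }, // Strict
  { name: "payments", method: "POST", prefix: "/payments/", limit: 10,   windowSec: 60 }, // Strict
  { name: "reads",    method: "GET",  prefix: "/",          limit: 1000, windowSec: 60 }, // Generous
]
const DEFAULT_POLICY = { name: "writes", limit: 100, windowSec: 60 } // Moderate

// First match wins, so list more specific prefixes before broader ones
function policyFor(method, path) {
  return POLICIES.find(p => p.method === method && path.startsWith(p.prefix)) || DEFAULT_POLICY
}

// Put the policy name in the counter key so each endpoint class has its own budget
const policy = policyFor("POST", "/auth/login")
const counterKey = `ratelimit:${policy.name}:user-42`
console.log(policy.limit, counterKey)
```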

⚠️ Rate limiting by IP only

Behind corporate NATs, hundreds or thousands of users can share one public IP, so limiting by IP alone can block an entire office. Attackers, meanwhile, can spread their requests across many IPs.

✅ Fix: Use API key or user ID for authenticated requests. Use IP only for unauthenticated/anonymous traffic.
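A sketch of choosing the rate limit key: authenticated requests are keyed by user ID or API key, and only anonymous traffic falls back to IP. The request shape (`req.user`, an `x-api-key` header, `req.ip`) is an assumption modeled on Express conventions, and the API key is assumed to have been validated upstream; otherwise an attacker could rotate fake keys to get fresh buckets.

```javascript
const crypto = require("crypto")

// Pick the identity to rate limit on (sketch; the request shape is an assumption)
function rateLimitKey(req) {
  if (req.user && req.user.id) return `user:${req.user.id}` // Authenticated session
  const apiKey = req.headers?.["x-api-key"]
  if (apiKey) {
    // Hash the key so raw credentials never appear in Redis key names
    const digest = crypto.createHash("sha256").update(apiKey).digest("hex")
    return `apikey:${digest}`
  }
  return `ip:${req.ip}` // Anonymous traffic only
}

console.log(rateLimitKey({ user: { id: 42 }, headers: {}, ip: "203.0.113.5" }))
console.log(rateLimitKey({ headers: {}, ip: "203.0.113.5" }))
```

Behind a load balancer or reverse proxy, Express only reports the real client address in `req.ip` when the `trust proxy` setting is configured.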

Best Practices

✓

Use Redis for distributed rate limiting

Atomic operations (INCR, EXPIRE), sub-ms performance, and shared state across all server instances.

✓

Always return HTTP 429 (not 403 or 500)

429 is the standard code for rate limiting. Clients can distinguish 'too many requests' from 'forbidden' or 'server error'.

✓

Include RateLimit-* headers in every response

Let clients see their remaining quota proactively — they can slow down before hitting the limit.

✓

Use Token Bucket for most APIs

Natural burst support matches real user behavior. Users are idle, then active in bursts. Token Bucket allows this.

✓

Different limits for different user tiers

Free: 60/hour, Pro: 1000/hour, Enterprise: 10000/hour. Incentivizes upgrades while protecting from abuse.
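Tiered limits can be as simple as a lookup from the account's plan to its quota. The tiers and hourly numbers mirror the example values above; `limitForTier` is a hypothetical helper, and falling back to the free quota for unknown or missing plans is one possible design choice.

```javascript
// Hourly request quotas per plan (example values from the list above)
const TIER_LIMITS = Object.freeze({ free: 60, pro: 1000, enterprise: 10000 })

// Unknown or missing plans get the most restrictive quota
function limitForTier(tier) {
  return Object.prototype.hasOwnProperty.call(TIER_LIMITS, tier)
    ? TIER_LIMITS[tier]
    : TIER_LIMITS.free
}

console.log(limitForTier("pro"), limitForTier("trial"))
```

The own-property check keeps inherited names such as `toString` from being treated as a plan.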

✓

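Stricter limits on auth/payment endpoints

Login and payment endpoints attract brute-force and fraud attempts, so give them far tighter limits (e.g., 5 login attempts/min) than ordinary reads.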

✓

Implement graceful degradation

When near capacity, serve cached responses or simplified data instead of hard 429s for read endpoints.

✓

Monitor and alert on rate limit violations

Track which clients hit limits frequently. Repeated violations may indicate abuse or a broken client.

✓

Document your rate limits publicly

Publish limits in API docs so developers can design clients that respect them from the start.

✓

Add jitter to client retry logic

If all rate-limited clients retry at exactly Retry-After seconds, they all hit simultaneously. Add random jitter to spread retries.

Real-World Rate Limiting Examples

GitHub API

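Unauthenticated: 60 req/hour. Authenticated: 5,000 req/hour. Search: 30 req/min (authenticated). Limits are tracked per user (per IP for unauthenticated requests), and X-RateLimit-Reset reports when the current window resets as a Unix timestamp.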

Headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset

Stripe API

Live mode: 100 read and 100 write operations/sec. Test mode: 25 of each. Higher limits can be requested for production accounts.

Headers: Standard RateLimit headers + detailed error messages

OpenAI API

Tiered by model and usage tier. Limits apply to both requests per minute (RPM) and tokens per minute (TPM), and they rise as an account moves up tiers.

Headers: x-ratelimit-limit-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-*

Twitter/X API

App-level and user-level limits. Tweets: 300/3h per user. Read and write caps vary by access tier. Uses per-endpoint windows.

Headers: x-rate-limit-limit, x-rate-limit-remaining, x-rate-limit-reset

Frequently Asked Questions

What is rate limiting?

Rate limiting is a technique that controls how many requests a client can make to an API within a specific time window. When the limit is exceeded, the server rejects further requests (usually with HTTP 429) until the window resets.

Why do APIs limit requests?

APIs limit requests to prevent abuse (DDoS, brute force), ensure fair usage among all clients, protect server resources from overload, control costs (each request costs compute/bandwidth), and maintain performance for all users.

What happens when you get HTTP 429?

HTTP 429 means 'Too Many Requests' — you've exceeded the rate limit. The response usually includes a Retry-After header telling you how many seconds to wait before retrying. Clients should implement exponential backoff.

What is the Retry-After header?

Retry-After is a response header sent with 429 (and 503) responses. It tells the client the minimum time to wait before making another request, either as seconds (Retry-After: 60) or as an HTTP date (Retry-After: Sun, 28 Jun 2026 12:00:00 GMT).
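Because the header can take either form, a client has to handle both. A minimal parser, assuming the caller wants milliseconds and treating anything unparseable as "no hint" (`null`); `parseRetryAfterMs` is an illustrative name, not a built-in.

```javascript
// Parse Retry-After as delay-seconds or an HTTP date; returns milliseconds or null
function parseRetryAfterMs(value, nowMs = Date.now()) {
  if (value == null) return null
  const trimmed = String(value).trim()
  // Check the seconds form first: some engines parse bare numbers like "45" as a year
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  const dateMs = Date.parse(trimmed) // e.g. "Sun, 28 Jun 2026 12:00:00 GMT"
  if (Number.isNaN(dateMs)) return null
  return Math.max(0, dateMs - nowMs) // A date in the past means "retry now"
}

console.log(parseRetryAfterMs("60")) // 60000
```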

Which rate limiting algorithm is best?

It depends on your needs. Token Bucket is best for APIs that need burst support. Sliding Window provides the most accurate limiting. Fixed Window is simplest to implement. Leaky Bucket is best for traffic smoothing. Most production APIs use Token Bucket or Sliding Window.

Token Bucket vs Leaky Bucket — what's the difference?

Token Bucket allows bursts (sends multiple requests quickly if tokens are available). Leaky Bucket smooths traffic (processes requests at a constant rate regardless of input bursts). Token Bucket is more flexible; Leaky Bucket provides more predictable server load.

Can authenticated users have higher limits?

Yes. Most APIs provide tiered limits: anonymous users get low limits (60/hour), authenticated free users get medium (1000/hour), and paid users get higher (10000/hour). This is standard practice for SaaS APIs.

Is rate limiting the same as throttling?

No. Rate limiting rejects excess requests immediately (429 response). Throttling slows down requests by queuing or delaying them. Rate limiting says 'no', throttling says 'wait'. Some systems combine both approaches.

Can rate limiting stop DDoS attacks?

Rate limiting helps with small-scale abuse but cannot fully stop distributed DDoS attacks (millions of IPs sending few requests each). DDoS protection requires additional layers: CDNs (Cloudflare), IP reputation systems, and network-level filtering.

Why use Redis for rate limiting?

Redis commands such as INCR and EXPIRE are atomic, and MULTI transactions or Lua scripts make multi-step updates atomic as well, which prevents race conditions. Redis also offers sub-millisecond performance and shared state across multiple server instances. Without Redis (or a similar shared store), each server tracks limits independently, so a client can bypass limits by hitting different servers.

Conclusion

Rate limiting is not optional for production APIs — it's a fundamental requirement for security, stability, and fair usage. Without it, a single misbehaving client can bring down your entire service. With proper rate limiting, your API remains fast and available for all users even under heavy load or attack.

For most REST APIs, start with Token Bucket (supports bursts, easy to implement with Redis) and return proper HTTP 429 responses with Retry-After headers. Set different limits per endpoint (strict for auth, generous for reads), per user tier (free vs paid), and monitor violations. Implement this from day one — retrofitting rate limiting into a production API under attack is much harder than building it in from the start.
