Reliability Patterns

Building resilient systems requires implementing patterns that handle failures gracefully. In distributed systems, failures are inevitable—networks go down, services get overloaded, databases become slow, and bugs slip through. The question isn't whether failures will happen, but how gracefully your system handles them.

This guide covers the essential reliability patterns every backend engineer should know to build systems that stay operational when things go wrong.


Retry with Exponential Backoff

The Problem

In distributed systems, failures are often transient—they resolve themselves given time. A network packet might be dropped, a database connection might be momentarily exhausted, or a dependent service might be restarting. If you simply retry immediately when a failure occurs, you can actually make the problem worse.

Consider this scenario: 1,000 clients simultaneously call a service that's overloaded. The service fails, and all 1,000 clients retry immediately. Now the service faces 2,000 requests and fails again. This creates a thundering herd that can bring down the entire system.

The Solution

Exponential backoff increases the wait time between retries exponentially (100ms, 200ms, 400ms, 800ms...). This gives the failing service time to recover while reducing load during the retry window.

Jitter adds randomness to the delay, ensuring that retries are spread out over time rather than happening in synchronized waves. Without jitter, all clients with the same retry configuration would retry at nearly the same time, recreating the thundering herd problem.
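Putting the two together, a retry loop with exponential backoff and full jitter can be sketched as follows (the helper name and defaults are illustrative, not a specific library's API):

```typescript
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;
      // Exponential backoff with full jitter: random delay in [0, base * 2^attempt)
      const delayMs = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

Because the delay is drawn uniformly from the full backoff window, two clients that fail at the same moment almost never retry at the same moment.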

When to Use Retry

Retry is appropriate for:

| Scenario | Should Retry? | Reason |
| --- | --- | --- |
| Network timeouts | ✅ Yes | Transient connection issues |
| 5xx server errors | ✅ Yes | Server is overloaded or restarting |
| Rate limit exceeded (429) | ✅ Yes | Wait and retry after the limit resets |
| 4xx client errors | ❌ No | Client errors won't fix themselves on retry |
| Authentication failures (401) | ❌ No | Invalid credentials won't work later |
| Resource not found (404) | ❌ No | Resource genuinely doesn't exist |
| Business logic errors | ❌ No | Not a transient failure |
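The table above can be encoded as a retry predicate. A sketch, assuming an error shape with an optional HTTP `status` and a network-level `code` (your HTTP client's fields may differ):

```typescript
interface RequestError {
  status?: number; // HTTP status, if the server responded
  code?: string;   // e.g. "ETIMEDOUT" for network-level failures
}

function shouldRetry(error: RequestError): boolean {
  // Network timeouts and resets: transient, retry
  if (error.code === "ETIMEDOUT" || error.code === "ECONNRESET") return true;
  // 5xx: server overloaded or restarting, retry
  if (error.status !== undefined && error.status >= 500) return true;
  // 429: wait and retry after the limit resets
  if (error.status === 429) return true;
  // Everything else (4xx, business logic errors): don't retry
  return false;
}
```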

Implementation Considerations

1. Never Retry Non-Idempotent Operations by Default

If an operation isn't idempotent, retrying can cause duplicate side effects:

typescript
// BAD: This could charge the customer twice
await retry(() => paymentService.charge(customerId, amount));

// GOOD: Only retry if the operation is idempotent
await retry(() =>
  paymentService.charge(customerId, amount, {
    idempotencyKey: generateIdempotencyKey(),
  }),
);

2. Set Appropriate Retry Limits

More retries aren't always better:

| Use Case | Max Retries | Reason |
| --- | --- | --- |
| Internal service calls | 3-5 | Balance resilience vs. latency |
| External API calls | 2-3 | Respect external rate limits |
| User-facing requests | 1-2 | Fail fast for better UX |
| Background jobs | 5-10 | Can tolerate longer delays |

3. Always Implement Jitter

Without jitter, retries can synchronize and create new problems:

typescript
// WITHOUT jitter - all clients retry at 100ms, 200ms, 400ms...
// WITH jitter - clients retry at 87ms, 124ms, 156ms... (spread out)

4. Log Retry Attempts

For debugging, track retry behavior:

typescript
logger.info(`Retrying ${operationName}`, {
  attempt: attempt + 1,
  maxRetries,
  delayMs,
  error: error.message,
});

Jitter Strategies Explained

| Strategy | Formula | Characteristics |
| --- | --- | --- |
| Full Jitter | `random(0, base * 2^attempt)` | Best overall choice; maximizes spread, though individual waits vary widely |
| Equal Jitter | `backoff/2 + random(0, backoff/2)`, where `backoff = base * 2^attempt` | Guarantees a minimum wait; good when predictable latency matters |
| Decorrelated Jitter | `min(cap, random(base, prevDelay * 3))` | Adaptive; spreads well over longer periods |

Full Jitter is the default recommendation because it provides the best spread while being simple to implement.
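The three strategies, as commonly defined, can be sketched as delay calculators (`base`, `cap`, and `prevDelay` are milliseconds; function names are illustrative):

```typescript
function fullJitter(base: number, attempt: number): number {
  // Uniform over the whole backoff window
  return Math.random() * base * 2 ** attempt;
}

function equalJitter(base: number, attempt: number): number {
  // Half deterministic, half random: guarantees a minimum wait
  const backoff = base * 2 ** attempt;
  return backoff / 2 + Math.random() * (backoff / 2);
}

function decorrelatedJitter(base: number, prevDelay: number, cap: number): number {
  // Next delay depends on the previous delay, not the attempt number
  return Math.min(cap, base + Math.random() * (prevDelay * 3 - base));
}
```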

Common Pitfalls

  1. Retrying forever: Always set a maximum retry count
  2. Retrying non-idempotent operations: Can cause data corruption
  3. No jitter: Creates thundering herd on retry
  4. No timeout on retries: A slow downstream service can block requests indefinitely
  5. Retrying client errors: Won't work, wastes resources

Production Checklist

  • ✓ Only retry transient failures (5xx, timeouts, 429)
  • ✓ Always use exponential backoff
  • ✓ Always add jitter to retry delays
  • ✓ Set maximum retry attempts
  • ✓ Set overall timeout for all retries combined
  • ✓ Ensure retried operations are idempotent
  • ✓ Log retry attempts with context
  • ✓ Monitor retry rates in metrics

Circuit Breaker

The Problem

When a downstream service becomes unavailable or severely degraded, continuously retrying requests to it can:

  1. Waste resources (CPU, connections, memory) on doomed requests
  2. Increase latency for all callers
  3. Cause cascading failures that bring down your entire system

Imagine a payment gateway that's down. If every order processing request tries to call the gateway, waits for timeout, then retries, your entire order system becomes slow and overloaded. Even orders that don't need the payment gateway are affected because threads are blocked waiting on the failed gateway.

The Solution

The Circuit Breaker pattern wraps calls to a service and monitors failures. When failures exceed a threshold, the breaker "trips" and immediately rejects new calls without even attempting them. This:

  1. Fails fast instead of waiting for timeouts
  2. Reduces load on the failing service
  3. Gives the service time to recover
  4. Prevents cascading failures

How It Works

The circuit breaker has three states:

     ┌──────────────┐
     │    CLOSED    │ ← Normal state: requests pass through.
     └──────┬───────┘   Failures increment a counter; successes reset it.
            │
            │ failureCount ≥ threshold
            ▼
     ┌──────────────┐
     │     OPEN     │ ← Failing state: requests rejected immediately.
     └──────┬───────┘
            │
            │ resetTimeout expires
            ▼
     ┌──────────────┐
     │  HALF-OPEN   │ ← Testing state: a limited number of requests pass through.
     └──────┬───────┘
            ├── success threshold met → CLOSED
            └── any failure → OPEN

When to Use Circuit Breaker

Circuit breakers are essential for:

| Scenario | Why Circuit Breaker? |
| --- | --- |
| External API calls | APIs can go down or throttle |
| Database connections | Prevents connection pool exhaustion |
| Microservice dependencies | Prevents cascading failures |
| Expensive operations | Fails fast instead of waiting |
| Rate-limited services | Respects limits by backing off |

Configuration Guidelines

Failure Threshold

The number of consecutive failures before opening the circuit:

| Traffic Pattern | Recommended Threshold |
| --- | --- |
| Low volume (< 10 req/s) | 3-5 |
| Medium volume (10-100 req/s) | 5-10 |
| High volume (> 100 req/s) | 10-20 |

Too low, and the circuit trips on normal variance. Too high, and many failures occur before protection kicks in.

Reset Timeout

How long to stay open before testing recovery:

| Scenario | Recommended Timeout |
| --- | --- |
| Quick recovery expected (restarts) | 10-30 seconds |
| Normal operations | 30-60 seconds |
| Slow recovery (database issues) | 2-5 minutes |

Half-Open Success Threshold

How many consecutive successes in HALF-OPEN before closing:

| Recommendation | Value |
| --- | --- |
| Default | 3-5 |
| High-stakes operations | 5-10 |
| Low-stakes operations | 1-2 |

This confirms stable recovery before fully closing.

Implementation Best Practices

1. Distinguish Between Failures

Not all errors should increment the failure counter:

typescript
class CircuitBreaker {
  private shouldCountAsFailure(error: Error): boolean {
    // Count these as failures
    if (error instanceof TimeoutError) return true;
    if (error instanceof NetworkError) return true;
    if (error.response?.status >= 500) return true;

    // Don't count these as failures
    if (error.response?.status === 404) return false; // Not found isn't a service failure
    if (error.response?.status === 401) return false; // Auth issue, not service issue

    // Default: count as failure for safety
    return true;
  }
}

2. Provide Fallback Responses

When the circuit is open, don't just throw an error. Provide a graceful fallback:

typescript
try {
  return await circuitBreaker.execute(() => fetchRecommendations());
} catch (error) {
  if (error.message.includes("Circuit breaker is OPEN")) {
    // Return cached or default recommendations
    return await getCachedRecommendations();
  }
  throw error;
}

3. Monitor Circuit State

Expose metrics for observability:

typescript
metrics.gauge("circuit_breaker.state", { service: "payment-gateway" }, state);
metrics
  .counter("circuit_breaker.failures", { service: "payment-gateway" })
  .increment();
metrics
  .counter("circuit_breaker.successes", { service: "payment-gateway" })
  .increment();

Common Pitfalls

  1. Too sensitive: Circuit trips on normal variance
  2. Too slow to recover: Stays open too long, rejecting valid requests
  3. No fallback: Users see errors when circuit is open
  4. No monitoring: Can't tell when circuits are tripping
  5. Wrong error classification: Opens on client errors (4xx) instead of service issues (5xx)
  6. Missing HALF-OPEN: Goes directly to CLOSED, risking immediate re-trip

Circuit Breaker vs Retry

| Aspect | Circuit Breaker | Retry |
| --- | --- | --- |
| Purpose | Stop calling a failing service | Retry individual failed calls |
| Scope | Service-level protection | Request-level resilience |
| State | Stateful (CLOSED, OPEN, HALF-OPEN) | Typically stateless |
| Use together? | ✅ Yes | ✅ Yes |

Best Practice: Use circuit breaker to protect against cascading failures, and retry to handle transient issues. They complement each other.
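The usual composition puts the breaker on the outside, so an OPEN circuit short-circuits the entire retry loop instead of burning retries against a known-dead service. A sketch (the `breaker` and `withRetry` parameters stand in for whatever implementations you use):

```typescript
async function resilientCall<T>(
  breaker: { execute<U>(op: () => Promise<U>): Promise<U> },
  withRetry: <U>(op: () => Promise<U>) => Promise<U>,
  operation: () => Promise<T>,
): Promise<T> {
  // Breaker outside, retry inside: the breaker sees one failure per
  // exhausted retry sequence, and when OPEN no retries are attempted at all.
  return breaker.execute(() => withRetry(operation));
}
```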

Production Checklist

  • ✓ Configure appropriate failure threshold for your traffic volume
  • ✓ Set reset timeout based on expected recovery time
  • ✓ Implement HALF-OPEN state to test recovery
  • ✓ Distinguish service failures from client errors
  • ✓ Provide fallback responses when circuit is open
  • ✓ Expose circuit state and metrics
  • ✓ Log state transitions for debugging
  • ✓ Test circuit trip and recovery behavior
  • ✓ Monitor circuit breaker tripping in production

Rate Limiting

The Problem

Without rate limiting, your service is vulnerable to:

  1. Abuse: Malicious actors can overwhelm your service with requests
  2. Accidental overload: Bugs or misconfigurations can cause excessive traffic
  3. Cost overruns: External API calls, database queries, and third-party services often cost per request
  4. Fairness: A single user consuming all resources impacts others
  5. Downstream protection: Prevent overwhelming dependent services

The Solution

Rate limiting controls the frequency of requests to protect resources, ensure fair usage, and prevent abuse. It enforces limits on how many requests can be made within a time window.

Rate Limiting Scope

| Scope | Description | Use Case |
| --- | --- | --- |
| Global | All requests across the entire service | Overall capacity protection |
| Per User | Limits per authenticated user | Fair usage enforcement |
| Per IP | Limits per IP address | DDoS mitigation, guest users |
| Per API Key | Limits per API key | API tier management |
| Per Endpoint | Limits per specific endpoint | Protect expensive operations |

Algorithm Comparison

| Algorithm | Memory | Precision | Best For |
| --- | --- | --- | --- |
| Fixed Window | O(1) per key | Low (can burst at edges) | Simple limits, low memory |
| Sliding Window | O(n) per key | High | Strict enforcement, fairness |
| Token Bucket | O(1) per key | Medium (smooth) | API rate limiting, burst allowance |
| Leaky Bucket | O(1) per key | Medium (constant) | Traffic shaping, smoothing |

Fixed Window Counter

How it works: Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reset count at window boundary.

typescript
class FixedWindowRateLimiter {
  private counters: Map<string, { count: number; windowStart: number }> =
    new Map();

  isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
    const now = Date.now();
    const current = this.counters.get(key);

    // Start new window if needed
    if (!current || now - current.windowStart >= windowMs) {
      this.counters.set(key, { count: 1, windowStart: now });
      return true;
    }

    // Check limit
    if (current.count < maxRequests) {
      current.count++;
      return true;
    }

    return false;
  }
}

Pros: Simple, constant memory, fast

Cons: Can burst at window boundaries (e.g., 100 requests at :59 and 100 at :01 = 200 in 2 seconds for a 100/minute limit)

Sliding Window Log

How it works: Store timestamp of each request. Remove timestamps outside the window. Count remaining.

typescript
class SlidingWindowRateLimiter {
  private timestamps: Map<string, number[]> = new Map();

  isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
    const now = Date.now();
    const windowStart = now - windowMs;

    // Get and filter timestamps for this key
    let timestamps = this.timestamps.get(key) || [];
    timestamps = timestamps.filter((ts) => ts > windowStart);

    // Check if under limit
    if (timestamps.length < maxRequests) {
      timestamps.push(now);
      this.timestamps.set(key, timestamps);
      return true;
    }

    return false;
  }
}

Pros: Precise, no boundary bursts

Cons: O(n) memory per key, slower with high request volume

Token Bucket

How it works: Imagine a bucket that fills with tokens at a constant rate. Each request consumes a token; if the bucket is empty, the request is rejected.

typescript
class TokenBucketRateLimiter {
  private tokens: Map<string, { count: number; lastRefill: number }> =
    new Map();

  constructor(
    private readonly capacity: number, // Max tokens in bucket
    private readonly refillRate: number, // Tokens per second
    private readonly refillInterval: number = 1000,
  ) {}

  isAllowed(key: string, tokensNeeded: number = 1): boolean {
    const now = Date.now();
    let state = this.tokens.get(key);

    // Initialize new bucket
    if (!state) {
      state = { count: this.capacity, lastRefill: now };
      this.tokens.set(key, state);
    }

    // Refill tokens
    const elapsed = now - state.lastRefill;
    if (elapsed >= this.refillInterval) {
      const tokensToAdd = Math.floor(
        (elapsed / this.refillInterval) * this.refillRate,
      );
      state.count = Math.min(this.capacity, state.count + tokensToAdd);
      state.lastRefill = now;
    }

    // Check if enough tokens
    if (state.count >= tokensNeeded) {
      state.count -= tokensNeeded;
      return true;
    }

    return false;
  }
}

Pros: O(1) memory, allows bursts (up to capacity), smooths traffic

Cons: More complex configuration

Choosing the Right Algorithm

| Scenario | Recommended Algorithm | Why |
| --- | --- | --- |
| Simple API rate limiting | Token Bucket | Smooth traffic, allows brief bursts |
| Strict fairness requirement | Sliding Window | Precise enforcement |
| High-volume, memory-constrained | Fixed Window | Lowest memory overhead |
| Traffic shaping (output) | Leaky Bucket | Constant output rate |
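Leaky Bucket is the one algorithm recommended above without an implementation in the earlier sections. A minimal sketch (class name and fields are illustrative): the queue is modeled as a water level that drains at a constant rate, and a request is admitted only if there is room in the bucket.

```typescript
class LeakyBucketRateLimiter {
  private buckets = new Map<string, { level: number; lastLeak: number }>();

  constructor(
    private readonly capacity: number, // max queued requests
    private readonly leakRate: number, // requests drained per second
  ) {}

  isAllowed(key: string): boolean {
    const now = Date.now();
    let bucket = this.buckets.get(key);
    if (!bucket) {
      bucket = { level: 0, lastLeak: now };
      this.buckets.set(key, bucket);
    }

    // Drain water proportionally to elapsed time
    const elapsedSec = (now - bucket.lastLeak) / 1000;
    bucket.level = Math.max(0, bucket.level - elapsedSec * this.leakRate);
    bucket.lastLeak = now;

    // Admit the request only if there is room in the bucket
    if (bucket.level < this.capacity) {
      bucket.level++;
      return true;
    }
    return false;
  }
}
```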

Rate Limiting Headers

When rate limiting, return informative headers:

| Header | Example | Purpose |
| --- | --- | --- |
| `X-RateLimit-Limit` | `100` | Total requests allowed |
| `X-RateLimit-Remaining` | `97` | Requests left in window |
| `X-RateLimit-Reset` | `1711838400` | Unix timestamp when the window resets |
| `Retry-After` | `60` | Seconds until the client can retry (when limited) |
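A small helper can derive these headers from limiter state (a sketch; the `RateLimitState` shape is an assumption, not a standard API):

```typescript
interface RateLimitState {
  limit: number;
  remaining: number;
  resetAtUnixSec: number; // when the current window resets
}

function rateLimitHeaders(state: RateLimitState): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(state.remaining),
    "X-RateLimit-Reset": String(state.resetAtUnixSec),
  };
  if (state.remaining === 0) {
    // Tell the client how long to wait before trying again
    const retryAfterSec = Math.max(
      0,
      state.resetAtUnixSec - Math.floor(Date.now() / 1000),
    );
    headers["Retry-After"] = String(retryAfterSec);
  }
  return headers;
}
```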

Common Pitfalls

  1. Wrong scope: Using global limits when per-user is needed (or vice versa)
  2. No burst allowance: Rejecting legitimate bursts of activity
  3. Tight limits: Blocking normal user behavior
  4. No informative headers: Users can't tell when they'll be unblocked
  5. In-memory only: Limits reset on restart (use Redis for distributed systems)
  6. Ignoring time zones: calendar-aligned windows (e.g., daily quotas) can shift unexpectedly across time zones
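Pitfall 5 deserves emphasis: in-memory counters reset on restart and aren't shared across instances. A common distributed fix is an atomic Redis `INCR` plus `PEXPIRE` per fixed window, sketched here against a minimal client interface so any Redis library (or an in-memory stub) can be plugged in:

```typescript
interface CounterStore {
  incr(key: string): Promise<number>;          // atomic increment, e.g. Redis INCR
  pexpire(key: string, ttlMs: number): Promise<void>; // e.g. Redis PEXPIRE
}

async function isAllowedDistributed(
  store: CounterStore,
  key: string,
  maxRequests: number,
  windowMs: number,
): Promise<boolean> {
  // Bucket requests into fixed windows, e.g. "rate:user1:28530640"
  const window = Math.floor(Date.now() / windowMs);
  const redisKey = `rate:${key}:${window}`;

  const count = await store.incr(redisKey);
  if (count === 1) {
    // First request in this window: expire the key with the window
    await store.pexpire(redisKey, windowMs);
  }
  return count <= maxRequests;
}
```

Because `INCR` is atomic, concurrent instances never under-count; the expiring key keeps storage bounded.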

Production Checklist

  • ✓ Choose appropriate scope (global, per-user, per-IP)
  • ✓ Select algorithm based on requirements
  • ✓ Use persistent storage (Redis) for distributed systems
  • ✓ Return rate limit headers in responses
  • ✓ Implement graceful degradation (not hard errors)
  • ✓ Log rate limit violations for abuse detection
  • ✓ Monitor rate limit effectiveness
  • ✓ Test limit boundary conditions

Graceful Failure and Fallback

The Problem

When something goes wrong, most systems default to throwing errors and showing users generic error messages. This creates a terrible user experience and can make minor issues feel like complete outages.

In production, partial failures are normal. A single dependency going down shouldn't take down your entire application.

The Solution

Graceful failure means your system continues to provide useful functionality even when some components are unavailable. Instead of all-or-nothing, you provide degraded service.

Fallback means having alternative ways to accomplish a task when the primary method fails.

The Fallback Hierarchy

Primary Operation
├── Success → Return result
└── Failure
    ├── Try cache → Return stale data
    └── Cache miss
        ├── Try secondary service → Return result
        └── Secondary fails
            └── Return safe default

Fallback Strategies

StrategyDescriptionExampleTrade-off
Static FallbackReturn predefined default valueEmpty list, default userSimple but may not be useful
Cached ResponseServe previously cached data5-minute old product dataStale but functional
Feature DisableHide non-critical functionalityDisable recommendationsCore features still work
Queue for LaterAccept request, process asynchronouslyEmail sendingAsync processing complexity
Secondary ServiceUse alternative implementationRead from replica DBAdditional cost/complexity
Partial ResponseReturn what's availableShow loaded products, hide pricesIncomplete but useful

Example: Resilient User Service

typescript
class ResilientUserService implements UserService {
  constructor(
    private primary: UserService, // Main database
    private cache: CacheService, // Redis cache
    private fallback: UserService, // Read replica or backup
    private metrics: MetricsService,
  ) {}

  async getUser(id: string): Promise<User> {
    try {
      // 1. Try primary source
      const user = await this.primary.getUser(id);
      await this.cache.set(`user:${id}`, user, { ttl: 300 }); // Cache for 5 minutes
      return user;
    } catch (primaryError) {
      this.metrics.increment("user_service.primary_failure");

      try {
        // 2. Try cache
        const cached = await this.cache.get<User>(`user:${id}`);
        if (cached) {
          this.metrics.increment("user_service.cache_hit");
          cached.stale = true; // Mark as stale so UI can handle
          return cached;
        }
      } catch (cacheError) {
        this.metrics.increment("user_service.cache_failure");
      }

      try {
        // 3. Try fallback service
        const user = await this.fallback.getUser(id);
        this.metrics.increment("user_service.fallback_success");
        return user;
      } catch (fallbackError) {
        this.metrics.increment("user_service.fallback_failure");

        // 4. Return safe default
        return this.getDefaultUser(id);
      }
    }
  }

  private getDefaultUser(id: string): User {
    return {
      id,
      name: "Guest User",
      isDefault: true,
      isAnonymous: true,
    };
  }
}

Principles of Graceful Degradation

1. Identify Critical vs. Non-Critical Paths

Not all features are equally important:

| Critical (Must Work) | Non-Critical (Can Fail) |
| --- | --- |
| Authentication | Recommendations |
| Core transactions | Search filters |
| Payment processing | Analytics tracking |
| Data persistence | Real-time notifications |
| Security features | Social sharing |

2. Design Fallbacks Early

Don't add fallbacks as an afterthought. Design them into your architecture from the start:

typescript
// Design with fallback in mind
interface ProductCatalog {
  getProducts(): Promise<Product[]>;
  // What if this fails? What's the fallback?
}

3. Make Fallbacks Visible to Users

When serving degraded data, let users know:

typescript
interface User {
  id: string;
  name: string;
  stale?: boolean; // Flag indicates data might be old
  isDefault?: boolean; // Flag indicates this is a default
}

// UI can show: "Showing cached data (may be outdated)"

4. Monitor Fallback Usage

If you're constantly hitting fallbacks, something is wrong:

typescript
if (metrics.get("user_service.cache_hit").rate > 0.5) {
  alert("Primary user service is degraded! 50% of requests served from cache.");
}

Real-World Examples

Netflix's Fallback Strategy

When Netflix's recommendation engine fails, they don't show an error. Instead:

  1. Fall back to popular/trending content
  2. Fall back to user's watch history
  3. Fall back to curated collections
  4. Show a message: "Recommendations temporarily unavailable, here's what's popular"

Amazon's Add to Cart

When the inventory service is unavailable:

  1. Still allow adding to cart
  2. Show "in stock" with caveat
  3. Validate inventory at checkout
  4. If out of stock then, notify user

Slack's Message Sending

When the primary message store is down:

  1. Store message locally
  2. Show "pending" indicator
  3. Retry in background
  4. Sync when connection restored

Common Pitfalls

  1. Silent failures: Users don't know something is degraded
  2. Cascading fallbacks: Fallback depends on the same failing system
  3. Unmarked stale data: Not indicating when served data is old
  4. No fallback: All-or-nothing approach
  5. Complex fallback chains: Too many levels make debugging hard
  6. Ignoring errors: Not logging or monitoring fallback usage

Production Checklist

  • ✓ Identify critical vs. non-critical features
  • ✓ Design fallback paths for all critical operations
  • ✓ Use caching as a primary fallback strategy
  • ✓ Implement safe defaults for all user-facing data
  • ✓ Mark degraded responses (stale, default, etc.)
  • ✓ Monitor fallback usage rates
  • ✓ Alert when fallback usage is high
  • ✓ Test fallback paths regularly

Idempotency

The Problem

In distributed systems, networks are unreliable. A request might:

  • Timeout before reaching the server
  • Reach the server but timeout waiting for response
  • Get a 5xx error from the server
  • Succeed but the response gets lost

In all these cases, the client naturally wants to retry the request. But retrying non-idempotent operations causes problems:

Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (success)

Result: Customer charged $300 instead of $100!

The Solution

An operation is idempotent if performing it multiple times produces the same result as performing it once.

| Operation | Idempotent? | Reason |
| --- | --- | --- |
| GET /users/1 | ✅ Yes | Reading doesn't change state |
| PUT /users/1 | ✅ Yes | Same payload = same final state |
| DELETE /users/1 | ✅ Yes | Once deleted, stays deleted |
| POST /orders | ❌ No | Creates a new order each time |
| POST /payments | ❌ No | Charges each time |

For non-idempotent operations, use an idempotency key to make them safe to retry.

Idempotency Key Pattern

The idempotency key is a unique identifier provided by the client. The server:

  1. Checks if this key has been processed before
  2. If yes, returns the stored result
  3. If no, processes the request and stores the result with the key

Request 1: POST /payments
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Process payment, store result for abc123, return result

Request 2: POST /payments (retry)
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Return stored result for abc123 (no new charge)

Request 3: POST /payments (different request)
Headers: Idempotency-Key: xyz789
Body: { amount: 50 }
→ Process payment, store result for xyz789, return result

Idempotency Key Design

1. Key Generation

typescript
// Option 1: Client-generated UUID (most common)
const idempotencyKey = crypto.randomUUID(); // "550e8400-e29b-41d4-a716-446655440000"

// Option 2: Deterministic based on request details
function generateDeterministicKey(
  userId: string,
  operation: string,
  params: object,
): string {
  const hash = createHash("sha256"); // requires: import { createHash } from "crypto"
  hash.update(userId);
  hash.update(operation);
  hash.update(JSON.stringify(params));
  return hash.digest("hex"); // "a3d5e9f2..."
}

// Option 3: Combination
const idempotencyKey = `${userId}:${operation}:${crypto.randomUUID()}`;

Recommendation: Use client-generated UUIDs for maximum flexibility. Use deterministic keys only if you need deduplication across clients.

2. Key Scope

Idempotency keys should be scoped to prevent collisions:

ScopeExampleWhen to Use
Globalabc123Simple systems, single operation type
Per-useruser123:abc123Multi-tenant systems
Per-operationpayment:abc123Multiple operation types
Combineduser123:payment:abc123Complex systems

3. Key Expiration

Don't store idempotency keys forever:

Use CaseRecommended TTL
Payments24-48 hours (dispute window)
Orders7 days (typical order lifecycle)
Email sending1 hour (retry window)
General purpose24 hours

Implementation

Simple In-Memory Version

typescript
class IdempotencyService {
  private results = new Map<string, { result: any; timestamp: number }>();

  async execute<T>(
    key: string,
    operation: () => Promise<T>,
    ttlMs: number = 86400000, // 24 hours
  ): Promise<T> {
    const existing = this.results.get(key);

    // Return cached result if exists and not expired
    if (existing) {
      if (Date.now() - existing.timestamp < ttlMs) {
        console.log(`Returning cached result for idempotency key: ${key}`);
        return existing.result;
      } else {
        // Expired, remove it
        this.results.delete(key);
      }
    }

    // Execute operation
    const result = await operation();

    // Store result
    this.results.set(key, { result, timestamp: Date.now() });

    return result;
  }
}

Database-Persisted Version

typescript
// Schema for idempotency storage
interface IdempotencyRecord {
  key: string; // Primary key
  result: string; // JSON serialized result
  createdAt: Date;
  expiresAt: Date;
  requestParams?: string; // For debugging/verification
}

async function idempotentOperation<T>(
  db: Database,
  key: string,
  operation: () => Promise<T>,
  options: {
    ttlMs?: number;
    verifyParams?: object; // Optionally verify params match
  } = {},
): Promise<T> {
  const { ttlMs = 86400000, verifyParams } = options;
  const now = new Date();

  // Check for existing record
  const existing = await db.idempotencyKeys.findUnique({
    where: { key },
  });

  if (existing) {
    // Verify not expired
    if (existing.expiresAt > now) {
      // Optionally verify params match
      if (verifyParams && existing.requestParams) {
        const storedParams = JSON.parse(existing.requestParams);
        // deepEqual: any structural-equality helper (e.g. Node's util.isDeepStrictEqual)
        if (!deepEqual(storedParams, verifyParams)) {
          throw new Error("Idempotency key reused with different parameters");
        }
      }

      console.log(`Returning cached result for idempotency key: ${key}`);
      return JSON.parse(existing.result);
    } else {
      // Delete expired record
      await db.idempotencyKeys.delete({ where: { key } });
    }
  }

  // Execute operation
  const result = await operation();

  // Store result with expiration
  await db.idempotencyKeys.create({
    data: {
      key,
      result: JSON.stringify(result),
      requestParams: verifyParams ? JSON.stringify(verifyParams) : null,
      createdAt: now,
      expiresAt: new Date(now.getTime() + ttlMs),
    },
  });

  return result;
}

// Usage
const payment = await idempotentOperation(
  db,
  idempotencyKey,
  () => paymentProcessor.charge(amount, customerId),
  {
    ttlMs: 172800000, // 48 hours for payments
    verifyParams: { amount, customerId },
  },
);

Best Practices

1. Always Return the Same Response

When returning a cached idempotency result, return the exact same response including:

  • Same status code
  • Same headers
  • Same body
  • Same timestamps

typescript
// Store the full response
interface StoredResponse {
  status: number;
  headers: Record<string, string>;
  body: any;
  timestamp: Date;
}

2. Handle Concurrent Requests

Two requests with the same idempotency key might arrive simultaneously:

typescript
async function executeIdempotently<T>(key: string, operation: () => Promise<T>): Promise<T> {
  // Use a database transaction plus a unique constraint on `key` to prevent race conditions
  return await db.transaction(async (tx) => {
    const existing = await tx.idempotencyKeys.findUnique({ where: { key } });

    if (existing) {
      return JSON.parse(existing.result);
    }

    const result = await operation();

    await tx.idempotencyKeys.create({
      data: { key, result: JSON.stringify(result), expiresAt: ... }
    });

    return result;
  });
}

3. Idempotency Key in Response

Always include the idempotency key in the response:

http
// Request
POST /payments
Idempotency-Key: abc123

// Response
201 Created
X-Idempotency-Key: abc123

This allows clients to verify their request was processed.

4. HTTP Methods and Idempotency

| HTTP Method | Idempotent by Default? | Should Use Idempotency Key? |
| --- | --- | --- |
| GET | ✅ Yes | No |
| HEAD | ✅ Yes | No |
| OPTIONS | ✅ Yes | No |
| PUT | ✅ Yes | Optional (for idempotency checks) |
| DELETE | ✅ Yes | Optional (for idempotency checks) |
| POST | ❌ No | Yes |
| PATCH | ❌ No | Yes |

Common Pitfalls

  1. Not persisting keys: Lost on restart, breaks idempotency
  2. No expiration: Storage grows indefinitely
  3. Wrong scope: Keys collide between users or operations
  4. Changing stored results: Must return exact same response
  5. Ignoring concurrent requests: Race conditions cause duplicate processing
  6. Not verifying parameters: Reusing key with different params causes confusion

Production Checklist

  • ✓ Use idempotency keys for all POST/PATCH operations
  • ✓ Generate keys client-side (UUID recommended)
  • ✓ Persist keys in database (not just memory)
  • ✓ Set appropriate TTL for your use case
  • ✓ Return exact same response on idempotency hit
  • ✓ Handle concurrent requests with transactions
  • ✓ Return idempotency key in response headers
  • ✓ Verify parameters match on key reuse (optional but recommended)
  • ✓ Monitor idempotency key usage and hit rates
  • ✓ Document idempotency behavior for API consumers

Putting It All Together

The Defense in Depth Approach

These patterns work best when used together as a layered defense:

┌─────────────────────────────────────────────────────────┐
│                      Request Flow                       │
└─────────────────────────────────────────────────────────┘

              ┌────────────────────────┐
              │   1. Rate Limiting     │  ← Protect your resources
              │   (Token Bucket)       │     from being overwhelmed
              └──────────┬─────────────┘
                         │ Allowed
                         ▼
              ┌────────────────────────┐
              │  2. Idempotency Check  │  ← Prevent duplicate
              │   (Database Lookup)    │     processing
              └──────────┬─────────────┘
                         │ New request
                         ▼
              ┌────────────────────────┐
              │   3. Circuit Breaker   │  ← Stop calling failing
              │   (State Check)        │     downstream services
              └──────────┬─────────────┘
                         │ Closed / Half-Open
                         ▼
              ┌────────────────────────┐
              │    4. Retry Logic      │  ← Handle transient
              │   (Backoff + Jitter)   │     failures
              └──────────┬─────────────┘
                         │ Success or exhausted
                         ▼
              ┌────────────────────────┐
              │  5. Primary Operation  │  ← Actual business logic
              └──────────┬─────────────┘
                         ├── Success → Cache result
                         └── Failure → Fallback strategies
Example: Resilient Order Service

```typescript
class ResilientOrderService {
  private circuitBreaker: CircuitBreaker;
  private rateLimiter: TokenBucketRateLimiter;
  private idempotencyService: IdempotencyService;

  constructor(
    private database: Database,
    private paymentGateway: PaymentGateway,
    private inventoryService: InventoryService,
    private messageQueue: MessageQueue,
    private cacheService: CacheService,
    private fallbackService: OrderService,
  ) {
    // Circuit breaker for the payment gateway (can be flaky)
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: 5,
      resetTimeoutMs: 30000, // 30 seconds
      halfOpenSuccessThreshold: 3,
    });

    // Rate limit: 100 orders per minute per user
    this.rateLimiter = new TokenBucketRateLimiter({
      capacity: 100,
      refillRate: 100 / 60, // ~1.67 tokens per second
      refillInterval: 1000,
    });

    // Idempotency: 24-hour TTL for orders
    this.idempotencyService = new IdempotencyService({ ttlMs: 86400000 });
  }

  async createOrder(request: CreateOrderRequest): Promise<Order> {
    const { idempotencyKey, userId } = request;

    // 1. Rate limiting (per user)
    if (!this.rateLimiter.isAllowed(userId, 1)) {
      throw new RateLimitError({
        limit: 100,
        window: "1 minute",
        retryAfter: 60,
      });
    }

    // 2. Idempotency check
    return await this.idempotencyService.execute(
      idempotencyKey,
      async () => await this.processOrder(request),
      { ttlMs: 86400000, verifyParams: request },
    );
  }

  private async processOrder(request: CreateOrderRequest): Promise<Order> {
    const { userId, items, paymentMethod } = request;

    try {
      // 3. Reserve inventory (with fallback)
      const inventory = await this.withFallback(
        () => this.inventoryService.reserve(items),
        async () => {
          // Fallback: check inventory optimistically from cache
          const available = await this.cacheService.getInventory();
          return this.inventoryService.reserveOptimistically(items, available);
        },
        () => ({ reserved: true, optimistic: true }),
      );

      if (!inventory.reserved) {
        throw new OutOfStockError();
      }

      // 4. Process payment (with circuit breaker + retry)
      const payment = await this.circuitBreaker.execute(async () => {
        return await retryWithBackoffAndJitter(
          () =>
            this.paymentGateway.charge({
              amount: this.calculateTotal(items),
              method: paymentMethod,
              userId,
            }),
          { maxRetries: 3, baseDelayMs: 100 },
        );
      });

      // 5. Create the order in the database
      const order = await this.database.orders.create({
        userId,
        items,
        paymentId: payment.id,
        status: "CONFIRMED",
        createdAt: new Date(),
      });

      // 6. Cache for future reads
      await this.cacheService.set(`order:${order.id}`, order, { ttl: 300 });

      // 7. Publish order event (async, non-blocking)
      this.messageQueue
        .publish("orders.created", { orderId: order.id })
        .catch((err) => {
          console.error("Failed to publish order event:", err);
        });

      return order;
    } catch (error) {
      // Handle different error types
      if (error instanceof CircuitBreakerOpenError) {
        // Payment gateway is down: queue for later
        await this.messageQueue.publish("orders.pending", { request });
        return this.createPendingOrder(request, "PAYMENT_UNAVAILABLE");
      }

      if (error instanceof PaymentError) {
        // Payment was rejected: retrying won't help
        throw new OrderCreationError("Payment failed", { code: error.code });
      }

      if (error instanceof DatabaseError) {
        // Database issue: try the fallback service
        try {
          return await this.fallbackService.createOrder(request);
        } catch (fallbackError) {
          // Last resort: queue for later processing
          await this.messageQueue.publish("orders.pending", { request });
          return this.createPendingOrder(request, "SYSTEM_UNAVAILABLE");
        }
      }

      throw error;
    }
  }

  private async withFallback<T>(
    primary: () => Promise<T>,
    fallback: () => Promise<T>,
    defaultValue: () => T,
  ): Promise<T> {
    try {
      return await primary();
    } catch (primaryError) {
      console.error("Primary failed, trying fallback:", primaryError);
      try {
        return await fallback();
      } catch (fallbackError) {
        console.error("Fallback failed, using default:", fallbackError);
        return defaultValue();
      }
    }
  }

  private calculateTotal(items: OrderItem[]): number {
    // Assumes each item carries a unit price and a quantity
    return items.reduce((sum, item) => sum + item.price * item.quantity, 0);
  }

  private createPendingOrder(
    request: CreateOrderRequest,
    reason: string,
  ): Order {
    return {
      id: generateId(),
      userId: request.userId,
      items: request.items,
      status: "PENDING",
      pendingReason: reason,
      createdAt: new Date(),
    };
  }
}
```
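The service above relies on retryWithBackoffAndJitter without showing it. A minimal sketch of what that helper might look like, using the full-jitter approach described in the retry section (the function signature and option names here are assumptions chosen to match the call site, not a fixed API):

```typescript
interface RetryOptions {
  maxRetries: number;   // retries after the initial attempt
  baseDelayMs: number;  // first backoff interval
  maxDelayMs?: number;  // cap on the exponential growth
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoffAndJitter<T>(
  fn: () => Promise<T>,
  { maxRetries, baseDelayMs, maxDelayMs = 10_000 }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Exponential backoff: baseDelayMs, 2x, 4x, ... capped at maxDelayMs
      const ceiling = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      // Full jitter: a random delay in [0, ceiling) de-synchronizes clients
      await sleep(Math.random() * ceiling);
    }
  }
  throw lastError;
}
```

In a production version you would also classify errors before retrying (per the table in the retry section, retry 5xx and timeouts but not 4xx) and surface the attempt count to your metrics.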

Configuration Summary

| Pattern | Key Parameters | Typical Values |
| --- | --- | --- |
| Retry | `maxRetries`, `baseDelayMs` | 3-5 retries, 100-500ms base delay |
| Circuit Breaker | `failureThreshold`, `resetTimeoutMs` | 5-10 failures, 30-60s reset timeout |
| Rate Limiting | `capacity`, `refillRate` | 100-1000 requests; window varies |
| Idempotency | `ttlMs` | 1-48 hours, depending on use case |

Monitoring Checklist

For a resilient system, monitor:

  • ✓ Retry rate and success rate
  • ✓ Circuit breaker state transitions
  • ✓ Rate limit violations
  • ✓ Idempotency key hit rate
  • ✓ Fallback usage frequency
  • ✓ End-to-end latency
  • ✓ Error rates by type
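None of these signals requires a full metrics stack to get started; even a small in-process counter map surfaces trends until you wire up a real metrics system. A hypothetical sketch (the class and label format are illustrative, not any specific library's API):

```typescript
// Counters keyed by metric name plus flattened labels,
// e.g. "circuit_breaker.transition{from=closed,to=open}".
class ReliabilityMetrics {
  private counters = new Map<string, number>();

  increment(name: string, labels: Record<string, string> = {}): void {
    const labelStr = Object.entries(labels)
      .map(([k, v]) => `${k}=${v}`)
      .sort()
      .join(",");
    const key = labelStr ? `${name}{${labelStr}}` : name;
    this.counters.set(key, (this.counters.get(key) ?? 0) + 1);
  }

  get(key: string): number {
    return this.counters.get(key) ?? 0;
  }

  // Dump all counters, e.g. to log on an interval or expose on an endpoint
  snapshot(): Record<string, number> {
    return Object.fromEntries(this.counters);
  }
}

// Recording the checklist signals at the relevant points in the code:
const metrics = new ReliabilityMetrics();
metrics.increment("retry.attempt", { outcome: "success" });
metrics.increment("circuit_breaker.transition", { from: "closed", to: "open" });
metrics.increment("rate_limit.rejected");
metrics.increment("fallback.used", { service: "inventory" });
```

Counting retries by outcome makes the retry success rate a simple ratio of two counters, and tracking `fallback.used` per service shows how often degraded paths carry real traffic.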

Key Takeaways

  1. Failures are inevitable—design your system to handle them gracefully from the start
  2. Retry with backoff + jitter for transient failures, but never blindly retry non-idempotent operations
  3. Circuit breakers prevent cascading failures by stopping calls to unhealthy services
  4. Rate limiting protects resources; choose the algorithm based on your precision vs. memory trade-off
  5. Graceful fallback means degraded service is better than no service—always have a plan B
  6. Idempotency keys make non-idempotent operations safe to retry, essential for distributed systems
  7. These patterns work together as layers of defense, not alternatives to each other
  8. Monitor everything—you can't improve what you don't measure
  9. Test failure paths—chaos engineering and failure injection reveal weak points
  10. Document your reliability strategy—your team needs to understand why these patterns exist

Remember: A system that fails gracefully is more reliable than one that never fails but goes down hard when it does.

Released under the MIT License.