Reliability Patterns
Building resilient systems requires implementing patterns that handle failures gracefully. In distributed systems, failures are inevitable—networks go down, services get overloaded, databases become slow, and bugs slip through. The question isn't whether failures will happen, but how gracefully your system handles them.
This guide covers the essential reliability patterns every backend engineer should know to build systems that stay operational when things go wrong.
Retry with Exponential Backoff
The Problem
In distributed systems, failures are often transient—they resolve themselves given time. A network packet might be dropped, a database connection might be momentarily exhausted, or a dependent service might be restarting. If you simply retry immediately when a failure occurs, you can actually make the problem worse.
Consider this scenario: 1,000 clients simultaneously call a service that's overloaded. The service fails, and all 1,000 clients retry immediately. Now the service faces 2,000 requests and fails again. This creates a thundering herd that can bring down the entire system.
The Solution
Exponential backoff increases the wait time between retries exponentially (100ms, 200ms, 400ms, 800ms...). This gives the failing service time to recover while reducing load during the retry window.
Jitter adds randomness to the delay, ensuring that retries are spread out over time rather than happening in synchronized waves. Without jitter, all clients with the same retry configuration would retry at nearly the same time, recreating the thundering herd problem.
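A minimal sketch of backoff plus full jitter, assuming a generic async operation (the name `retryWithBackoffAndJitter` and its option names are illustrative, not a specific library API):

```typescript
// Sketch: retry with exponential backoff and full jitter.
// Names and defaults are illustrative assumptions.
interface RetryOptions {
  maxRetries?: number;  // retry attempts after the first try
  baseDelayMs?: number; // base for the exponential schedule
  maxDelayMs?: number;  // cap so delays don't grow unbounded
}

async function retryWithBackoffAndJitter<T>(
  operation: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 100, maxDelayMs = 10_000 }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // retries exhausted
      // Exponential backoff: base * 2^attempt, capped at maxDelayMs
      const exponential = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: pick uniformly in [0, exponential)
      const delayMs = Math.random() * exponential;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The same shape works for any async call; in practice you would also gate the retry on whether the error is transient.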
When to Use Retry
Retry is appropriate for:
| Scenario | Should Retry? | Reason |
|---|---|---|
| Network timeouts | ✅ Yes | Transient connection issues |
| 5xx server errors | ✅ Yes | Server is overloaded or restarting |
| Rate limit exceeded (429) | ✅ Yes | Wait and retry after limit resets |
| 4xx client errors | ❌ No | Client error won't fix on retry |
| Authentication failures (401) | ❌ No | Invalid credentials won't work later |
| Resource not found (404) | ❌ No | Resource genuinely doesn't exist |
| Business logic errors | ❌ No | Not a transient failure |
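The table above can be encoded as a small classifier. The error shape used here (`status`, `isTimeout`) is an assumed normalization of whatever your HTTP client actually throws:

```typescript
// Sketch: decide whether a failed request is worth retrying,
// following the table above. The RequestError shape is assumed.
interface RequestError {
  status?: number;     // HTTP status, if a response was received
  isTimeout?: boolean; // network/connection timeout
}

function isRetryable(error: RequestError): boolean {
  if (error.isTimeout) return true;      // transient connection issue
  if (error.status === undefined) return false; // no status: treat as non-retryable here
  if (error.status === 429) return true; // rate limited: wait and retry
  if (error.status >= 500) return true;  // server overloaded or restarting
  return false;                          // 4xx and business errors: retrying won't help
}
```

Treating a missing status as non-retryable is a conservative choice; some clients distinguish connection-level failures and retry those too.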
Implementation Considerations
1. Never Retry Non-Idempotent Operations by Default
If an operation isn't idempotent, retrying can cause duplicate side effects:
// BAD: This could charge the customer twice
await retry(() => paymentService.charge(customerId, amount));
// GOOD: Only retry if the operation is idempotent
await retry(() =>
paymentService.charge(customerId, amount, {
idempotencyKey: generateIdempotencyKey(),
}),
);

2. Set Appropriate Retry Limits
More retries aren't always better:
| Use Case | Max Retries | Reason |
|---|---|---|
| Internal service calls | 3-5 | Balance resilience vs latency |
| External API calls | 2-3 | Respect external rate limits |
| User-facing requests | 1-2 | Fast fail for better UX |
| Background jobs | 5-10 | Can tolerate longer delays |
3. Always Implement Jitter
Without jitter, retries can synchronize and create new problems:
// WITHOUT jitter - all clients retry at 100ms, 200ms, 400ms...
// WITH jitter - clients retry at 87ms, 124ms, 156ms... (spread out)

4. Log Retry Attempts
For debugging, track retry behavior:
logger.info(`Retrying ${operationName}`, {
attempt: attempt + 1,
maxRetries,
delayMs,
error: error.message,
});

Jitter Strategies Explained
| Strategy | Formula | Characteristics |
|---|---|---|
| Full Jitter | random(0, base * 2^attempt) | Best overall choice; maximizes spread, though individual delays can be near zero |
| Equal Jitter | (base * 2^attempt) / 2 + random(0, (base * 2^attempt) / 2) | Guarantees a minimum wait; good when some delay is always wanted |
| Decorrelated Jitter | min(cap, random(base, prevDelay * 3)) | Adaptive; each delay feeds into the next, spreading retries over longer periods |
Full Jitter is the default recommendation because it provides the best spread while being simple to implement.
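The three strategies can be sketched as plain functions; parameter names (`base`, `attempt`, `prevDelay`, `cap`) are illustrative, and delays are in milliseconds:

```typescript
// Sketches of the three jitter strategies described above.
function fullJitter(base: number, attempt: number, cap = 30_000): number {
  const exponential = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exponential; // uniform in [0, base * 2^attempt)
}

function equalJitter(base: number, attempt: number, cap = 30_000): number {
  const exponential = Math.min(cap, base * 2 ** attempt);
  return exponential / 2 + Math.random() * (exponential / 2); // half fixed, half random
}

// Decorrelated jitter depends on the previous delay, not the attempt number.
// Seed prevDelay with base on the first retry.
function decorrelatedJitter(base: number, prevDelay: number, cap = 30_000): number {
  return Math.min(cap, base + Math.random() * (prevDelay * 3 - base));
}
```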
Common Pitfalls
- Retrying forever: Always set a maximum retry count
- Retrying non-idempotent operations: Can cause data corruption
- No jitter: Creates thundering herd on retry
- No timeout on retries: A slow downstream service can block requests indefinitely
- Retrying client errors: Won't work, wastes resources
Production Checklist
- ✓ Only retry transient failures (5xx, timeouts, 429)
- ✓ Always use exponential backoff
- ✓ Always add jitter to retry delays
- ✓ Set maximum retry attempts
- ✓ Set overall timeout for all retries combined
- ✓ Ensure retried operations are idempotent
- ✓ Log retry attempts with context
- ✓ Monitor retry rates in metrics
Circuit Breaker
The Problem
When a downstream service becomes unavailable or severely degraded, continuously retrying requests to it can:
- Waste resources (CPU, connections, memory) on doomed requests
- Increase latency for all callers
- Cause cascading failures that bring down your entire system
Imagine a payment gateway that's down. If every order processing request tries to call the gateway, waits for timeout, then retries, your entire order system becomes slow and overloaded. Even orders that don't need the payment gateway are affected because threads are blocked waiting on the failed gateway.
The Solution
The Circuit Breaker pattern wraps calls to a service and monitors failures. When failures exceed a threshold, the breaker "trips" and immediately rejects new calls without even attempting them. This:
- Fails fast instead of waiting for timeouts
- Reduces load on the failing service
- Gives the service time to recover
- Prevents cascading failures
How It Works
The circuit breaker has three states:
┌──────────────┐
│ CLOSED │ ← Normal state: requests pass through
└──────┬───────┘ Failures increment counter
│ Successes reset counter
│ When failureCount ≥ threshold
▼
┌──────────────┐
│ OPEN │ ← Failing state: requests rejected immediately
└──────┬───────┘ After resetTimeout, transition to HALF_OPEN
│
│ timeout expires
▼
┌──────────────┐
│ HALF-OPEN │ ← Testing state: allow some requests through
└──────┬───────┘ Success → CLOSED, Failure → OPEN
│
├── success threshold met → CLOSED
└── any failure → OPEN

When to Use Circuit Breaker
Circuit breakers are essential for:
| Scenario | Why Circuit Breaker? |
|---|---|
| External API calls | APIs can go down or throttle |
| Database connections | Prevents connection pool exhaustion |
| Microservice dependencies | Prevents cascading failures |
| Expensive operations | Fails fast instead of waiting |
| Rate-limited services | Respects limits by backing off |
Configuration Guidelines
Failure Threshold
The number of consecutive failures before opening the circuit:
| Traffic Pattern | Recommended Threshold |
|---|---|
| Low volume (< 10 req/s) | 3-5 |
| Medium volume (10-100 req/s) | 5-10 |
| High volume (> 100 req/s) | 10-20 |
Too low: Circuit trips too easily on normal variance.
Too high: Too many failures occur before protection kicks in.
Reset Timeout
How long to stay open before testing recovery:
| Scenario | Recommended Timeout |
|---|---|
| Quick recovery expected (restarts) | 10-30 seconds |
| Normal operations | 30-60 seconds |
| Slow recovery (database issues) | 2-5 minutes |
Half-Open Success Threshold
How many consecutive successes in HALF-OPEN before closing:
| Recommendation | Value |
|---|---|
| Default | 3-5 |
| High-stakes operations | 5-10 |
| Low-stakes operations | 1-2 |
This confirms stable recovery before fully closing.
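A minimal sketch tying the three states and thresholds together (a sketch only — production breakers also need call timeouts, failure-rate windows, and metrics):

```typescript
// Sketch of the CLOSED → OPEN → HALF_OPEN state machine described above.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 30_000,
    private halfOpenSuccessThreshold = 3,
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("Circuit breaker is OPEN"); // fail fast, no call made
      }
      this.state = "HALF_OPEN"; // reset timeout expired: probe recovery
      this.successes = 0;
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === "HALF_OPEN") {
      if (++this.successes >= this.halfOpenSuccessThreshold) {
        this.state = "CLOSED"; // stable recovery confirmed
        this.failures = 0;
      }
    } else {
      this.failures = 0; // successes reset the counter in CLOSED
    }
  }

  private onFailure(): void {
    // Any failure in HALF_OPEN, or too many in CLOSED, opens the circuit
    if (this.state === "HALF_OPEN" || ++this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }
}
```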
Implementation Best Practices
1. Distinguish Between Failures
Not all errors should increment the failure counter:
class CircuitBreaker {
private shouldCountAsFailure(error: Error): boolean {
// Count these as failures
if (error instanceof TimeoutError) return true;
if (error instanceof NetworkError) return true;
if (error.response?.status >= 500) return true;
// Don't count these as failures
if (error.response?.status === 404) return false; // Not found isn't a service failure
if (error.response?.status === 401) return false; // Auth issue, not service issue
// Default: count as failure for safety
return true;
}
}

2. Provide Fallback Responses
When the circuit is open, don't just throw an error. Provide a graceful fallback:
try {
return await circuitBreaker.execute(() => fetchRecommendations());
} catch (error) {
if (error.message.includes("Circuit breaker is OPEN")) {
// Return cached or default recommendations
return await getCachedRecommendations();
}
throw error;
}

3. Monitor Circuit State
Expose metrics for observability:
metrics.gauge("circuit_breaker.state", { service: "payment-gateway" }, state);
metrics
.counter("circuit_breaker.failures", { service: "payment-gateway" })
.increment();
metrics
.counter("circuit_breaker.successes", { service: "payment-gateway" })
.increment();

Common Pitfalls
- Too sensitive: Circuit trips on normal variance
- Too slow to recover: Stays open too long, rejecting valid requests
- No fallback: Users see errors when circuit is open
- No monitoring: Can't tell when circuits are tripping
- Wrong error classification: Opens on client errors (4xx) instead of service issues (5xx)
- Missing HALF-OPEN: Goes directly to CLOSED, risking immediate re-trip
Circuit Breaker vs Retry
| Aspect | Circuit Breaker | Retry |
|---|---|---|
| Purpose | Stop calling failing service | Retry individual failed calls |
| Scope | Service-level protection | Request-level resilience |
| State | Has state (CLOSED, OPEN, HALF-OPEN) | Stateless (typically) |
| Use Together? | ✅ Yes | ✅ Yes |
Best Practice: Use circuit breaker to protect against cascading failures, and retry to handle transient issues. They complement each other.
Production Checklist
- ✓ Configure appropriate failure threshold for your traffic volume
- ✓ Set reset timeout based on expected recovery time
- ✓ Implement HALF-OPEN state to test recovery
- ✓ Distinguish service failures from client errors
- ✓ Provide fallback responses when circuit is open
- ✓ Expose circuit state and metrics
- ✓ Log state transitions for debugging
- ✓ Test circuit trip and recovery behavior
- ✓ Monitor circuit breaker tripping in production
Rate Limiting
The Problem
Without rate limiting, your service is vulnerable to:
- Abuse: Malicious actors can overwhelm your service with requests
- Accidental overload: Bugs or misconfigurations can cause excessive traffic
- Cost overruns: External API calls, database queries, and third-party services often cost per request
- Fairness: A single user consuming all resources impacts others
- Downstream protection: Prevent overwhelming dependent services
The Solution
Rate limiting controls the frequency of requests to protect resources, ensure fair usage, and prevent abuse. It enforces limits on how many requests can be made within a time window.
Rate Limiting Scope
| Scope | Description | Use Case |
|---|---|---|
| Global | All requests across the entire service | Overall capacity protection |
| Per User | Limits per authenticated user | Fair usage enforcement |
| Per IP | Limits per IP address | DDoS mitigation, guest users |
| Per API Key | Limits per API key | API tier management |
| Per Endpoint | Limits per specific endpoint | Protect expensive operations |
Algorithm Comparison
| Algorithm | Memory | Precision | Best For |
|---|---|---|---|
| Fixed Window | O(1) per key | Low (can burst at edges) | Simple limits, low memory |
| Sliding Window | O(n) per key | High | Strict enforcement, fairness |
| Token Bucket | O(1) per key | Medium (smooth) | API rate limiting, burst allowance |
| Leaky Bucket | O(1) per key | Medium (constant) | Traffic shaping, smoothing |
Fixed Window Counter
How it works: Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reset count at window boundary.
class FixedWindowRateLimiter {
private counters: Map<string, { count: number; windowStart: number }> =
new Map();
isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
const now = Date.now();
const current = this.counters.get(key);
// Start new window if needed
if (!current || now - current.windowStart >= windowMs) {
this.counters.set(key, { count: 1, windowStart: now });
return true;
}
// Check limit
if (current.count < maxRequests) {
current.count++;
return true;
}
return false;
}
}

Pros: Simple, constant memory, fast
Cons: Can burst at window boundaries (e.g., 100 requests at :59 and 100 at :01 = 200 in 2 seconds for a 100/minute limit)
Sliding Window Log
How it works: Store timestamp of each request. Remove timestamps outside the window. Count remaining.
class SlidingWindowRateLimiter {
private timestamps: Map<string, number[]> = new Map();
isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
const now = Date.now();
const windowStart = now - windowMs;
// Get and filter timestamps for this key
let timestamps = this.timestamps.get(key) || [];
timestamps = timestamps.filter((ts) => ts > windowStart);
// Check if under limit
if (timestamps.length < maxRequests) {
timestamps.push(now);
this.timestamps.set(key, timestamps);
return true;
}
return false;
}
}

Pros: Precise, no boundary bursts
Cons: O(n) memory per key, slower with high request volume
Token Bucket
How it works: Imagine a bucket that fills with tokens at a constant rate. Each request consumes a token. If bucket is empty, request is rejected.
class TokenBucketRateLimiter {
private tokens: Map<string, { count: number; lastRefill: number }> =
new Map();
constructor(
private readonly capacity: number, // Max tokens in bucket
private readonly refillRate: number, // Tokens per second
private readonly refillInterval: number = 1000,
) {}
isAllowed(key: string, tokensNeeded: number = 1): boolean {
const now = Date.now();
let state = this.tokens.get(key);
// Initialize new bucket
if (!state) {
state = { count: this.capacity, lastRefill: now };
this.tokens.set(key, state);
}
// Refill tokens
const elapsed = now - state.lastRefill;
if (elapsed >= this.refillInterval) {
const tokensToAdd = Math.floor(
(elapsed / this.refillInterval) * this.refillRate,
);
state.count = Math.min(this.capacity, state.count + tokensToAdd);
state.lastRefill = now;
}
// Check if enough tokens
if (state.count >= tokensNeeded) {
state.count -= tokensNeeded;
return true;
}
return false;
}
}

Pros: O(1) memory, allows bursts (up to capacity), smooths traffic
Cons: More complex configuration
Choosing the Right Algorithm
| Scenario | Recommended Algorithm | Why |
|---|---|---|
| Simple API rate limiting | Token Bucket | Smooth traffic, allows brief bursts |
| Strict fairness requirement | Sliding Window | Precise enforcement |
| High-volume, memory-constrained | Fixed Window | Lowest memory overhead |
| Traffic shaping (output) | Leaky Bucket | Constant output rate |
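Leaky Bucket appears in the comparison tables but has no example above. One common variant treats the bucket as a meter rather than a queue; a sketch:

```typescript
// Sketch: leaky bucket as a meter. Each request adds "water"; water
// drains at a constant rate; requests overflow when the bucket is full.
class LeakyBucketRateLimiter {
  private buckets = new Map<string, { level: number; lastLeak: number }>();

  constructor(
    private readonly capacity: number,       // max level before overflow
    private readonly leakRatePerSec: number, // constant drain rate
  ) {}

  isAllowed(key: string, cost = 1): boolean {
    const now = Date.now();
    const bucket = this.buckets.get(key) ?? { level: 0, lastLeak: now };
    // Drain water proportional to elapsed time
    const elapsedSec = (now - bucket.lastLeak) / 1000;
    bucket.level = Math.max(0, bucket.level - elapsedSec * this.leakRatePerSec);
    bucket.lastLeak = now;
    // Admit the request only if it fits in the bucket
    const allowed = bucket.level + cost <= this.capacity;
    if (allowed) bucket.level += cost;
    this.buckets.set(key, bucket);
    return allowed;
  }
}
```

Unlike token bucket, output is smoothed toward a constant rate: bursts fill the bucket quickly and are then rejected until it drains.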
Rate Limiting Headers
When rate limiting, return informative headers:
| Header | Format | Purpose |
|---|---|---|
| X-RateLimit-Limit | 100 | Total requests allowed |
| X-RateLimit-Remaining | 97 | Requests left in window |
| X-RateLimit-Reset | 1711838400 | Unix timestamp when window resets |
| Retry-After | 60 | Seconds until user can retry (when limited) |
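One way to derive these headers from limiter state, framework-agnostic (the `RateLimitState` input shape is an assumption, not a standard):

```typescript
// Sketch: build rate-limit response headers from limiter state.
interface RateLimitState {
  limit: number;     // requests allowed per window
  remaining: number; // requests left in the window
  resetAtMs: number; // epoch milliseconds when the window resets
}

function rateLimitHeaders(state: RateLimitState): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(Math.max(0, state.remaining)),
    "X-RateLimit-Reset": String(Math.ceil(state.resetAtMs / 1000)), // Unix seconds
  };
  if (state.remaining <= 0) {
    // Only send Retry-After when the request was actually limited
    const waitSec = Math.max(0, Math.ceil((state.resetAtMs - Date.now()) / 1000));
    headers["Retry-After"] = String(waitSec);
  }
  return headers;
}
```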
Common Pitfalls
- Wrong scope: Using global limits when per-user is needed (or vice versa)
- No burst allowance: Rejecting legitimate bursts of activity
- Tight limits: Blocking normal user behavior
- No informative headers: Users can't tell when they'll be unblocked
- In-memory only: Limits reset on restart (use Redis for distributed systems)
- Ignoring clock skew: Window boundaries can disagree across instances with unsynchronized clocks
Production Checklist
- ✓ Choose appropriate scope (global, per-user, per-IP)
- ✓ Select algorithm based on requirements
- ✓ Use persistent storage (Redis) for distributed systems
- ✓ Return rate limit headers in responses
- ✓ Implement graceful degradation (not hard errors)
- ✓ Log rate limit violations for abuse detection
- ✓ Monitor rate limit effectiveness
- ✓ Test limit boundary conditions
Graceful Failure and Fallback
The Problem
When something goes wrong, most systems default to throwing errors and showing users generic error messages. This creates a terrible user experience and can make minor issues feel like complete outages.
In production, partial failures are normal. A single dependency going down shouldn't take down your entire application.
The Solution
Graceful failure means your system continues to provide useful functionality even when some components are unavailable. Instead of all-or-nothing, you provide degraded service.
Fallback means having alternative ways to accomplish a task when the primary method fails.
The Fallback Hierarchy
Primary Operation
│
├── Success → Return Result
│
└── Failure
│
├── Try Cache → Return Stale Data
│
└── Cache Miss
│
├── Try Secondary Service → Return Result
│
└── Secondary Fails
│
└── Return Safe Default

Fallback Strategies
| Strategy | Description | Example | Trade-off |
|---|---|---|---|
| Static Fallback | Return predefined default value | Empty list, default user | Simple but may not be useful |
| Cached Response | Serve previously cached data | 5-minute old product data | Stale but functional |
| Feature Disable | Hide non-critical functionality | Disable recommendations | Core features still work |
| Queue for Later | Accept request, process asynchronously | Email sending | Async processing complexity |
| Secondary Service | Use alternative implementation | Read from replica DB | Additional cost/complexity |
| Partial Response | Return what's available | Show loaded products, hide prices | Incomplete but useful |
Example: Resilient User Service
class ResilientUserService implements UserService {
constructor(
private primary: UserService, // Main database
private cache: CacheService, // Redis cache
private fallback: UserService, // Read replica or backup
private metrics: MetricsService,
) {}
async getUser(id: string): Promise<User> {
try {
// 1. Try primary source
const user = await this.primary.getUser(id);
await this.cache.set(`user:${id}`, user, { ttl: 300 }); // Cache for 5 minutes
return user;
} catch (primaryError) {
this.metrics.increment("user_service.primary_failure");
try {
// 2. Try cache
const cached = await this.cache.get<User>(`user:${id}`);
if (cached) {
this.metrics.increment("user_service.cache_hit");
cached.stale = true; // Mark as stale so UI can handle
return cached;
}
} catch (cacheError) {
this.metrics.increment("user_service.cache_failure");
}
try {
// 3. Try fallback service
const user = await this.fallback.getUser(id);
this.metrics.increment("user_service.fallback_success");
return user;
} catch (fallbackError) {
this.metrics.increment("user_service.fallback_failure");
// 4. Return safe default
return this.getDefaultUser(id);
}
}
}
private getDefaultUser(id: string): User {
return {
id,
name: "Guest User",
isDefault: true,
isAnonymous: true,
};
}
}

Principles of Graceful Degradation
1. Identify Critical vs. Non-Critical Paths
Not all features are equally important:
| Critical (Must Work) | Non-Critical (Can Fail) |
|---|---|
| Authentication | Recommendations |
| Core transactions | Search filters |
| Payment processing | Analytics tracking |
| Data persistence | Real-time notifications |
| Security features | Social sharing |
2. Design Fallbacks Early
Don't add fallbacks as an afterthought. Design them into your architecture from the start:
// Design with fallback in mind
interface ProductCatalog {
getProducts(): Promise<Product[]>;
// What if this fails? What's the fallback?
}

3. Make Fallbacks Visible to Users
When serving degraded data, let users know:
interface User {
id: string;
name: string;
stale?: boolean; // Flag indicates data might be old
isDefault?: boolean; // Flag indicates this is a default
}
// UI can show: "Showing cached data (may be outdated)"

4. Monitor Fallback Usage
If you're constantly hitting fallbacks, something is wrong:
if (metrics.get("user_service.cache_hit").rate > 0.5) {
alert("Primary user service is degraded! 50% of requests served from cache.");
}

Real-World Examples
Netflix's Fallback Strategy
When Netflix's recommendation engine fails, they don't show an error. Instead:
- Fall back to popular/trending content
- Fall back to user's watch history
- Fall back to curated collections
- Show a message: "Recommendations temporarily unavailable, here's what's popular"
Amazon's Add to Cart
When the inventory service is unavailable:
- Still allow adding to cart
- Show "in stock" with caveat
- Validate inventory at checkout
- If out of stock then, notify user
Slack's Message Sending
When the primary message store is down:
- Store message locally
- Show "pending" indicator
- Retry in background
- Sync when connection restored
Common Pitfalls
- Silent failures: Users don't know something is degraded
- Cascading fallbacks: Fallback depends on the same failing system
- Stale data freshness: Not indicating when data is old
- No fallback: All-or-nothing approach
- Complex fallback chains: Too many levels make debugging hard
- Ignoring errors: Not logging or monitoring fallback usage
Production Checklist
- ✓ Identify critical vs. non-critical features
- ✓ Design fallback paths for all critical operations
- ✓ Use caching as a primary fallback strategy
- ✓ Implement safe defaults for all user-facing data
- ✓ Mark degraded responses (stale, default, etc.)
- ✓ Monitor fallback usage rates
- ✓ Alert when fallback usage is high
- ✓ Test fallback paths regularly
Idempotency
The Problem
In distributed systems, networks are unreliable. A request might:
- Timeout before reaching the server
- Reach the server but timeout waiting for response
- Get a 5xx error from the server
- Succeed but the response gets lost
In all these cases, the client naturally wants to retry the request. But retrying non-idempotent operations causes problems:
Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (success)
Result: Customer charged $300 instead of $100!

The Solution
An operation is idempotent if performing it multiple times produces the same result as performing it once.
| Operation | Idempotent? | Reason |
|---|---|---|
| GET /users/1 | ✅ Yes | Reading doesn't change state |
| PUT /users/1 | ✅ Yes | Same payload = same final state |
| DELETE /users/1 | ✅ Yes | Once deleted, stays deleted |
| POST /orders | ❌ No | Creates new order each time |
| POST /payments | ❌ No | Charges each time |
For non-idempotent operations, use an idempotency key to make them safe to retry.
Idempotency Key Pattern
The idempotency key is a unique identifier provided by the client. The server:
- Checks if this key has been processed before
- If yes, returns the stored result
- If no, processes the request and stores the result with the key
Request 1: POST /payments
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Process payment, store result for abc123, return result
Request 2: POST /payments (retry)
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Return stored result for abc123 (no new charge)
Request 3: POST /payments (different request)
Headers: Idempotency-Key: xyz789
Body: { amount: 50 }
→ Process payment, store result for xyz789, return result

Idempotency Key Design
1. Key Generation
// Option 1: Client-generated UUID (most common)
const idempotencyKey = crypto.randomUUID(); // "550e8400-e29b-41d4-a716-446655440000"
// Option 2: Deterministic based on request details
function generateDeterministicKey(
userId: string,
operation: string,
params: object,
): string {
const hash = createHash("sha256");
hash.update(userId);
hash.update(operation);
hash.update(JSON.stringify(params));
return hash.digest("hex"); // "a3d5e9f2..."
}
// Option 3: Combination
const idempotencyKey = `${userId}:${operation}:${uuid()}`;

Recommendation: Use client-generated UUIDs for maximum flexibility. Use deterministic keys only if you need deduplication across clients.
2. Key Scope
Idempotency keys should be scoped to prevent collisions:
| Scope | Example | When to Use |
|---|---|---|
| Global | abc123 | Simple systems, single operation type |
| Per-user | user123:abc123 | Multi-tenant systems |
| Per-operation | payment:abc123 | Multiple operation types |
| Combined | user123:payment:abc123 | Complex systems |
3. Key Expiration
Don't store idempotency keys forever:
| Use Case | Recommended TTL |
|---|---|
| Payments | 24-48 hours (dispute window) |
| Orders | 7 days (typical order lifecycle) |
| Email sending | 1 hour (retry window) |
| General purpose | 24 hours |
Implementation
Simple In-Memory Version
class IdempotencyService {
private results = new Map<string, { result: any; timestamp: number }>();
async execute<T>(
key: string,
operation: () => Promise<T>,
ttlMs: number = 86400000, // 24 hours
): Promise<T> {
const existing = this.results.get(key);
// Return cached result if exists and not expired
if (existing) {
if (Date.now() - existing.timestamp < ttlMs) {
console.log(`Returning cached result for idempotency key: ${key}`);
return existing.result;
} else {
// Expired, remove it
this.results.delete(key);
}
}
// Execute operation
const result = await operation();
// Store result
this.results.set(key, { result, timestamp: Date.now() });
return result;
}
}

Database-Persisted Version
// Schema for idempotency storage
interface IdempotencyRecord {
key: string; // Primary key
result: string; // JSON serialized result
createdAt: Date;
expiresAt: Date;
requestParams?: string; // For debugging/verification
}
async function idempotentOperation<T>(
db: Database,
key: string,
operation: () => Promise<T>,
options: {
ttlMs?: number;
verifyParams?: object; // Optionally verify params match
} = {},
): Promise<T> {
const { ttlMs = 86400000, verifyParams } = options;
const now = new Date();
// Check for existing record
const existing = await db.idempotencyKeys.findUnique({
where: { key },
});
if (existing) {
// Verify not expired
if (existing.expiresAt > now) {
// Optionally verify params match
if (verifyParams && existing.requestParams) {
const storedParams = JSON.parse(existing.requestParams);
if (!deepEqual(storedParams, verifyParams)) {
throw new Error("Idempotency key reused with different parameters");
}
}
console.log(`Returning cached result for idempotency key: ${key}`);
return JSON.parse(existing.result);
} else {
// Delete expired record
await db.idempotencyKeys.delete({ where: { key } });
}
}
// Execute operation
const result = await operation();
// Store result with expiration
await db.idempotencyKeys.create({
data: {
key,
result: JSON.stringify(result),
requestParams: verifyParams ? JSON.stringify(verifyParams) : null,
createdAt: now,
expiresAt: new Date(now.getTime() + ttlMs),
},
});
return result;
}
// Usage
const payment = await idempotentOperation(
db,
idempotencyKey,
() => paymentProcessor.charge(amount, customerId),
{
ttlMs: 172800000, // 48 hours for payments
verifyParams: { amount, customerId },
},
);

Best Practices
1. Always Return the Same Response
When returning a cached idempotency result, return the exact same response including:
- Same status code
- Same headers
- Same body
- Same timestamps
// Store the full response
interface StoredResponse {
status: number;
headers: Record<string, string>;
body: any;
timestamp: Date;
}

2. Handle Concurrent Requests
Two requests with the same idempotency key might arrive simultaneously:
async function executeIdempotently<T>(key: string, operation: () => Promise<T>): Promise<T> {
// Use database transaction to prevent race conditions
return await db.transaction(async (tx) => {
const existing = await tx.idempotencyKeys.findUnique({ where: { key } });
if (existing) {
return JSON.parse(existing.result);
}
const result = await operation();
await tx.idempotencyKeys.create({
data: { key, result: JSON.stringify(result), expiresAt: new Date(Date.now() + 86400000) } // e.g. 24-hour TTL
});
return result;
});
}

3. Idempotency Key in Response
Always include the idempotency key in the response:
// Request
POST /payments
Idempotency-Key: abc123
// Response
201 Created
X-Idempotency-Key: abc123

This allows clients to verify their request was processed.
4. HTTP Methods and Idempotency
| HTTP Method | Idempotent by Default? | Should Use Idempotency Key? |
|---|---|---|
| GET | ✅ Yes | No |
| HEAD | ✅ Yes | No |
| OPTIONS | ✅ Yes | No |
| PUT | ✅ Yes | Optional (for idempotency checks) |
| DELETE | ✅ Yes | Optional (for idempotency checks) |
| POST | ❌ No | Yes |
| PATCH | ❌ No | Yes |
Common Pitfalls
- Not persisting keys: Lost on restart, breaks idempotency
- No expiration: Storage grows indefinitely
- Wrong scope: Keys collide between users or operations
- Changing stored results: Must return exact same response
- Ignoring concurrent requests: Race conditions cause duplicate processing
- Not verifying parameters: Reusing key with different params causes confusion
Production Checklist
- ✓ Use idempotency keys for all POST/PATCH operations
- ✓ Generate keys client-side (UUID recommended)
- ✓ Persist keys in database (not just memory)
- ✓ Set appropriate TTL for your use case
- ✓ Return exact same response on idempotency hit
- ✓ Handle concurrent requests with transactions
- ✓ Return idempotency key in response headers
- ✓ Verify parameters match on key reuse (optional but recommended)
- ✓ Monitor idempotency key usage and hit rates
- ✓ Document idempotency behavior for API consumers
Putting It All Together
The Defense in Depth Approach
These patterns work best when used together as a layered defense:
┌─────────────────────────────────────────────────────────┐
│ Request Flow │
└─────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────┐
│ 1. Rate Limiting │ ← Protect your resources
│ (Token Bucket) │ from being overwhelmed
└──────────┬─────────────┘
│ Allowed
▼
┌────────────────────────┐
│ 2. Idempotency Check │ ← Prevent duplicate
│ (Database Lookup) │ processing
└──────────┬─────────────┘
│ New request
▼
┌────────────────────────┐
│ 3. Circuit Breaker │ ← Stop calling failing
│ (State Check) │ downstream services
└──────────┬─────────────┘
│ Closed/Half-Open
▼
┌────────────────────────┐
│ 4. Retry Logic │ ← Handle transient
│ (Backoff + Jitter) │ failures
└──────────┬─────────────┘
│ Success or Exhausted
▼
┌────────────────────────┐
│ 5. Primary Operation │ ← Actual business logic
└──────────┬─────────────┘
│
├── Success → Cache result
│
└── Failure → Fallback strategies

Example: Resilient Order Service
class ResilientOrderService {
private circuitBreaker: CircuitBreaker;
private rateLimiter: TokenBucketRateLimiter;
private idempotencyService: IdempotencyService;
private cacheService: CacheService;
private fallbackService: OrderService;
constructor(
private database: Database,
private paymentGateway: PaymentGateway,
private inventoryService: InventoryService,
private messageQueue: MessageQueue,
) {
// Circuit breaker for payment gateway (can be flaky)
this.circuitBreaker = new CircuitBreaker({
failureThreshold: 5,
resetTimeoutMs: 30000, // 30 seconds
halfOpenSuccessThreshold: 3,
});
// Rate limit: 100 orders per minute per user
this.rateLimiter = new TokenBucketRateLimiter({
capacity: 100,
refillRate: 100 / 60, // ~1.67 per second
refillInterval: 1000,
});
// Idempotency: 24 hour TTL for orders
this.idempotencyService = new IdempotencyService({ ttlMs: 86400000 });
}
async createOrder(request: CreateOrderRequest): Promise<Order> {
const { idempotencyKey, userId, items, paymentMethod } = request;
// 1. Rate limiting (per user)
if (!this.rateLimiter.isAllowed(userId, 1)) {
throw new RateLimitError({
limit: 100,
window: "1 minute",
retryAfter: 60,
});
}
// 2. Idempotency check
return await this.idempotencyService.execute(
idempotencyKey,
async () => await this.processOrder(request),
{ ttlMs: 86400000, verifyParams: request },
);
}
private async processOrder(request: CreateOrderRequest): Promise<Order> {
const { userId, items, paymentMethod } = request;
try {
// 3. Reserve inventory (with fallback)
const inventory = await this.withFallback(
() => this.inventoryService.reserve(items),
async () => {
// Fallback: Check inventory optimistically
const available = await this.cacheService.getInventory();
return this.inventoryService.reserveOptimistically(items, available);
},
() => ({ reserved: true, optimistic: true }),
);
if (!inventory.reserved) {
throw new OutOfStockError();
}
// 4. Process payment (with circuit breaker + retry)
const payment = await this.circuitBreaker.execute(async () => {
return await retryWithBackoffAndJitter(
() =>
this.paymentGateway.charge({
amount: this.calculateTotal(items),
method: paymentMethod,
userId,
}),
{ maxRetries: 3, baseDelayMs: 100 },
);
});
// 5. Create order in database
const order = await this.database.orders.create({
userId,
items,
paymentId: payment.id,
status: "CONFIRMED",
createdAt: new Date(),
});
// 6. Cache for future reads
await this.cacheService.set(`order:${order.id}`, order, { ttl: 300 });
// 7. Publish order event (async, non-blocking)
this.messageQueue
.publish("orders.created", { orderId: order.id })
.catch((err) => {
console.error("Failed to publish order event:", err);
});
return order;
} catch (error) {
// Handle different error types
if (error instanceof CircuitBreakerOpenError) {
// Payment gateway is down, queue for later
await this.messageQueue.publish("orders.pending", { request });
return this.createPendingOrder(request, "PAYMENT_UNAVAILABLE");
}
if (error instanceof PaymentError) {
// Payment failed, don't retry
throw new OrderCreationError("Payment failed", { code: error.code });
}
if (error instanceof DatabaseError) {
// Database issue, try fallback
try {
return await this.fallbackService.createOrder(request);
} catch (fallbackError) {
// Last resort: Queue for processing
await this.messageQueue.publish("orders.pending", { request });
return this.createPendingOrder(request, "SYSTEM_UNAVAILABLE");
}
}
throw error;
}
}
private async withFallback<T>(
primary: () => Promise<T>,
fallback: () => Promise<T>,
defaultValue: () => T,
): Promise<T> {
try {
return await primary();
} catch (primaryError) {
console.error("Primary failed, trying fallback:", primaryError);
try {
return await fallback();
} catch (fallbackError) {
console.error("Fallback failed, using default:", fallbackError);
return defaultValue();
}
}
}
private createPendingOrder(
request: CreateOrderRequest,
reason: string,
): Order {
return {
id: generateId(),
userId: request.userId,
items: request.items,
status: "PENDING",
pendingReason: reason,
createdAt: new Date(),
};
}
}

Configuration Summary
| Pattern | Key Parameters | Typical Values |
|---|---|---|
| Retry | maxRetries, baseDelayMs | 3-5 retries, 100-500ms base |
| Circuit Breaker | failureThreshold, resetTimeoutMs | 5-10 failures, 30-60s timeout |
| Rate Limiting | capacity, refillRate | 100-1000 req, window varies |
| Idempotency | ttlMs | 1-48 hours (by use case) |
Monitoring Checklist
For a resilient system, monitor:
- ✓ Retry rate and success rate
- ✓ Circuit breaker state transitions
- ✓ Rate limit violations
- ✓ Idempotency key hit rate
- ✓ Fallback usage frequency
- ✓ End-to-end latency
- ✓ Error rates by type
Key Takeaways
- Failures are inevitable—design your system to handle them gracefully from the start
- Retry with backoff + jitter for transient failures, but never retry non-idempotent operations
- Circuit breakers prevent cascading failures by stopping calls to unhealthy services
- Rate limiting protects resources; choose the algorithm based on your precision vs. memory trade-off
- Graceful fallback means degraded service is better than no service—always have a plan B
- Idempotency keys make non-idempotent operations safe to retry, essential for distributed systems
- These patterns work together as layers of defense, not alternatives to each other
- Monitor everything—you can't improve what you don't measure
- Test failure paths—chaos engineering and failure injection reveal weak points
- Document your reliability strategy—your team needs to understand why these patterns exist
Remember: A system that fails gracefully is more reliable than one that never fails but goes down hard when it does.
