Reliability Patterns
Building resilient systems requires implementing patterns that handle failures gracefully. In distributed systems, failures are inevitable—networks go down, services get overloaded, databases become slow, and bugs slip through. The question isn't whether failures will happen, but how gracefully your system handles them.
This guide covers the essential reliability patterns every backend engineer should know to build systems that stay operational when things go wrong.
Retry with Exponential Backoff
The Problem
In distributed systems, failures are often transient—they resolve themselves given time. A network packet might be dropped, a database connection might be momentarily exhausted, or a dependent service might be restarting. If you simply retry immediately when a failure occurs, you can actually make the problem worse.
Consider this scenario: 1,000 clients simultaneously call a service that's overloaded. The service fails, and all 1,000 clients retry immediately. Now the service faces 2,000 requests and fails again. This creates a thundering herd that can bring down the entire system.
The Solution
Exponential backoff increases the wait time between retries exponentially (100ms, 200ms, 400ms, 800ms...). This gives the failing service time to recover while reducing load during the retry window.
Jitter adds randomness to the delay, ensuring that retries are spread out over time rather than happening in synchronized waves. Without jitter, all clients with the same retry configuration would retry at nearly the same time, recreating the thundering herd problem.
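A minimal sketch of backoff plus full jitter, assuming a generic async operation (the name `retryWithBackoffAndJitter` and its option names are illustrative, not a specific library API):

```typescript
// Sketch: retry with exponential backoff and full jitter.
// Names and defaults are illustrative assumptions.
interface RetryOptions {
  maxRetries?: number;  // retry attempts after the first try
  baseDelayMs?: number; // base for the exponential schedule
  maxDelayMs?: number;  // cap so delays don't grow unbounded
}

async function retryWithBackoffAndJitter<T>(
  operation: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 100, maxDelayMs = 10_000 }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // retries exhausted
      // Exponential backoff: base * 2^attempt, capped at maxDelayMs
      const exponential = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      // Full jitter: pick uniformly in [0, exponential)
      const delayMs = Math.random() * exponential;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The same shape works for any async call; in practice you would also gate the retry on whether the error is transient.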
When to Use Retry
Retry is appropriate for:
| Scenario | Should Retry? | Reason |
|---|---|---|
| Network timeouts | ✅ Yes | Transient connection issues |
| 5xx server errors | ✅ Yes | Server is overloaded or restarting |
| Rate limit exceeded (429) | ✅ Yes | Wait and retry after limit resets |
| 4xx client errors | ❌ No | Client error won't fix on retry |
| Authentication failures (401) | ❌ No | Invalid credentials won't work later |
| Resource not found (404) | ❌ No | Resource genuinely doesn't exist |
| Business logic errors | ❌ No | Not a transient failure |
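The table above can be encoded as a small classifier. The error shape used here (`status`, `isTimeout`) is an assumed normalization of whatever your HTTP client actually throws:

```typescript
// Sketch: decide whether a failed request is worth retrying,
// following the table above. The RequestError shape is assumed.
interface RequestError {
  status?: number;     // HTTP status, if a response was received
  isTimeout?: boolean; // network/connection timeout
}

function isRetryable(error: RequestError): boolean {
  if (error.isTimeout) return true;      // transient connection issue
  if (error.status === undefined) return false; // no status: treat as non-retryable here
  if (error.status === 429) return true; // rate limited: wait and retry
  if (error.status >= 500) return true;  // server overloaded or restarting
  return false;                          // 4xx and business errors: retrying won't help
}
```

Treating a missing status as non-retryable is a conservative choice; some clients distinguish connection-level failures and retry those too.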
Implementation Considerations
1. Never Retry Non-Idempotent Operations by Default
If an operation isn't idempotent, retrying can cause duplicate side effects:
// BAD: This could charge the customer twice
await retry(() => paymentService.charge(customerId, amount));
// GOOD: Only retry if the operation is idempotent
await retry(() =>
paymentService.charge(customerId, amount, {
idempotencyKey: generateIdempotencyKey(),
}),
);

2. Set Appropriate Retry Limits
More retries aren't always better:
| Use Case | Max Retries | Reason |
|---|---|---|
| Internal service calls | 3-5 | Balance resilience vs latency |
| External API calls | 2-3 | Respect external rate limits |
| User-facing requests | 1-2 | Fast fail for better UX |
| Background jobs | 5-10 | Can tolerate longer delays |
3. Always Implement Jitter
Without jitter, retries can synchronize and create new problems:
// WITHOUT jitter - all clients retry at 100ms, 200ms, 400ms...
// WITH jitter - clients retry at 87ms, 124ms, 156ms... (spread out)

4. Log Retry Attempts
For debugging, track retry behavior:
logger.info(`Retrying ${operationName}`, {
attempt: attempt + 1,
maxRetries,
delayMs,
error: error.message,
});

Jitter Strategies Explained
| Strategy | Formula | Characteristics |
|---|---|---|
| Full Jitter | random(0, base * 2^attempt) | Best overall choice; maximizes spread, though individual delays can be near zero |
| Equal Jitter | (base * 2^attempt) / 2 + random(0, (base * 2^attempt) / 2) | Guarantees a minimum wait; good when some delay is always wanted |
| Decorrelated Jitter | min(cap, random(base, prevDelay * 3)) | Adaptive; each delay feeds into the next, spreading retries over longer periods |
Full Jitter is the default recommendation because it provides the best spread while being simple to implement.
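The three strategies can be sketched as plain functions; parameter names (`base`, `attempt`, `prevDelay`, `cap`) are illustrative, and delays are in milliseconds:

```typescript
// Sketches of the three jitter strategies described above.
function fullJitter(base: number, attempt: number, cap = 30_000): number {
  const exponential = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exponential; // uniform in [0, base * 2^attempt)
}

function equalJitter(base: number, attempt: number, cap = 30_000): number {
  const exponential = Math.min(cap, base * 2 ** attempt);
  return exponential / 2 + Math.random() * (exponential / 2); // half fixed, half random
}

// Decorrelated jitter depends on the previous delay, not the attempt number.
// Seed prevDelay with base on the first retry.
function decorrelatedJitter(base: number, prevDelay: number, cap = 30_000): number {
  return Math.min(cap, base + Math.random() * (prevDelay * 3 - base));
}
```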
Common Pitfalls
- Retrying forever: Always set a maximum retry count
- Retrying non-idempotent operations: Can cause data corruption
- No jitter: Creates thundering herd on retry
- No timeout on retries: A slow downstream service can block requests indefinitely
- Retrying client errors: Won't work, wastes resources
Production Checklist
- ✓ Only retry transient failures (5xx, timeouts, 429)
- ✓ Always use exponential backoff
- ✓ Always add jitter to retry delays
- ✓ Set maximum retry attempts
- ✓ Set overall timeout for all retries combined
- ✓ Ensure retried operations are idempotent
- ✓ Log retry attempts with context
- ✓ Monitor retry rates in metrics
Circuit Breaker
The Problem
When a downstream service becomes unavailable or severely degraded, continuously retrying requests to it can:
- Waste resources (CPU, connections, memory) on doomed requests
- Increase latency for all callers
- Cause cascading failures that bring down your entire system
Imagine a payment gateway that's down. If every order processing request tries to call the gateway, waits for timeout, then retries, your entire order system becomes slow and overloaded. Even orders that don't need the payment gateway are affected because threads are blocked waiting on the failed gateway.
The Solution
The Circuit Breaker pattern wraps calls to a service and monitors failures. When failures exceed a threshold, the breaker "trips" and immediately rejects new calls without even attempting them. This:
- Fails fast instead of waiting for timeouts
- Reduces load on the failing service
- Gives the service time to recover
- Prevents cascading failures
How It Works
The circuit breaker has three states:
┌──────────────┐
│ CLOSED │ ← Normal state: requests pass through
└──────┬───────┘ Failures increment counter
│ Successes reset counter
│ When failureCount ≥ threshold
▼
┌──────────────┐
│ OPEN │ ← Failing state: requests rejected immediately
└──────┬───────┘ After resetTimeout, transition to HALF_OPEN
│
│ timeout expires
▼
┌──────────────┐
│ HALF-OPEN │ ← Testing state: allow some requests through
└──────┬───────┘ Success → CLOSED, Failure → OPEN
│
├── success threshold met → CLOSED
└── any failure → OPEN

When to Use Circuit Breaker
Circuit breakers are essential for:
| Scenario | Why Circuit Breaker? |
|---|---|
| External API calls | APIs can go down or throttle |
| Database connections | Prevents connection pool exhaustion |
| Microservice dependencies | Prevents cascading failures |
| Expensive operations | Fails fast instead of waiting |
| Rate-limited services | Respects limits by backing off |
Configuration Guidelines
Failure Threshold
The number of consecutive failures before opening the circuit:
| Traffic Pattern | Recommended Threshold |
|---|---|
| Low volume (< 10 req/s) | 3-5 |
| Medium volume (10-100 req/s) | 5-10 |
| High volume (> 100 req/s) | 10-20 |
Too low: Circuit trips too easily on normal variance.
Too high: Too many failures occur before protection kicks in.
Reset Timeout
How long to stay open before testing recovery:
| Scenario | Recommended Timeout |
|---|---|
| Quick recovery expected (restarts) | 10-30 seconds |
| Normal operations | 30-60 seconds |
| Slow recovery (database issues) | 2-5 minutes |
Half-Open Success Threshold
How many consecutive successes in HALF-OPEN before closing:
| Recommendation | Value |
|---|---|
| Default | 3-5 |
| High-stakes operations | 5-10 |
| Low-stakes operations | 1-2 |
This confirms stable recovery before fully closing.
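A minimal sketch tying the three states and thresholds together (a sketch only — production breakers also need call timeouts, failure-rate windows, and metrics):

```typescript
// Sketch of the CLOSED → OPEN → HALF_OPEN state machine described above.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 30_000,
    private halfOpenSuccessThreshold = 3,
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("Circuit breaker is OPEN"); // fail fast, no call made
      }
      this.state = "HALF_OPEN"; // reset timeout expired: probe recovery
      this.successes = 0;
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === "HALF_OPEN") {
      if (++this.successes >= this.halfOpenSuccessThreshold) {
        this.state = "CLOSED"; // stable recovery confirmed
        this.failures = 0;
      }
    } else {
      this.failures = 0; // successes reset the counter in CLOSED
    }
  }

  private onFailure(): void {
    // Any failure in HALF_OPEN, or too many in CLOSED, opens the circuit
    if (this.state === "HALF_OPEN" || ++this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }
}
```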
Implementation Best Practices
1. Distinguish Between Failures
Not all errors should increment the failure counter:
class CircuitBreaker {
private shouldCountAsFailure(error: Error): boolean {
// Count these as failures
if (error instanceof TimeoutError) return true;
if (error instanceof NetworkError) return true;
if (error.response?.status >= 500) return true;
// Don't count these as failures
if (error.response?.status === 404) return false; // Not found isn't a service failure
if (error.response?.status === 401) return false; // Auth issue, not service issue
// Default: count as failure for safety
return true;
}
}

2. Provide Fallback Responses
When the circuit is open, don't just throw an error. Provide a graceful fallback:
try {
return await circuitBreaker.execute(() => fetchRecommendations());
} catch (error) {
if (error.message.includes("Circuit breaker is OPEN")) {
// Return cached or default recommendations
return await getCachedRecommendations();
}
throw error;
}

3. Monitor Circuit State
Expose metrics for observability:
metrics.gauge("circuit_breaker.state", { service: "payment-gateway" }, state);
metrics
.counter("circuit_breaker.failures", { service: "payment-gateway" })
.increment();
metrics
.counter("circuit_breaker.successes", { service: "payment-gateway" })
.increment();

Common Pitfalls
- Too sensitive: Circuit trips on normal variance
- Too slow to recover: Stays open too long, rejecting valid requests
- No fallback: Users see errors when circuit is open
- No monitoring: Can't tell when circuits are tripping
- Wrong error classification: Opens on client errors (4xx) instead of service issues (5xx)
- Missing HALF-OPEN: Goes directly to CLOSED, risking immediate re-trip
Circuit Breaker vs Retry
| Aspect | Circuit Breaker | Retry |
|---|---|---|
| Purpose | Stop calling failing service | Retry individual failed calls |
| Scope | Service-level protection | Request-level resilience |
| State | Has state (CLOSED, OPEN, HALF-OPEN) | Stateless (typically) |
| Use Together? | ✅ Yes | ✅ Yes |
Best Practice: Use circuit breaker to protect against cascading failures, and retry to handle transient issues. They complement each other.
Production Checklist
- ✓ Configure appropriate failure threshold for your traffic volume
- ✓ Set reset timeout based on expected recovery time
- ✓ Implement HALF-OPEN state to test recovery
- ✓ Distinguish service failures from client errors
- ✓ Provide fallback responses when circuit is open
- ✓ Expose circuit state and metrics
- ✓ Log state transitions for debugging
- ✓ Test circuit trip and recovery behavior
- ✓ Monitor circuit breaker tripping in production
Rate Limiting
The Problem
Without rate limiting, your service is vulnerable to:
- Abuse: Malicious actors can overwhelm your service with requests
- Accidental overload: Bugs or misconfigurations can cause excessive traffic
- Cost overruns: External API calls, database queries, and third-party services often cost per request
- Fairness: A single user consuming all resources impacts others
- Downstream protection: Prevent overwhelming dependent services
The Solution
Rate limiting controls the frequency of requests to protect resources, ensure fair usage, and prevent abuse. It enforces limits on how many requests can be made within a time window.
Rate Limiting Scope
| Scope | Description | Use Case |
|---|---|---|
| Global | All requests across the entire service | Overall capacity protection |
| Per User | Limits per authenticated user | Fair usage enforcement |
| Per IP | Limits per IP address | DDoS mitigation, guest users |
| Per API Key | Limits per API key | API tier management |
| Per Endpoint | Limits per specific endpoint | Protect expensive operations |
Algorithm Comparison
| Algorithm | Memory | Precision | Best For |
|---|---|---|---|
| Fixed Window | O(1) per key | Low (can burst at edges) | Simple limits, low memory |
| Sliding Window | O(n) per key | High | Strict enforcement, fairness |
| Token Bucket | O(1) per key | Medium (smooth) | API rate limiting, burst allowance |
| Leaky Bucket | O(1) per key | Medium (constant) | Traffic shaping, smoothing |
Fixed Window Counter
How it works: Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reset count at window boundary.
class FixedWindowRateLimiter {
private counters: Map<string, { count: number; windowStart: number }> =
new Map();
isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
const now = Date.now();
const current = this.counters.get(key);
// Start new window if needed
if (!current || now - current.windowStart >= windowMs) {
this.counters.set(key, { count: 1, windowStart: now });
return true;
}
// Check limit
if (current.count < maxRequests) {
current.count++;
return true;
}
return false;
}
}

Pros: Simple, constant memory, fast
Cons: Can burst at window boundaries (e.g., 100 requests at :59 and 100 at :01 = 200 in 2 seconds for a 100/minute limit)
Sliding Window Log
How it works: Store timestamp of each request. Remove timestamps outside the window. Count remaining.
class SlidingWindowRateLimiter {
private timestamps: Map<string, number[]> = new Map();
isAllowed(key: string, maxRequests: number, windowMs: number): boolean {
const now = Date.now();
const windowStart = now - windowMs;
// Get and filter timestamps for this key
let timestamps = this.timestamps.get(key) || [];
timestamps = timestamps.filter((ts) => ts > windowStart);
// Check if under limit
if (timestamps.length < maxRequests) {
timestamps.push(now);
this.timestamps.set(key, timestamps);
return true;
}
return false;
}
}

Pros: Precise, no boundary bursts
Cons: O(n) memory per key, slower with high request volume
Token Bucket
How it works: Imagine a bucket that fills with tokens at a constant rate. Each request consumes a token. If bucket is empty, request is rejected.
class TokenBucketRateLimiter {
private tokens: Map<string, { count: number; lastRefill: number }> =
new Map();
constructor(
private readonly capacity: number, // Max tokens in bucket
private readonly refillRate: number, // Tokens per second
private readonly refillInterval: number = 1000,
) {}
isAllowed(key: string, tokensNeeded: number = 1): boolean {
const now = Date.now();
let state = this.tokens.get(key);
// Initialize new bucket
if (!state) {
state = { count: this.capacity, lastRefill: now };
this.tokens.set(key, state);
}
// Refill tokens
const elapsed = now - state.lastRefill;
if (elapsed >= this.refillInterval) {
const tokensToAdd = Math.floor(
(elapsed / this.refillInterval) * this.refillRate,
);
state.count = Math.min(this.capacity, state.count + tokensToAdd);
state.lastRefill = now;
}
// Check if enough tokens
if (state.count >= tokensNeeded) {
state.count -= tokensNeeded;
return true;
}
return false;
}
}

Pros: O(1) memory, allows bursts (up to capacity), smooths traffic
Cons: More complex configuration
Choosing the Right Algorithm
| Scenario | Recommended Algorithm | Why |
|---|---|---|
| Simple API rate limiting | Token Bucket | Smooth traffic, allows brief bursts |
| Strict fairness requirement | Sliding Window | Precise enforcement |
| High-volume, memory-constrained | Fixed Window | Lowest memory overhead |
| Traffic shaping (output) | Leaky Bucket | Constant output rate |
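Leaky Bucket appears in the comparison tables but has no example above. One common variant treats the bucket as a meter rather than a queue; a sketch:

```typescript
// Sketch: leaky bucket as a meter. Each request adds "water"; water
// drains at a constant rate; requests overflow when the bucket is full.
class LeakyBucketRateLimiter {
  private buckets = new Map<string, { level: number; lastLeak: number }>();

  constructor(
    private readonly capacity: number,       // max level before overflow
    private readonly leakRatePerSec: number, // constant drain rate
  ) {}

  isAllowed(key: string, cost = 1): boolean {
    const now = Date.now();
    const bucket = this.buckets.get(key) ?? { level: 0, lastLeak: now };
    // Drain water proportional to elapsed time
    const elapsedSec = (now - bucket.lastLeak) / 1000;
    bucket.level = Math.max(0, bucket.level - elapsedSec * this.leakRatePerSec);
    bucket.lastLeak = now;
    // Admit the request only if it fits in the bucket
    const allowed = bucket.level + cost <= this.capacity;
    if (allowed) bucket.level += cost;
    this.buckets.set(key, bucket);
    return allowed;
  }
}
```

Unlike token bucket, output is smoothed toward a constant rate: bursts fill the bucket quickly and are then rejected until it drains.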
Rate Limiting Headers
When rate limiting, return informative headers:
| Header | Format | Purpose |
|---|---|---|
| X-RateLimit-Limit | 100 | Total requests allowed |
| X-RateLimit-Remaining | 97 | Requests left in window |
| X-RateLimit-Reset | 1711838400 | Unix timestamp when window resets |
| Retry-After | 60 | Seconds until user can retry (when limited) |
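One way to derive these headers from limiter state, framework-agnostic (the `RateLimitState` input shape is an assumption, not a standard):

```typescript
// Sketch: build rate-limit response headers from limiter state.
interface RateLimitState {
  limit: number;     // requests allowed per window
  remaining: number; // requests left in the window
  resetAtMs: number; // epoch milliseconds when the window resets
}

function rateLimitHeaders(state: RateLimitState): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(Math.max(0, state.remaining)),
    "X-RateLimit-Reset": String(Math.ceil(state.resetAtMs / 1000)), // Unix seconds
  };
  if (state.remaining <= 0) {
    // Only send Retry-After when the request was actually limited
    const waitSec = Math.max(0, Math.ceil((state.resetAtMs - Date.now()) / 1000));
    headers["Retry-After"] = String(waitSec);
  }
  return headers;
}
```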
Common Pitfalls
- Wrong scope: Using global limits when per-user is needed (or vice versa)
- No burst allowance: Rejecting legitimate bursts of activity
- Tight limits: Blocking normal user behavior
- No informative headers: Users can't tell when they'll be unblocked
- In-memory only: Limits reset on restart (use Redis for distributed systems)
- Ignoring clock skew: Window boundaries can disagree across instances with unsynchronized clocks
Production Checklist
- ✓ Choose appropriate scope (global, per-user, per-IP)
- ✓ Select algorithm based on requirements
- ✓ Use persistent storage (Redis) for distributed systems
- ✓ Return rate limit headers in responses
- ✓ Implement graceful degradation (not hard errors)
- ✓ Log rate limit violations for abuse detection
- ✓ Monitor rate limit effectiveness
- ✓ Test limit boundary conditions
Graceful Failure and Fallback
The Problem
When something goes wrong, most systems default to throwing errors and showing users generic error messages. This creates a terrible user experience and can make minor issues feel like complete outages.
In production, partial failures are normal. A single dependency going down shouldn't take down your entire application.
The Solution
Graceful failure means your system continues to provide useful functionality even when some components are unavailable. Instead of all-or-nothing, you provide degraded service.
Fallback means having alternative ways to accomplish a task when the primary method fails.
The Fallback Hierarchy
Primary Operation
│
├── Success → Return Result
│
└── Failure
│
├── Try Cache → Return Stale Data
│
└── Cache Miss
│
├── Try Secondary Service → Return Result
│
└── Secondary Fails
│
└── Return Safe Default

Fallback Strategies
| Strategy | Description | Example | Trade-off |
|---|---|---|---|
| Static Fallback | Return predefined default value | Empty list, default user | Simple but may not be useful |
| Cached Response | Serve previously cached data | 5-minute old product data | Stale but functional |
| Feature Disable | Hide non-critical functionality | Disable recommendations | Core features still work |
| Queue for Later | Accept request, process asynchronously | Email sending | Async processing complexity |
| Secondary Service | Use alternative implementation | Read from replica DB | Additional cost/complexity |
| Partial Response | Return what's available | Show loaded products, hide prices | Incomplete but useful |
Example: Resilient User Service
class ResilientUserService implements UserService {
constructor(
private primary: UserService, // Main database
private cache: CacheService, // Redis cache
private fallback: UserService, // Read replica or backup
private metrics: MetricsService,
) {}
async getUser(id: string): Promise<User> {
try {
// 1. Try primary source
const user = await this.primary.getUser(id);
await this.cache.set(`user:${id}`, user, { ttl: 300 }); // Cache for 5 minutes
return user;
} catch (primaryError) {
this.metrics.increment("user_service.primary_failure");
try {
// 2. Try cache
const cached = await this.cache.get<User>(`user:${id}`);
if (cached) {
this.metrics.increment("user_service.cache_hit");
cached.stale = true; // Mark as stale so UI can handle
return cached;
}
} catch (cacheError) {
this.metrics.increment("user_service.cache_failure");
}
try {
// 3. Try fallback service
const user = await this.fallback.getUser(id);
this.metrics.increment("user_service.fallback_success");
return user;
} catch (fallbackError) {
this.metrics.increment("user_service.fallback_failure");
// 4. Return safe default
return this.getDefaultUser(id);
}
}
}
private getDefaultUser(id: string): User {
return {
id,
name: "Guest User",
isDefault: true,
isAnonymous: true,
};
}
}

Principles of Graceful Degradation
1. Identify Critical vs. Non-Critical Paths
Not all features are equally important:
| Critical (Must Work) | Non-Critical (Can Fail) |
|---|---|
| Authentication | Recommendations |
| Core transactions | Search filters |
| Payment processing | Analytics tracking |
| Data persistence | Real-time notifications |
| Security features | Social sharing |
2. Design Fallbacks Early
Don't add fallbacks as an afterthought. Design them into your architecture from the start:
// Design with fallback in mind
interface ProductCatalog {
getProducts(): Promise<Product[]>;
// What if this fails? What's the fallback?
}

3. Make Fallbacks Visible to Users
When serving degraded data, let users know:
interface User {
id: string;
name: string;
stale?: boolean; // Flag indicates data might be old
isDefault?: boolean; // Flag indicates this is a default
}
// UI can show: "Showing cached data (may be outdated)"

4. Monitor Fallback Usage
If you're constantly hitting fallbacks, something is wrong:
if (metrics.get("user_service.cache_hit").rate > 0.5) {
alert("Primary user service is degraded! 50% of requests served from cache.");
}

Real-World Examples
Netflix's Fallback Strategy
When Netflix's recommendation engine fails, they don't show an error. Instead:
- Fall back to popular/trending content
- Fall back to user's watch history
- Fall back to curated collections
- Show a message: "Recommendations temporarily unavailable, here's what's popular"
Amazon's Add to Cart
When the inventory service is unavailable:
- Still allow adding to cart
- Show "in stock" with caveat
- Validate inventory at checkout
- If out of stock then, notify user
Slack's Message Sending
When the primary message store is down:
- Store message locally
- Show "pending" indicator
- Retry in background
- Sync when connection restored
Common Pitfalls
- Silent failures: Users don't know something is degraded
- Cascading fallbacks: Fallback depends on the same failing system
- Stale data freshness: Not indicating when data is old
- No fallback: All-or-nothing approach
- Complex fallback chains: Too many levels make debugging hard
- Ignoring errors: Not logging or monitoring fallback usage
Production Checklist
- ✓ Identify critical vs. non-critical features
- ✓ Design fallback paths for all critical operations
- ✓ Use caching as a primary fallback strategy
- ✓ Implement safe defaults for all user-facing data
- ✓ Mark degraded responses (stale, default, etc.)
- ✓ Monitor fallback usage rates
- ✓ Alert when fallback usage is high
- ✓ Test fallback paths regularly
Idempotency
The Problem
In distributed systems, networks are unreliable. A request might:
- Timeout before reaching the server
- Reach the server but timeout waiting for response
- Get a 5xx error from the server
- Succeed but the response gets lost
In all these cases, the client naturally wants to retry the request. But retrying non-idempotent operations causes problems:
Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (timeout)
Client → Server: POST /charge $100 (success)
Result: Customer charged $300 instead of $100!

The Solution
An operation is idempotent if performing it multiple times produces the same result as performing it once.
| Operation | Idempotent? | Reason |
|---|---|---|
| GET /users/1 | ✅ Yes | Reading doesn't change state |
| PUT /users/1 | ✅ Yes | Same payload = same final state |
| DELETE /users/1 | ✅ Yes | Once deleted, stays deleted |
| POST /orders | ❌ No | Creates new order each time |
| POST /payments | ❌ No | Charges each time |
For non-idempotent operations, use an idempotency key to make them safe to retry.
Idempotency Key Pattern
The idempotency key is a unique identifier provided by the client. The server:
- Checks if this key has been processed before
- If yes, returns the stored result
- If no, processes the request and stores the result with the key
Request 1: POST /payments
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Process payment, store result for abc123, return result
Request 2: POST /payments (retry)
Headers: Idempotency-Key: abc123
Body: { amount: 100 }
→ Return stored result for abc123 (no new charge)
Request 3: POST /payments (different request)
Headers: Idempotency-Key: xyz789
Body: { amount: 50 }
→ Process payment, store result for xyz789, return result

Idempotency Key Design
1. Key Generation
// Option 1: Client-generated UUID (most common)
const idempotencyKey = crypto.randomUUID(); // "550e8400-e29b-41d4-a716-446655440000"
// Option 2: Deterministic based on request details
function generateDeterministicKey(
userId: string,
operation: string,
params: object,
): string {
const hash = createHash("sha256");
hash.update(userId);
hash.update(operation);
hash.update(JSON.stringify(params));
return hash.digest("hex"); // "a3d5e9f2..."
}
// Option 3: Combination
const idempotencyKey = `${userId}:${operation}:${uuid()}`;

Recommendation: Use client-generated UUIDs for maximum flexibility. Use deterministic keys only if you need deduplication across clients.
2. Key Scope
Idempotency keys should be scoped to prevent collisions:
| Scope | Example | When to Use |
|---|---|---|
| Global | abc123 | Simple systems, single operation type |
| Per-user | user123:abc123 | Multi-tenant systems |
| Per-operation | payment:abc123 | Multiple operation types |
| Combined | user123:payment:abc123 | Complex systems |
3. Key Expiration
Don't store idempotency keys forever:
| Use Case | Recommended TTL |
|---|---|
| Payments | 24-48 hours (dispute window) |
| Orders | 7 days (typical order lifecycle) |
| Email sending | 1 hour (retry window) |
| General purpose | 24 hours |
Implementation
Simple In-Memory Version
class IdempotencyService {
private results = new Map<string, { result: any; timestamp: number }>();
async execute<T>(
key: string,
operation: () => Promise<T>,
ttlMs: number = 86400000, // 24 hours
): Promise<T> {
const existing = this.results.get(key);
// Return cached result if exists and not expired
if (existing) {
if (Date.now() - existing.timestamp < ttlMs) {
console.log(`Returning cached result for idempotency key: ${key}`);
return existing.result;
} else {
// Expired, remove it
this.results.delete(key);
}
}
// Execute operation
const result = await operation();
// Store result
this.results.set(key, { result, timestamp: Date.now() });
return result;
}
}

Database-Persisted Version
// Schema for idempotency storage
interface IdempotencyRecord {
key: string; // Primary key
result: string; // JSON serialized result
createdAt: Date;
expiresAt: Date;
requestParams?: string; // For debugging/verification
}
async function idempotentOperation<T>(
db: Database,
key: string,
operation: () => Promise<T>,
options: {
ttlMs?: number;
verifyParams?: object; // Optionally verify params match
} = {},
): Promise<T> {
const { ttlMs = 86400000, verifyParams } = options;
const now = new Date();
// Check for existing record
const existing = await db.idempotencyKeys.findUnique({
where: { key },
});
if (existing) {
// Verify not expired
if (existing.expiresAt > now) {
// Optionally verify params match
if (verifyParams && existing.requestParams) {
const storedParams = JSON.parse(existing.requestParams);
if (!deepEqual(storedParams, verifyParams)) {
throw new Error("Idempotency key reused with different parameters");
}
}
console.log(`Returning cached result for idempotency key: ${key}`);
return JSON.parse(existing.result);
} else {
// Delete expired record
await db.idempotencyKeys.delete({ where: { key } });
}
}
// Execute operation
const result = await operation();
// Store result with expiration
await db.idempotencyKeys.create({
data: {
key,
result: JSON.stringify(result),
requestParams: verifyParams ? JSON.stringify(verifyParams) : null,
createdAt: now,
expiresAt: new Date(now.getTime() + ttlMs),
},
});
return result;
}
// Usage
const payment = await idempotentOperation(
db,
idempotencyKey,
() => paymentProcessor.charge(amount, customerId),
{
ttlMs: 172800000, // 48 hours for payments
verifyParams: { amount, customerId },
},
);

Best Practices
1. Always Return the Same Response
When returning a cached idempotency result, return the exact same response including:
- Same status code
- Same headers
- Same body
- Same timestamps
// Store the full response
interface StoredResponse {
status: number;
headers: Record<string, string>;
body: any;
timestamp: Date;
}

2. Handle Concurrent Requests
Two requests with the same idempotency key might arrive simultaneously:
async function executeIdempotently<T>(key: string, operation: () => Promise<T>): Promise<T> {
// Use database transaction to prevent race conditions
return await db.transaction(async (tx) => {
const existing = await tx.idempotencyKeys.findUnique({ where: { key } });
if (existing) {
return JSON.parse(existing.result);
}
const result = await operation();
await tx.idempotencyKeys.create({
data: { key, result: JSON.stringify(result), expiresAt: new Date(Date.now() + 86400000) } // e.g. 24-hour TTL
});
return result;
});
}

3. Idempotency Key in Response
Always include the idempotency key in the response:
// Request
POST /payments
Idempotency-Key: abc123
// Response
201 Created
X-Idempotency-Key: abc123

This allows clients to verify their request was processed.
4. HTTP Methods and Idempotency
| HTTP Method | Idempotent by Default? | Should Use Idempotency Key? |
|---|---|---|
| GET | ✅ Yes | No |
| HEAD | ✅ Yes | No |
| OPTIONS | ✅ Yes | No |
| PUT | ✅ Yes | Optional (for idempotency checks) |
| DELETE | ✅ Yes | Optional (for idempotency checks) |
| POST | ❌ No | Yes |
| PATCH | ❌ No | Yes |
Common Pitfalls
- Not persisting keys: Lost on restart, breaks idempotency
- No expiration: Storage grows indefinitely
- Wrong scope: Keys collide between users or operations
- Changing stored results: Must return exact same response
- Ignoring concurrent requests: Race conditions cause duplicate processing
- Not verifying parameters: Reusing key with different params causes confusion
Production Checklist
- ✓ Use idempotency keys for all POST/PATCH operations
- ✓ Generate keys client-side (UUID recommended)
- ✓ Persist keys in database (not just memory)
- ✓ Set appropriate TTL for your use case
- ✓ Return exact same response on idempotency hit
- ✓ Handle concurrent requests with transactions
- ✓ Return idempotency key in response headers
- ✓ Verify parameters match on key reuse (optional but recommended)
- ✓ Monitor idempotency key usage and hit rates
- ✓ Document idempotency behavior for API consumers
Putting It All Together
The Defense in Depth Approach
These patterns work best when used together as a layered defense:
┌─────────────────────────────────────────────────────────┐
│ Request Flow │
└─────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────┐
│ 1. Rate Limiting │ ← Protect your resources
│ (Token Bucket) │ from being overwhelmed
└──────────┬─────────────┘
│ Allowed
▼
┌────────────────────────┐
│ 2. Idempotency Check │ ← Prevent duplicate
│ (Database Lookup) │ processing
└──────────┬─────────────┘
│ New request
▼
┌────────────────────────┐
│ 3. Circuit Breaker │ ← Stop calling failing
│ (State Check) │ downstream services
└──────────┬─────────────┘
│ Closed/Half-Open
▼
┌────────────────────────┐
│ 4. Retry Logic │ ← Handle transient
│ (Backoff + Jitter) │ failures
└──────────┬─────────────┘
│ Success or Exhausted
▼
┌────────────────────────┐
│ 5. Primary Operation │ ← Actual business logic
└──────────┬─────────────┘
│
├── Success → Cache result
│
└── Failure → Fallback strategies

Example: Resilient Order Service
class ResilientOrderService {
private circuitBreaker: CircuitBreaker;
private rateLimiter: TokenBucketRateLimiter;
private idempotencyService: IdempotencyService;
private cacheService: CacheService;
private fallbackService: OrderService;
constructor(
private database: Database,
private paymentGateway: PaymentGateway,
private inventoryService: InventoryService,
private messageQueue: MessageQueue,
) {
// Circuit breaker for payment gateway (can be flaky)
this.circuitBreaker = new CircuitBreaker({
failureThreshold: 5,
resetTimeoutMs: 30000, // 30 seconds
halfOpenSuccessThreshold: 3,
});
// Rate limit: 100 orders per minute per user
this.rateLimiter = new TokenBucketRateLimiter({
capacity: 100,
refillRate: 100 / 60, // ~1.67 per second
refillInterval: 1000,
});
// Idempotency: 24 hour TTL for orders
this.idempotencyService = new IdempotencyService({ ttlMs: 86400000 });
}
async createOrder(request: CreateOrderRequest): Promise<Order> {
const { idempotencyKey, userId, items, paymentMethod } = request;
// 1. Rate limiting (per user)
if (!this.rateLimiter.isAllowed(userId, 1)) {
throw new RateLimitError({
limit: 100,
window: "1 minute",
retryAfter: 60,
});
}
// 2. Idempotency check
return await this.idempotencyService.execute(
idempotencyKey,
async () => await this.processOrder(request),
{ ttlMs: 86400000, verifyParams: request },
);
}
private async processOrder(request: CreateOrderRequest): Promise<Order> {
const { userId, items, paymentMethod } = request;
try {
// 3. Reserve inventory (with fallback)
const inventory = await this.withFallback(
() => this.inventoryService.reserve(items),
async () => {
// Fallback: Check inventory optimistically
const available = await this.cacheService.getInventory();
return this.inventoryService.reserveOptimistically(items, available);
},
() => ({ reserved: true, optimistic: true }),
);
if (!inventory.reserved) {
throw new OutOfStockError();
}
// 4. Process payment (with circuit breaker + retry)
const payment = await this.circuitBreaker.execute(async () => {
return await retryWithBackoffAndJitter(
() =>
this.paymentGateway.charge({
amount: this.calculateTotal(items),
method: paymentMethod,
userId,
}),
{ maxRetries: 3, baseDelayMs: 100 },
);
});
// 5. Create order in database
const order = await this.database.orders.create({
userId,
items,
paymentId: payment.id,
status: "CONFIRMED",
createdAt: new Date(),
});
// 6. Cache for future reads
await this.cacheService.set(`order:${order.id}`, order, { ttl: 300 });
// 7. Publish order event (async, non-blocking)
this.messageQueue
.publish("orders.created", { orderId: order.id })
.catch((err) => {
console.error("Failed to publish order event:", err);
});
return order;
} catch (error) {
// Handle different error types
if (error instanceof CircuitBreakerOpenError) {
// Payment gateway is down, queue for later
await this.messageQueue.publish("orders.pending", { request });
return this.createPendingOrder(request, "PAYMENT_UNAVAILABLE");
}
if (error instanceof PaymentError) {
// Payment failed, don't retry
throw new OrderCreationError("Payment failed", { code: error.code });
}
if (error instanceof DatabaseError) {
// Database issue, try fallback
try {
return await this.fallbackService.createOrder(request);
} catch (fallbackError) {
// Last resort: Queue for processing
await this.messageQueue.publish("orders.pending", { request });
return this.createPendingOrder(request, "SYSTEM_UNAVAILABLE");
}
}
throw error;
}
}
private async withFallback<T>(
primary: () => Promise<T>,
fallback: () => Promise<T>,
defaultValue: () => T,
): Promise<T> {
try {
return await primary();
} catch (primaryError) {
console.error("Primary failed, trying fallback:", primaryError);
try {
return await fallback();
} catch (fallbackError) {
console.error("Fallback failed, using default:", fallbackError);
return defaultValue();
}
}
}
private createPendingOrder(
request: CreateOrderRequest,
reason: string,
): Order {
return {
id: generateId(),
userId: request.userId,
items: request.items,
status: "PENDING",
pendingReason: reason,
createdAt: new Date(),
};
}
}

Configuration Summary
| Pattern | Key Parameters | Typical Values |
|---|---|---|
| Retry | maxRetries, baseDelayMs | 3-5 retries, 100-500ms base |
| Circuit Breaker | failureThreshold, resetTimeoutMs | 5-10 failures, 30-60s timeout |
| Rate Limiting | capacity, refillRate | 100-1000 req, window varies |
| Idempotency | ttlMs | 1-48 hours (by use case) |
Monitoring Checklist
For a resilient system, monitor:
- ✓ Retry rate and success rate
- ✓ Circuit breaker state transitions
- ✓ Rate limit violations
- ✓ Idempotency key hit rate
- ✓ Fallback usage frequency
- ✓ End-to-end latency
- ✓ Error rates by type
Key Takeaways
- Failures are inevitable—design your system to handle them gracefully from the start
- Retry with backoff + jitter for transient failures, but never retry non-idempotent operations
- Circuit breakers prevent cascading failures by stopping calls to unhealthy services
- Rate limiting protects resources; choose the algorithm based on your precision vs. memory trade-off
- Graceful fallback means degraded service is better than no service—always have a plan B
- Idempotency keys make non-idempotent operations safe to retry, essential for distributed systems
- These patterns work together as layers of defense, not alternatives to each other
- Monitor everything—you can't improve what you don't measure
- Test failure paths—chaos engineering and failure injection reveal weak points
- Document your reliability strategy—your team needs to understand why these patterns exist
Remember: A system that fails gracefully is more reliable than one that never fails but goes down hard when it does.
