Distributed systems are complex. External APIs fail. Databases become slow. Networks have blips. A system that breaks under any adverse condition is fragile. A resilient system expects failures and handles them gracefully.
I've built systems that crashed from a single slow API call, and I've built systems that kept running despite multiple failures. The difference is resilience patterns: circuit breakers, retries, timeouts, and graceful degradation.
The Circuit Breaker Pattern
Imagine an electrical circuit breaker: when current surges (fault), it trips and stops the flow. The circuit breaker pattern applies this concept to code.
When you call an external service and it fails repeatedly, stop calling it (trip the circuit). After a delay, try again cautiously (half-open state). If it succeeds, resume normal operation.
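class CircuitBreaker {
  constructor(fn, options = {}) {
    this.fn = fn
    this.failures = 0
    this.successes = 0 // consecutive successes while HALF_OPEN
    this.successThreshold = options.successThreshold || 2
    this.failureThreshold = options.failureThreshold || 5
    this.timeout = options.timeout || 60000 // ms to stay OPEN before probing
    this.state = 'CLOSED' // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now()
  }

  // Runs the wrapped function, or a function passed in for this call only
  async call(fn = this.fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN')
      }
      this.state = 'HALF_OPEN' // delay elapsed: let trial calls through
      this.successes = 0
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    if (this.state === 'HALF_OPEN') {
      this.successes++
      // Close only after successThreshold consecutive successes
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED'
      }
    }
  }

  onFailure() {
    this.failures++
    // Any failure while HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'
      this.nextAttempt = Date.now() + this.timeout
    }
  }
}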
// Usage
const paymentBreaker = new CircuitBreaker(
  () => paymentService.charge(amount),
  { failureThreshold: 5, successThreshold: 2, timeout: 30000 }
)
try {
  await paymentBreaker.call()
} catch (error) {
  logger.error("Payment service unavailable", { error })
}
Retry Strategies
Transient failures (temporary network hiccup, temporary database slowness) often resolve quickly. Retry.
Dumb retry: Try again immediately. Bad. If the service is slow, you make it slower.
Exponential backoff: Wait increasingly longer between retries. First retry after 1s, then 2s, then 4s.
Exponential backoff with jitter: Add randomness so you don't retry at the same time as other clients.
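const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// shouldRetry lets the caller skip retries for errors that cannot succeed
async function callWithRetry(fn, maxRetries = 3, shouldRetry = () => true) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      if (attempt === maxRetries || !shouldRetry(error)) throw error
      const baseDelay = Math.pow(2, attempt) * 1000 // 1s, 2s, 4s
      const jitter = Math.random() * baseDelay // up to 100% extra
      const delay = baseDelay + jitter
      logger.warn("Retry after failure", { attempt, delay, error })
      await sleep(delay)
    }
  }
}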
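// Usage: fetch resolves even on HTTP error statuses, so throw on a bad
// status to give callWithRetry a failure it can retry
const user = await callWithRetry(() =>
  fetch('https://api.example.com/user/123').then(r => {
    if (!r.ok) throw new Error(`HTTP ${r.status}`)
    return r.json()
  }),
  3
)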
Not all errors are retryable: a 404 (not found) or 401 (unauthorized) won't succeed on retry. Retry only failures that can plausibly clear up on their own: 5xx errors, network errors, and timeouts.
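The rule above can be written down as a small predicate. This is a minimal sketch rather than a definitive policy: HttpError and isRetryable are hypothetical names introduced for illustration, and it assumes the calling code throws an HttpError for non-2xx responses, since fetch itself only rejects on network failures and aborts.

```javascript
// Hypothetical error type that carries the HTTP status of a failed response.
class HttpError extends Error {
  constructor(status) {
    super(`HTTP ${status}`)
    this.name = 'HttpError'
    this.status = status
  }
}

// Error names produced when a request is aborted or times out.
const RETRYABLE_ERROR_NAMES = new Set(['AbortError', 'TimeoutError'])

// Decide whether a failed call is worth retrying.
function isRetryable(error) {
  if (error instanceof HttpError) {
    // 5xx: the server failed, so a later attempt may succeed.
    // 4xx such as 401 or 404: the request itself is the problem.
    return error.status >= 500 && error.status <= 599
  }
  if (RETRYABLE_ERROR_NAMES.has(error.name)) return true
  // fetch rejects with a TypeError when the network request itself fails.
  return error instanceof TypeError
}
```

A retry loop would call isRetryable(error) in its catch block and rethrow immediately when it returns false.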
Timeouts: Don't Wait Forever
If a service never responds, your request hangs forever. Set timeouts.
// Timeout after 5 seconds
// Note: this only stops waiting; the losing fetch keeps running in the background
const response = await Promise.race([
  fetch('https://api.example.com/endpoint'),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 5000)
  ),
])
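// Or with native AbortController, which also cancels the request
const url = 'https://api.example.com/endpoint'
const controller = new AbortController()
const timeoutId = setTimeout(() => controller.abort(), 5000)
try {
  const response = await fetch(url, { signal: controller.signal })
  // ... use response here
} catch (error) {
  if (error.name === 'AbortError') {
    logger.error("Request timeout")
  } else {
    throw error // not a timeout: let the caller handle it
  }
} finally {
  clearTimeout(timeoutId) // don't leave the timer pending once the request settles
}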
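If timeouts are needed in many places, the AbortController pattern can be wrapped once and reused. The helper below is a sketch under assumptions: withTimeout is a hypothetical name, and it only cancels work that actually honors the signal it is handed (fetch does when the signal is passed in its options).

```javascript
// Run fn with an AbortSignal that fires after `ms` milliseconds.
// fn must pass the signal on (for example to fetch) for the work to stop.
async function withTimeout(fn, ms) {
  const controller = new AbortController()
  const timeoutId = setTimeout(() => controller.abort(), ms)
  try {
    return await fn(controller.signal)
  } finally {
    // Always clear the timer so it cannot fire after the call has settled.
    clearTimeout(timeoutId)
  }
}

// Example call shape (the URL is a placeholder):
//   const response = await withTimeout(
//     (signal) => fetch('https://api.example.com/endpoint', { signal }),
//     5000
//   )
```

Because the helper takes a function rather than a promise, it can create a fresh signal for every attempt when it is combined with a retry loop.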
Graceful Degradation
When something fails, what can you still do?
- Recommendations service is down? Show most-viewed items instead of personalized
- Profile picture service is down? Show a placeholder avatar
- Search is slow? Show recent items while search completes
Graceful degradation means the service keeps working, just with reduced functionality. Better than a broken site.
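In code, each of the cases above has the same shape: try the full-featured path, and on failure serve something cheaper. A minimal sketch, assuming a hypothetical withFallback helper; the service calls in the commented example are placeholders, not real APIs.

```javascript
// Try the primary source; if it fails, serve the fallback instead.
async function withFallback(primary, fallback, onError = () => {}) {
  try {
    return await primary()
  } catch (error) {
    onError(error) // report the failure, but keep it away from the user
    return await fallback()
  }
}

// Example call shape (placeholder service objects):
//   const items = await withFallback(
//     () => recommendations.forUser(userId), // personalized, may be down
//     () => catalog.mostViewed(10),          // generic, cheap to serve
//     (error) => logger.warn("Recommendations unavailable", { error })
//   )
```

The fallback should be something that is unlikely to fail for the same reason as the primary: a cache, a static default, or a simpler query.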
Putting It Together
A resilient API call might look like:
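async function getUserProfile(userId) {
  try {
    // Circuit breaker + timeout + retry
    // circuitBreaker: a shared CircuitBreaker whose call() runs the function it is given
    return await circuitBreaker.call(async () => {
      return await callWithRetry(async () => {
        // fetch has no timeout option; AbortSignal.timeout aborts after 5s
        const response = await fetch(`/api/users/${userId}`, {
          signal: AbortSignal.timeout(5000),
        })
        if (!response.ok) throw new Error(`HTTP ${response.status}`)
        return response.json()
      }, 2)
    })
  } catch (error) {
    // Service is really down, degrade gracefully
    logger.error("Unable to fetch profile", { userId, error })
    return getCachedProfile(userId) || getDefaultProfile(userId)
  }
}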
Resilience patterns let you build systems that keep working when their dependencies fail. For production systems they're not optional; they're essential.