Learn patterns for building resilient systems that fail gracefully. Understand circuit breakers, retry strategies, and graceful degradation.

Distributed systems are complex. External APIs fail. Databases become slow. Networks have blips. A system that breaks under any adverse condition is fragile. A resilient system expects failures and handles them gracefully.

I've built systems that crashed from a single slow API call, and I've built systems that kept running despite multiple failures. The difference is resilience patterns: circuit breakers, retries, timeouts, and graceful degradation.

The Circuit Breaker Pattern

Imagine an electrical circuit breaker: when current surges (fault), it trips and stops the flow. The circuit breaker pattern applies this concept to code.

When calls to an external service fail repeatedly, stop calling it (the circuit trips open) and fail fast instead. After a delay, let a trial call through (the half-open state). If it succeeds, close the circuit and resume normal operation; if it fails, open the circuit again and wait.

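class CircuitBreaker {
  constructor(fn, options = {}) {
    this.fn = fn
    this.failures = 0 // consecutive failures
    this.successes = 0 // consecutive successful trial calls while HALF_OPEN
    this.successThreshold = options.successThreshold ?? 2
    this.failureThreshold = options.failureThreshold ?? 5
    this.timeout = options.timeout ?? 60000 // ms to stay OPEN before a trial call
    this.state = 'CLOSED' // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now()
  }

  // An optional fn overrides the wrapped function for this one call,
  // so a single breaker can guard different requests to the same service
  async call(fn = this.fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        // Fail fast without touching the struggling service
        throw new Error('Circuit breaker is OPEN')
      }
      this.state = 'HALF_OPEN'
      this.successes = 0
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  onSuccess() {
    this.failures = 0
    if (this.state === 'HALF_OPEN') {
      this.successes++
      // Close only after enough consecutive trial calls succeed
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED'
      }
    }
  }

  onFailure() {
    this.failures++
    // A failed trial call reopens immediately; otherwise wait for the threshold
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'
      this.nextAttempt = Date.now() + this.timeout
    }
  }
}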

// Usage
const paymentBreaker = new CircuitBreaker(
  () => paymentService.charge(amount),
  { failureThreshold: 5, successThreshold: 2, timeout: 30000 }
)

try {
  await paymentBreaker.call()
} catch (error) {
  logger.error("Payment service unavailable", { error })
}
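
To recap the three states, the transitions can be written down as a small lookup table. This is only an illustrative sketch: the event names (failureThresholdReached, timeoutElapsed, probeSucceeded, probeFailed) are labels made up for this example and do not appear in the class above.

```javascript
// Allowed transitions, keyed by current state and then by event.
// Event names are hypothetical labels chosen for this sketch.
const TRANSITIONS = {
  CLOSED: { failureThresholdReached: 'OPEN' },
  OPEN: { timeoutElapsed: 'HALF_OPEN' },
  HALF_OPEN: { probeSucceeded: 'CLOSED', probeFailed: 'OPEN' },
}

// Return the next state, or the current state if the event does not apply.
function nextState(state, event) {
  const transitions = TRANSITIONS[state]
  if (!transitions) throw new Error(`Unknown state: ${state}`)
  return transitions[event] ?? state
}

// Walk through one outage: trip, failed trial call, then recovery.
const events = [
  'failureThresholdReached',
  'timeoutElapsed',
  'probeFailed',
  'timeoutElapsed',
  'probeSucceeded',
]
let state = 'CLOSED'
const trace = [state]
for (const event of events) {
  state = nextState(state, event)
  trace.push(state)
}
console.log(trace.join(' -> '))
// CLOSED -> OPEN -> HALF_OPEN -> OPEN -> HALF_OPEN -> CLOSED
```

Note that the only way out of OPEN is the timer, and the only way back to CLOSED is through a successful trial call in HALF_OPEN.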

Retry Strategies

Transient failures (temporary network hiccup, temporary database slowness) often resolve quickly. Retry.

Dumb retry: Try again immediately. Bad. If the service is slow, you make it slower.

Exponential backoff: Wait increasingly longer between retries. First retry after 1s, then 2s, then 4s.

Exponential backoff with jitter: Add randomness so you don't retry at the same time as other clients.

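// Resolve after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      // Out of retries: surface the last error to the caller
      if (attempt === maxRetries) throw error

      const baseDelay = Math.pow(2, attempt) * 1000 // 1s, 2s, 4s
      const jitter = Math.random() * baseDelay // up to 100% extra, so clients spread out
      const delay = baseDelay + jitter

      logger.warn("Retry after failure", { attempt, delay, error })
      await sleep(delay)
    }
  }
}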

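// Usage
// fetch only rejects on network errors, so throw on HTTP error statuses
// to make them visible to the retry loop
const user = await callWithRetry(async () => {
  const r = await fetch('https://api.example.com/user/123')
  if (!r.ok) throw new Error(`HTTP ${r.status}`)
  return r.json()
}, 3)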

Not all errors are retryable: a 404 (not found) or a 401 (unauthorized) won't succeed on retry, because the request itself is the problem. Only retry 5xx errors and network timeouts.
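
One way to encode that rule is a small predicate the retry loop can consult. The sketch below is an assumption-laden example, not a definitive implementation: HttpError, isRetryable, and fetchJson are hypothetical names, and treating every TypeError from fetch as a network failure is a simplification (fetch also throws TypeError for things like a malformed URL).

```javascript
// Hypothetical error type that carries the HTTP status; fetch does not provide this.
class HttpError extends Error {
  constructor(status, message = `HTTP ${status}`) {
    super(message)
    this.name = 'HttpError'
    this.status = status
  }
}

// Decide whether a failure is worth retrying.
function isRetryable(error) {
  if (error instanceof HttpError) {
    // Server-side failures may be transient; 4xx means the request is wrong.
    return error.status >= 500 && error.status <= 599
  }
  // fetch rejects with TypeError on network failure. An aborted or timed-out
  // request rejects with an error named AbortError or TimeoutError.
  return (
    error instanceof TypeError ||
    error.name === 'TimeoutError' ||
    error.name === 'AbortError'
  )
}

// fetch resolves even on 4xx/5xx, so turn non-2xx responses into thrown errors.
async function fetchJson(url, options) {
  const response = await fetch(url, options)
  if (!response.ok) throw new HttpError(response.status)
  return response.json()
}
```

With this in place, callWithRetry could give up early by changing its check to `if (attempt === maxRetries || !isRetryable(error)) throw error`, and callers would pass `() => fetchJson(url)` as the function to retry.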

Timeouts: Don't Wait Forever

If a service never responds, your request hangs forever. Set timeouts.

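// Timeout after 5 seconds
// Note: Promise.race stops waiting, but the underlying request keeps running
let timer
const response = await Promise.race([
  fetch('https://api.example.com/endpoint'),
  new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), 5000)
  }),
]).finally(() => clearTimeout(timer)) // don't leave the timer pending

// Or with native AbortController, which also cancels the request
const url = 'https://api.example.com/endpoint'
const controller = new AbortController()
const timeoutId = setTimeout(() => controller.abort(), 5000)

try {
  const response = await fetch(url, { signal: controller.signal })
  // ...use response here
} catch (error) {
  if (error.name === 'AbortError') {
    logger.error("Request timeout")
  } else {
    throw error // not a timeout: let the caller handle it
  }
} finally {
  clearTimeout(timeoutId)
}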

Graceful Degradation

When something fails, what can you still do?

Recommendations service is down? Show most-viewed items instead of personalized ones.
Profile picture service is down? Show a placeholder avatar.
Search is slow? Show recent items while the search completes.

Graceful degradation means the service keeps working, just with reduced functionality. Better than a broken site.
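
In code, the recommendations example above might boil down to a try-the-primary, fall-back-on-failure wrapper. This is a minimal sketch: withFallback, getPersonalizedItems, getMostViewedItems, and getHomepageItems are hypothetical names, and the two data functions are stubs standing in for real service calls.

```javascript
// Try the primary source; on any failure, log it and serve the fallback instead.
async function withFallback(primary, fallback, label = 'operation') {
  try {
    return await primary()
  } catch (error) {
    console.warn(`${label} failed, serving fallback: ${error.message}`)
    return fallback()
  }
}

// Stub: simulates the personalized recommendations service being down.
async function getPersonalizedItems(userId) {
  throw new Error('recommendations service is down')
}

// Stub: a cheaper, non-personalized list that is assumed to be available.
async function getMostViewedItems() {
  return ['item-1', 'item-2', 'item-3']
}

// The page still renders a list; it is just less tailored to the user.
async function getHomepageItems(userId) {
  return withFallback(
    () => getPersonalizedItems(userId),
    () => getMostViewedItems(),
    'recommendations'
  )
}
```

The fallback should be cheaper and more reliable than the primary path (a cache, a static list, a placeholder), otherwise the degraded mode can fail for the same reason the primary did.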

Putting It Together

A resilient API call might look like:

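async function getUserProfile(userId) {
  try {
    // Circuit breaker + retry + timeout
    return await circuitBreaker.call(() =>
      callWithRetry(async () => {
        // fetch has no `timeout` option; AbortSignal.timeout() aborts the
        // request after 5 seconds (available in current browsers and Node.js)
        const response = await fetch(`/api/users/${userId}`, {
          signal: AbortSignal.timeout(5000),
        })
        // fetch resolves on 4xx/5xx, so throw to make the failure visible
        if (!response.ok) throw new Error(`HTTP ${response.status}`)
        return response.json()
      }, 2)
    )
  } catch (error) {
    // Service is really down, degrade gracefully
    logger.error("Unable to fetch profile", { userId, error })
    return getCachedProfile(userId) || getDefaultProfile(userId)
  }
}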

Resilience patterns let you build systems that keep working when their dependencies don't. They're not optional for production systems; they're essential.
