DELIGHT ARTISAN


Observability in Production: Logging, Tracing, and Metrics That Actually Help
16 Feb 2026


Build comprehensive observability into your applications. Learn how to instrument code for logging, tracing, and metrics that solve real production problems.

Your application is running in production right now. A user is frustrated because a page loads slowly. An error is happening for some customers but not others. A database query is consuming 80% of CPU. Without proper observability, you're flying blind — investigating issues takes hours instead of minutes.

Observability is the ability to understand what's happening inside your system by examining its external outputs. It's built on three pillars: logging (discrete events), tracing (request flow), and metrics (aggregated data).

Logging: Events That Tell Stories

Logs are records of events: "User logged in", "Database query took 200ms", "Payment processing failed". Good logs help you reconstruct what happened.

Best practices for logging:

Use structured logging (JSON) not unstructured text. Structured logs are queryable and parseable:

// Bad (unstructured)
logger.info("User 123 logged in from 192.168.1.1")

// Good (structured)
logger.info("user_login", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: new Date(),
})

Log levels matter: DEBUG (detailed for development), INFO (important events), WARN (something unexpected but not an error), ERROR (errors that might need action), FATAL (system is broken).

Log important events at INFO level. Log errors at ERROR level. Don't spam DEBUG logs — they slow things down.
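To make the level rules concrete, here is a minimal structured-logger sketch with a configurable threshold — a hypothetical helper for illustration, not any specific library's API. In production the same idea lets you ship with the threshold at INFO so DEBUG calls cost almost nothing:

```javascript
// Minimal structured logger with level filtering (illustrative sketch)
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40, fatal: 50 }

function createLogger(minLevel = 'info') {
  const threshold = LEVELS[minLevel]
  const log = (level, event, fields = {}) => {
    // Below-threshold entries (e.g. DEBUG in production) are dropped cheaply
    if (LEVELS[level] < threshold) return null
    const entry = { level, event, time: new Date().toISOString(), ...fields }
    console.log(JSON.stringify(entry))
    return entry
  }
  return {
    debug: (e, f) => log('debug', e, f),
    info: (e, f) => log('info', e, f),
    warn: (e, f) => log('warn', e, f),
    error: (e, f) => log('error', e, f),
    fatal: (e, f) => log('fatal', e, f),
  }
}

const logger = createLogger('info')
logger.debug('cache_lookup', { key: 'user:123' }) // suppressed
logger.info('user_login', { userId: 123 })        // emitted as JSON
```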

Include context: Every log should include request ID, user ID, or other context. This lets you trace a request through the system.

// Request middleware
app.use((req, res, next) => {
  req.id = generateId()
  logger.info("request_start", {
    requestId: req.id,
    method: req.method,
    path: req.path,
    userId: req.user?.id,
  })
  next()
})

// In your code
logger.info("database_query", {
  requestId: req.id,
  query: "SELECT * FROM users",
  duration: 145, // ms
})

Centralize logs: Don't write logs to files on individual servers. Send them to a centralized service (Datadog, New Relic, or an ELK stack). This lets you search across all servers and correlate events.
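Under the hood, most log shippers buffer entries and forward them in batches rather than making one network call per log line. A rough sketch of that pattern — the collector URL, batch size, and flush interval here are illustrative assumptions, and real agents add compression, backpressure, and durable retry:

```javascript
// Sketch of a batching log shipper (illustrative; URL and sizes are assumptions)
function createShipper({ url, batchSize = 50, flushMs = 5000 }) {
  let buffer = []

  const flush = async () => {
    if (buffer.length === 0) return
    const batch = buffer
    buffer = []
    await fetch(url, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(batch),
    }).catch(() => {
      // On failure, put the batch back so the next flush retries it
      buffer = batch.concat(buffer)
    })
  }

  // Periodic flush; unref so the timer doesn't keep the process alive
  setInterval(flush, flushMs).unref?.()

  return {
    push(entry) {
      buffer.push(entry)
      if (buffer.length >= batchSize) flush()
    },
    pending: () => buffer.length,
  }
}

const shipper = createShipper({ url: 'https://logs.example.com/ingest' })
shipper.push({ level: 'info', event: 'user_login', userId: 123 })
```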

Distributed Tracing: Following the Request

A user clicks a button. The web server processes the request, calls the API server, which queries the database and calls an external service. If something is slow, where's the bottleneck?

Distributed tracing follows requests across services. Each component adds timestamps. At the end, you see a waterfall of where time was spent.

Popular tools include Jaeger, Zipkin, Datadog APM, and New Relic. A minimal manual implementation:

// Pass trace ID through systems
const traceId = req.headers['x-trace-id'] || generateId()

// Forward to downstream services
const response = await fetch('http://api-server/endpoint', {
  headers: {
    'x-trace-id': traceId,
  },
})

// Other services receive and propagate the trace ID
app.use((req, res, next) => {
  // Fall back to a fresh ID if an upstream service didn't send one
  req.traceId = req.headers['x-trace-id'] || generateId()

  // When calling other services, pass it along
  next()
})

With tracing, you can answer: "Why did request #12345 take 5 seconds?" You see database took 200ms, payment service took 4.5s, everything else was < 100ms. The bottleneck is clear.
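The waterfall comes from each service recording timed spans against the shared trace ID. A hand-rolled sketch of that idea (span and operation names are illustrative; in practice you would export spans to Jaeger or Zipkin rather than log them):

```javascript
// Sketch: manual span timing keyed by trace ID (illustrative, not a real tracer API)
function startSpan(traceId, name) {
  const start = process.hrtime.bigint()
  return {
    end() {
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6
      const span = { traceId, name, durationMs }
      // Real tracers export spans to a collector; here we just log them
      console.log(JSON.stringify(span))
      return span
    },
  }
}

// Usage: wrap each downstream call so it shows up in the waterfall
// const span = startSpan(req.traceId, 'db.users.select')
// const rows = await db.query('SELECT * FROM users')
// span.end()
```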

Metrics: Aggregated Data

Metrics are numbers: request count, response time, error rate, database connections, CPU usage. Unlike logs (discrete), metrics are aggregated — "99th percentile response time over the last 5 minutes."

Key metrics to track:

  • Request count (by endpoint, status code, method)
  • Response time (median, p95, p99)
  • Error rate
  • Database query time (p95, p99)
  • Cache hit rate
  • Queue depth
  • Resource usage (CPU, memory, disk, network)

// Using a metrics library (e.g., prom-client for Prometheus)
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5],
})

app.use((req, res, next) => {
  const start = Date.now()

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000
    // req.route is only set once a route has matched; fall back to the raw path
    httpDuration
      .labels(req.method, req.route?.path || req.path, String(res.statusCode))
      .observe(duration)
  })

  next()
})
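To see what "p95" and "p99" actually compute, here is a plain-JavaScript percentile sketch over a window of raw samples. The sample durations are made up for illustration; real systems use histograms (as in the Prometheus example above) instead of storing every raw value:

```javascript
// Sketch: percentile over raw samples, using the nearest-rank method
function percentile(samples, p) {
  if (samples.length === 0) return NaN
  const sorted = [...samples].sort((a, b) => a - b)
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)
  return sorted[idx]
}

const durations = [120, 95, 200, 4500, 110, 130, 105, 98, 115, 140] // ms
console.log(percentile(durations, 50)) // → 115 (median: typical request is fine)
console.log(percentile(durations, 95)) // → 4500 (tail: one request is very slow)
```

Note how a single 4.5-second outlier is invisible in the median but dominates the p95 — which is exactly why tail percentiles, not averages, belong on your dashboards.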

Alerting: When to Wake You Up

Metrics are useless if you don't act on them. Set up alerts for conditions that require human intervention.

Good alerts: Error rate > 1%, p99 response time > 5 seconds, database connections near max

Bad alerts: CPU usage > 50% (normal), request count changed (not actionable), cache hit rate < 95% (cosmetic)

Alert fatigue (too many alerts) makes developers ignore alerts. Alert only on things that need human attention.
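The good-versus-bad distinction above can be expressed as code. A sketch of an alert check over a metrics window — the thresholds mirror the examples given, and the field names are illustrative assumptions:

```javascript
// Sketch: evaluate actionable alert conditions over a metrics window
function checkAlerts({ errorRate, p99Ms, dbConnections, dbConnectionMax }) {
  const alerts = []
  if (errorRate > 0.01) alerts.push('error_rate_above_1_percent')
  if (p99Ms > 5000) alerts.push('p99_latency_above_5s')
  if (dbConnections / dbConnectionMax > 0.9) alerts.push('db_connections_near_max')
  // Deliberately no rule for CPU > 50% or raw request count: not actionable
  return alerts
}
```

Each rule maps to a concrete human action (roll back, scale the pool, page on-call), which is the test every alert should pass.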

The Three Pillars Together

Logs tell you what happened. Metrics tell you how often it happened and at what scale. Tracing tells you where the time went. Together, they give you complete visibility.

Invest in observability early. When production breaks at 3 AM, you'll be grateful for good logs, traces, and metrics. They're the difference between a 5-minute fix and a 5-hour debugging session.
