Intermediate → Advanced · 28 min read · Topic 9.3

Observability

Logs, metrics, traces, SLI/SLO/SLA, error budgets, on-call and incident response

🔭 Key Takeaways

  1. Three pillars: Logs (what happened), Metrics (what is the state), Traces (how requests flow across services)
  2. SLI (indicator) → SLO (objective) → SLA (agreement) → Error Budget (allowance for failure)
  3. Structured logging (JSON) + correlation IDs enable tracking requests across services
  4. Distributed tracing (Jaeger, OpenTelemetry) is essential for debugging microservice latency

Understanding System Behavior in Production

Observability is the ability to understand what's happening inside your system from its external outputs. In microservices with dozens of services, you can't SSH into a server and grep logs — you need structured, correlated, searchable telemetry.
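What "structured, correlated" telemetry looks like in practice: a minimal sketch (stdlib only) of JSON log records that carry a correlation ID, so records from different services can be joined on that ID in a log aggregator. The service name and field names are illustrative, not a standard schema.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A correlation ID is generated at the edge and forwarded on every
# downstream call (commonly via an X-Correlation-ID header).
cid = str(uuid.uuid4())
logger.info("order received", extra={"correlation_id": cid})
```

Because every line is one JSON object, a log backend can index the fields and answer queries like "show all records with this correlation_id" across services.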

The Three Pillars

Pillar    What                                   Tools                                       Question Answered
Logs      Timestamped events (structured JSON)   ELK Stack, Datadog Logs, CloudWatch         What happened?
Metrics   Numerical measurements over time       Prometheus + Grafana, Datadog, CloudWatch   What is the current state?
Traces    Request flow across services           Jaeger, Zipkin, AWS X-Ray, OpenTelemetry    Where is the bottleneck?
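To make the metrics row concrete, here is a sketch (plain Python, no Prometheus client) of turning raw latency samples into percentile metrics, the way a histogram-backed dashboard panel would. The sample values are made up; the last outlier shows why p99 matters more than the average.

```python
# Hypothetical latency samples (milliseconds) for one scrape interval.
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 950]

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p99={p99}ms")  # p50=25ms p99=950ms
```

The median looks healthy while the 99th percentile exposes the slow tail, which is exactly the signal that leads you to traces to find *where* that latency is spent.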

SLI (Service Level Indicator)

A measurable metric of service health: request latency, error rate, availability. Example: "the proportion of requests completed in under 200 ms."
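That example SLI can be computed directly from request data. A minimal sketch with made-up request records, each a (latency_ms, succeeded) pair:

```python
# SLI: proportion of requests that succeeded in under 200 ms.
requests = [(120, True), (180, True), (250, True), (90, False), (150, True)]

def latency_sli(records, threshold_ms=200):
    """Fraction of requests that were both successful and fast enough."""
    good = sum(1 for latency, ok in records if ok and latency < threshold_ms)
    return good / len(records)

print(f"SLI = {latency_sli(requests):.0%}")  # SLI = 60%
```

An SLO then sets a target for this number (e.g. "SLI ≥ 99.9% over 30 days"), and an SLA attaches contractual consequences to missing it.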

Advantages

  • Structured observability drastically reduces MTTR
  • Error budgets balance reliability with velocity
  • OpenTelemetry provides vendor-neutral instrumentation
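The error-budget point above is simple arithmetic: whatever reliability the SLO does not demand is budget the team may spend on deploys, experiments, and incidents. A sketch for a 99.9% availability SLO over a 30-day window:

```python
# Error budget: the slice of the window the SLO permits to fail.
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = (1 - SLO) * MINUTES_PER_MONTH
print(f"Monthly error budget: {budget_minutes:.1f} minutes of downtime")
```

With the budget exhausted, the usual policy is to freeze risky releases and spend engineering time on reliability until the window rolls over.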

Disadvantages

  • Observability infrastructure is expensive at scale
  • Too many alerts cause alert fatigue
  • Instrumentation requires upfront investment in every service

🧪 Test Your Understanding

Knowledge Check 1/1

What tool category helps debug latency across microservices?