Intermediate → Advanced · 28 min read · Topic 9.3

Observability

Logs, metrics, traces, SLI/SLO/SLA, error budgets, on-call and incident response

🔭 Key Takeaways

  1. Three pillars: Logs (what happened), Metrics (what is the state), Traces (how requests flow across services)
  2. SLI (indicator) → SLO (objective) → SLA (agreement) → Error Budget (allowance for failure)
  3. Structured logging (JSON) + correlation IDs enable tracking requests across services
  4. Distributed tracing (Jaeger, OpenTelemetry) is essential for debugging microservice latency

Understanding System Behavior in Production

Observability is the ability to understand what's happening inside your system from its external outputs. In microservices with dozens of services, you can't SSH into a server and grep logs — you need structured, correlated, searchable telemetry.
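What "structured, correlated" telemetry looks like in practice: a minimal sketch (stdlib only) of JSON log records that carry a correlation ID, so records from different services can be joined on that ID in a log aggregator. The service name and field names are illustrative, not a standard schema.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A correlation ID is generated at the edge and forwarded on every
# downstream call (commonly via an X-Correlation-ID header).
cid = str(uuid.uuid4())
logger.info("order received", extra={"correlation_id": cid})
```

Because every line is one JSON object, a log backend can index the fields and answer queries like "show all records with this correlation_id" across services.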

The Three Pillars

Pillar    What                                   Tools                                       Question Answered
Logs      Timestamped events (structured JSON)   ELK Stack, Datadog Logs, CloudWatch         What happened?
Metrics   Numerical measurements over time       Prometheus + Grafana, Datadog, CloudWatch   What is the current state?
Traces    Request flow across services           Jaeger, Zipkin, AWS X-Ray, OpenTelemetry    Where is the bottleneck?
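To make the metrics row concrete, here is a sketch (plain Python, no Prometheus client) of turning raw latency samples into percentile metrics, the way a histogram-backed dashboard panel would. The sample values are made up; the last outlier shows why p99 matters more than the average.

```python
# Hypothetical latency samples (milliseconds) for one scrape interval.
latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 950]

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p99={p99}ms")  # p50=25ms p99=950ms
```

The median looks healthy while the 99th percentile exposes the slow tail, which is exactly the signal that leads you to traces to find *where* that latency is spent.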

SLI (Service Level Indicator)

A measurable metric of service health: request latency, error rate, availability. Example: "the proportion of requests completed in under 200 ms."
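That example SLI can be computed directly from request data. A minimal sketch with made-up request records, each a (latency_ms, succeeded) pair:

```python
# SLI: proportion of requests that succeeded in under 200 ms.
requests = [(120, True), (180, True), (250, True), (90, False), (150, True)]

def latency_sli(records, threshold_ms=200):
    """Fraction of requests that were both successful and fast enough."""
    good = sum(1 for latency, ok in records if ok and latency < threshold_ms)
    return good / len(records)

print(f"SLI = {latency_sli(requests):.0%}")  # SLI = 60%
```

An SLO then sets a target for this number (e.g. "SLI ≥ 99.9% over 30 days"), and an SLA attaches contractual consequences to missing it.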

Advantages

  • Structured observability drastically reduces MTTR
  • Error budgets balance reliability with velocity
  • OpenTelemetry provides vendor-neutral instrumentation
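The error-budget point above is simple arithmetic: whatever reliability the SLO does not demand is budget the team may spend on deploys, experiments, and incidents. A sketch for a 99.9% availability SLO over a 30-day window:

```python
# Error budget: the slice of the window the SLO permits to fail.
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

budget_minutes = (1 - SLO) * MINUTES_PER_MONTH
print(f"Monthly error budget: {budget_minutes:.1f} minutes of downtime")
```

With the budget exhausted, the usual policy is to freeze risky releases and spend engineering time on reliability until the window rolls over.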

Disadvantages

  • Observability infrastructure is expensive at scale
  • Too many alerts cause alert fatigue
  • Instrumentation requires upfront investment in every service

🧪 Test Your Understanding

Knowledge Check 1/1

What tool category helps debug latency across microservices?