Advanced · 25 min read · Topic 11.1

ML system design fundamentals

Online vs offline ML, feature stores, training pipelines, model serving, A/B testing, monitoring

🤖 Key Takeaways

  1. ML systems have two pipelines: training (offline, batch) and serving (online, real-time)
  2. Feature stores centralize feature engineering: compute once, serve consistently to training and inference
  3. Model serving: batch (pre-compute), online (request-time), near-real-time (streaming)
  4. Monitoring ML models: data drift, concept drift, feature skew between training and serving
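To make the "compute once, serve consistently" idea concrete, here is a minimal in-memory sketch. The class `FeatureStore` and its methods (`register`, `materialize`, `get_online_features`, `get_training_rows`) are hypothetical names chosen for illustration and are not the API of any real feature-store product. A production store would also need point-in-time correct reads for training, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureStore:
    """Toy feature store (hypothetical API, for illustration only)."""

    # feature name -> function that computes the feature from a raw record
    _definitions: Dict[str, Callable[[dict], float]] = field(default_factory=dict)
    # entity id -> {feature name -> value}; stands in for an online key-value store
    _values: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        """Define a feature once, in one place."""
        self._definitions[name] = fn

    def materialize(self, entity_id: str, raw: dict) -> None:
        """Compute every registered feature once and store the result."""
        self._values[entity_id] = {
            name: fn(raw) for name, fn in self._definitions.items()
        }

    def get_online_features(self, entity_id: str, names: List[str]) -> List[float]:
        """Serving path: look up precomputed values for one entity."""
        row = self._values[entity_id]
        return [row[name] for name in names]

    def get_training_rows(self, entity_ids: List[str], names: List[str]) -> List[List[float]]:
        """Training path: read the same stored values in bulk."""
        return [self.get_online_features(e, names) for e in entity_ids]


store = FeatureStore()
store.register("order_count", lambda r: float(len(r["orders"])))
store.register(
    "avg_order_value",
    lambda r: sum(r["orders"]) / len(r["orders"]) if r["orders"] else 0.0,
)
store.materialize("user_1", {"orders": [20.0, 40.0]})
store.materialize("user_2", {"orders": []})

feature_names = ["order_count", "avg_order_value"]
print(store.get_training_rows(["user_1", "user_2"], feature_names))
print(store.get_online_features("user_1", feature_names))
```

Because both read paths go through the same stored values, the training set and the serving request for `user_1` see identical features by construction.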

ML in Production is a System Design Problem

Building the model is a small part of the work, often put at around 5%. The rest is system design: data pipelines, feature engineering, training infrastructure, model serving, monitoring, and retraining. ML system design interviews test this full picture, not just model accuracy.

Data Collection & Processing

Ingest raw data from databases, event streams, and external sources. Clean, validate, and transform it, then store the result in a data lake or feature store.
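The clean, validate, and transform step can be sketched as a small function that either returns a normalized record or rejects it. The schema (`user_id`, `amount`, `ts`) and the function names `validate` and `process` are assumptions made up for this example, not part of any specific pipeline.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical schema for an incoming event record.
REQUIRED_FIELDS = ("user_id", "amount", "ts")


def validate(record: dict) -> Optional[dict]:
    """Return a cleaned record, or None if the record fails validation."""
    # Reject records with missing required fields.
    if any(record.get(f) is None for f in REQUIRED_FIELDS):
        return None
    # Coerce types; reject values that cannot be parsed.
    try:
        amount = float(record["amount"])
        ts = int(record["ts"])
    except (TypeError, ValueError):
        return None
    # Simple range check: negative amounts are treated as invalid here.
    if amount < 0:
        return None
    return {"user_id": str(record["user_id"]).strip(), "amount": amount, "ts": ts}


def process(raw_records: Iterable[dict]) -> Tuple[List[dict], int]:
    """Split raw records into cleaned rows and a count of rejected rows."""
    clean: List[dict] = []
    rejected = 0
    for record in raw_records:
        cleaned = validate(record)
        if cleaned is None:
            rejected += 1
        else:
            clean.append(cleaned)
    return clean, rejected


raw = [
    {"user_id": " u1 ", "amount": "12.5", "ts": 1700000000},
    {"user_id": "u2", "amount": None, "ts": 1700000001},   # missing value
    {"user_id": "u3", "amount": "abc", "ts": 1700000002},  # unparseable
    {"user_id": "u4", "amount": -5, "ts": 1700000003},     # out of range
]
clean_rows, rejected_count = process(raw)
print(clean_rows, rejected_count)
```

Tracking the rejected count alongside the clean rows gives a cheap data-quality signal: a sudden jump in rejections usually points to an upstream change worth investigating.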

Advantages

  • Feature stores reduce training-serving skew by giving training and inference the same feature definitions
  • Monitoring catches model degradation early
  • MLOps practices bring software engineering rigor to ML
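One common way to monitor for data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in live traffic. The sketch below is a minimal standard-library version. The function name `psi`, the equal-width binning, the `eps` floor for empty bins, and the 0.2 alert threshold are all assumptions for illustration; thresholds between roughly 0.1 and 0.25 are often quoted as rules of thumb and should be tuned per feature.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to width 1.0
    # if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


DRIFT_THRESHOLD = 0.2  # assumed alert cutoff, not a universal constant

training_sample = [i / 100 for i in range(100)]       # spread over 0.00 to 0.99
live_sample = [0.5 + i / 200 for i in range(100)]     # shifted toward the upper half

score = psi(training_sample, live_sample)
if score > DRIFT_THRESHOLD:
    print(f"drift alert: PSI={score:.2f}")
```

PSI only detects changes in input distributions (data drift). Concept drift, where the relationship between inputs and labels changes, generally needs label-based checks such as tracking live accuracy once ground truth arrives.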

Disadvantages

  • ML infrastructure is complex and expensive
  • Data quality issues are a leading cause of ML failures
  • Retraining pipelines add operational overhead

🧪 Test Your Understanding

Knowledge Check 1/1

What is training-serving skew?