ML / AI Engineer

Transformers, LLM internals, RAG, MLOps, model serving, evaluation, agent architectures. NVIDIA-flavored loops add CUDA depth.

Primary categories: ml-ai · system-design

Curriculum (11 topics)

hard · 5h

Design Amazon Recommendation System

Candidate generation → ranking → re-ranking pipeline. Feature stores, A/B testing, real-time inference. Amazon signature SDI.

general · kafka · ml-ai · system-design
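
The candidate generation → ranking → re-ranking flow named in the recommendation topic above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Amazon's system: the function `recommend`, the co-purchase lookup, the precomputed `scores`, and the per-category cap used as the re-ranking rule are all hypothetical stand-ins for real retrieval models, learned rankers, and business rules.

```python
def recommend(user_history, co_purchases, scores, categories, k=3, max_per_category=1):
    """Toy three-stage recommender (all inputs are hypothetical in-memory dicts)."""
    # Stage 1: candidate generation. Cheap, high-recall lookup of co-purchased items.
    candidates = {c for item in user_history for c in co_purchases.get(item, [])}
    candidates -= set(user_history)  # do not recommend what the user already has

    # Stage 2: ranking. Order candidates by a (pretend) model score, ties by name.
    ranked = sorted(candidates, key=lambda c: (-scores.get(c, 0.0), c))

    # Stage 3: re-ranking. Enforce diversity with a cap per category.
    result, per_category = [], {}
    for c in ranked:
        cat = categories.get(c)
        if per_category.get(cat, 0) < max_per_category:
            result.append(c)
            per_category[cat] = per_category.get(cat, 0) + 1
        if len(result) == k:
            break
    return result
```

The staging matters because each stage trades recall for cost: the first stage narrows a large catalog cheaply, so the more expensive scoring and rule passes only see a small set.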

expert · 5h

Design Real-Time Autonomous-Vehicle Data Ingestion Pipeline

Vehicle → edge → cloud, sensor data compression, Kafka/Pulsar, hot/cold tiering, retraining loop.

cpp · kafka · ml-ai · system-design

expert · 5h

Design LLM Inference Serving (Triton + vLLM + PagedAttention)

Triton Inference Server, dynamic batching, KV-cache management, tensor parallelism, paged attention.

cuda · ml-ai · python · system-design
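
The KV-cache management and paged attention items in the LLM serving topic above can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: `PagedKVCache` and its method names are hypothetical, and it only tracks which fixed-size physical blocks each sequence owns, without storing any tensors.

```python
class PagedKVCache:
    """Toy block-table allocator mapping sequences to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids (logical order)
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must preempt or queue")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, which is the fragmentation argument usually made for paging the KV cache.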

hard · 6h

CUDA Shared Memory & Memory Coalescing

Many GPU kernels are bound by memory bandwidth rather than compute. Shared memory and coalesced global access are two of the highest-leverage CUDA optimizations. Required depth at NVIDIA and Tesla Autopilot.

cpp · cuda · general
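
To make the coalescing point in the topic above concrete, the sketch below counts how many 128-byte memory segments a 32-thread warp touches under different access patterns. It is a simplified model written in Python, not CUDA code: it assumes 4-byte elements, 128-byte-aligned segments, and one transaction per distinct segment, which ignores caching and sector-level details on real hardware. The function name `segments_touched` is hypothetical.

```python
WARP_SIZE = 32       # threads per warp
SEGMENT_BYTES = 128  # assumed size of one memory transaction
ELEM_BYTES = 4       # e.g. one float32 per thread

def segments_touched(stride, base=0):
    """Count distinct segments hit when thread t reads element base + t * stride."""
    addresses = [(base + t * stride) * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})

# Unit stride from an aligned base: the whole warp shares one segment.
coalesced = segments_touched(stride=1)
# Stride of 32 elements (e.g. walking a column of a row-major 32-wide matrix):
# every thread lands in its own segment.
strided = segments_touched(stride=32)
# Unit stride from a misaligned base straddles two segments.
misaligned = segments_touched(stride=1, base=1)
```

Under this model the strided pattern moves 32 times as many segments as the coalesced one for the same useful data, which is why staging a tile in shared memory and then reading it in the awkward order is a common fix.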

expert · 5h

Design Tesla Autopilot Data Pipeline (Auto-labeling + Retraining)

Auto-labeling, shadow mode, rare-event mining, training data curation, reprocessing.

cpp · cuda · ml-ai · system-design

expert · 6h

Design Distributed Training for 10B+ Param Model on 128 H100/B200

Tensor + pipeline + data parallelism, NCCL all-reduce, FP8/BF16, ZeRO/FSDP, checkpointing. NVIDIA signature SDI.

cpp · cuda · ml-ai · system-design

hard · 4h

Design Dynamic Home-Screen Recommendations

Per-profile row construction, ranking, freshness, real-time signal incorporation. 2025 Netflix-reported prompt.

general · ml-ai · system-design

hard · 4h

Design a GPU Cluster Scheduler (MIG + Gang)

Bin packing, MIG slicing, gang scheduling, K8s GPU operator, preemption, QoS. NVIDIA signature SDI.

cuda · kubernetes · system-design
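
The bin packing and gang scheduling items in the scheduler topic above combine into one rule: either every worker of a job gets GPUs, or none does. The sketch below shows that all-or-nothing behaviour with a first-fit-decreasing placement. It is a minimal illustration, not the Kubernetes scheduler or the NVIDIA GPU operator; `gang_schedule` and its dict-based cluster state are hypothetical, and MIG slicing, preemption, and topology are left out.

```python
def gang_schedule(free_gpus, job_workers):
    """Place all workers of a job or none.

    free_gpus:   dict of node name -> free GPU count (updated only on success)
    job_workers: list with the GPU count each worker needs
    Returns dict of worker index -> node name, or None if the gang does not fit.
    """
    remaining = dict(free_gpus)  # tentative copy so a failed attempt leaves no trace
    placement = {}
    # First-fit decreasing: place the largest workers first to reduce fragmentation.
    order = sorted(range(len(job_workers)), key=lambda i: -job_workers[i])
    for idx in order:
        need = job_workers[idx]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one worker cannot be placed, so the whole gang waits
        remaining[node] -= need
        placement[idx] = node
    free_gpus.update(remaining)  # commit the reservation atomically
    return placement
```

Working on a copy and committing at the end is what prevents a partially placed job from holding GPUs while it waits, the deadlock that gang scheduling exists to avoid.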

expert · 4h

Design Fault-Tolerant Distributed Training Pipeline

Checkpointing cadence, replica resurrection, NCCL recovery, elastic training. How long-running jobs survive node and link failures.

cuda · ml-ai · system-design
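
For the checkpointing cadence item in the topic above, a common starting point is Young's first-order approximation, which sets the interval between checkpoints to sqrt(2 × checkpoint cost × mean time between failures). The sketch below applies it, assuming independent node failures so that the cluster MTBF is the per-node MTBF divided by the node count. The function names and the example numbers are illustrative assumptions, not measurements.

```python
import math

def cluster_mtbf(node_mtbf_s, num_nodes):
    """Cluster-level MTBF, assuming independent, identically distributed node failures."""
    return node_mtbf_s / num_nodes

def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the compute time between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers only: 16 nodes, each failing on average every 160,000 s,
# and a checkpoint that takes 50 s to write.
mtbf = cluster_mtbf(node_mtbf_s=160_000, num_nodes=16)
interval = young_checkpoint_interval(checkpoint_cost_s=50, mtbf_s=mtbf)
```

The formula captures the trade-off directly: checkpointing more often wastes time writing state, checkpointing less often wastes more recomputation after each failure, and adding nodes shortens the optimal interval because the cluster fails more often.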

expert · 5h

Design Netflix Recommendation Engine

Offline candidate generation + online ranking, "row-of-rows" homepage, A/B testing infra, personalization signals.

general · ml-ai · system-design

medium · 4h

STAR & STAR-L Method for Behavioral Stories

Structured story format used by every Mag7 behavioral round. Google extends to STAR-L (Learnings). Amazon expects 1 LP per story. Netflix tunes to culture pillars.

behavioral · general