
Design a Real-Time Autonomous-Vehicle Data Ingestion Pipeline

Vehicle → edge → cloud, sensor data compression, Kafka/Pulsar, hot/cold tiering, retraining loop.


Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

A single AV generates terabytes of sensor data per day, more than a cellular uplink can realistically sustain, so you cannot upload it all. Instead, edge filtering selects rare or important events, selective upload streams them to the cloud, a data lake stores the resulting petabytes, auto-labeling turns them into training data, and the retraining loop closes the cycle.
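The upload constraint above can be checked with back-of-the-envelope arithmetic. The sketch below is only illustrative: the 4 TB/day raw volume and the 20 Mbit/s sustained uplink are assumed numbers, not measurements from any fleet, and the 1000x factor stands in for the reduction achieved by edge filtering.

```python
# All figures here are illustrative assumptions, not measurements.
TB = 10**12  # bytes in a (decimal) terabyte


def upload_hours(bytes_to_send: float, uplink_mbit_per_s: float) -> float:
    """Hours needed to push `bytes_to_send` through a link of the given speed."""
    bits = bytes_to_send * 8
    seconds = bits / (uplink_mbit_per_s * 1_000_000)
    return seconds / 3600


raw_per_day = 4 * TB   # assumed raw sensor output per vehicle per day
uplink = 20.0          # assumed sustained cellular uplink in Mbit/s
reduction = 1000       # assumed data reduction from edge filtering

raw_hours = upload_hours(raw_per_day, uplink)
filtered_hours = upload_hours(raw_per_day / reduction, uplink)

print(f"raw upload:      {raw_hours:.1f} h per day of driving")
print(f"filtered upload: {filtered_hours * 60:.1f} min per day of driving")
```

Under these assumptions one day of raw data would take about 444 hours to upload, while the filtered data fits in under half an hour, which is why selection has to happen on the vehicle.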

Edge: per-vehicle compute filters the sensor stream (camera, LiDAR, radar) and keeps events where the model disagrees with itself, hard scenarios, and rare classes. Kept events are compressed and uploaded over cellular when bandwidth is available. Cloud ingest: a Kafka or Pulsar stream receives the uploads and lands them in an object lake (S3-class storage), with a metadata DB alongside. Auto-labeling: a pipeline runs heavy models offline (8x larger than the on-vehicle model) to produce ground truth. Retraining loop: filter the labeled data → train the next model → deploy via OTA → repeat.
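One way the edge stage could be sketched is a disagreement score feeding a priority queue that is drained whenever a cellular window opens. Everything named here is hypothetical: the total-variation score, the `RARE_CLASSES` set, the bonus and threshold values, and the `UploadQueue` class are assumptions chosen for illustration, not the design of any particular vehicle stack.

```python
import heapq
from dataclasses import dataclass, field

RARE_CLASSES = {"animal", "debris"}  # hypothetical rare-class list
RARE_BONUS = 0.5                     # hypothetical score boost for rare classes
UPLOAD_THRESHOLD = 0.3               # hypothetical minimum score to keep a frame


def disagreement(primary: dict, shadow: dict) -> float:
    """Total variation distance between two class-probability maps (0 to 1)."""
    classes = set(primary) | set(shadow)
    return 0.5 * sum(abs(primary.get(c, 0.0) - shadow.get(c, 0.0)) for c in classes)


def score_frame(primary: dict, shadow: dict) -> float:
    """Disagreement score, boosted when the shadow model's top class is rare."""
    score = disagreement(primary, shadow)
    if max(shadow, key=shadow.get) in RARE_CLASSES:
        score += RARE_BONUS
    return score


@dataclass(order=True)
class PendingEvent:
    neg_score: float                       # negated: heapq pops the smallest item
    frame_id: int = field(compare=False)
    size_bytes: int = field(compare=False)


class UploadQueue:
    """Keeps high-scoring events and releases them within a byte budget."""

    def __init__(self) -> None:
        self._heap: list[PendingEvent] = []

    def offer(self, frame_id: int, size_bytes: int, score: float) -> bool:
        if score < UPLOAD_THRESHOLD:
            return False  # uninteresting frame: drop it at the edge
        heapq.heappush(self._heap, PendingEvent(-score, frame_id, size_bytes))
        return True

    def drain(self, budget_bytes: int) -> list[int]:
        """Pop events in strict priority order until the next one exceeds the budget."""
        sent = []
        while self._heap and self._heap[0].size_bytes <= budget_bytes:
            event = heapq.heappop(self._heap)
            budget_bytes -= event.size_bytes
            sent.append(event.frame_id)
        return sent


queue = UploadQueue()
# Frame 1: the two models nearly agree, so it is dropped.
queue.offer(1, 2_000_000, score_frame({"car": 0.9, "truck": 0.1}, {"car": 0.88, "truck": 0.12}))
# Frame 2: strong disagreement on a rare class.
queue.offer(2, 3_000_000, score_frame({"car": 0.7, "animal": 0.3}, {"car": 0.2, "animal": 0.8}))
# Frame 3: moderate disagreement on common classes.
queue.offer(3, 4_000_000, score_frame({"car": 0.8, "truck": 0.2}, {"car": 0.4, "truck": 0.6}))

first_window = queue.drain(budget_bytes=5_000_000)    # small cellular window
second_window = queue.drain(budget_bytes=10_000_000)  # larger window later
print(first_window, second_window)
```

The queue deliberately stops at the first event that does not fit rather than skipping ahead, so a large high-priority event is never starved by smaller low-priority ones. A real system might prefer a different policy.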

When to use

AV fleets, drone fleets, robotics fleets.

When not to

Single-vehicle hobby projects (no fleet scale).

flowchart LR
  Sensor[Camera + LiDAR + Radar] --> Edge[On-Vehicle Filter]
  Edge -->|interesting events| Upload[Selective Upload]
  Upload --> Ingest[Cloud Ingest · Kafka]
  Ingest --> Lake[(Data Lake · S3 / GCS)]
  Ingest --> Meta[(Metadata DB)]
  Lake --> Label[Auto-Labeling]
  Label --> Train[Training]
  Train --> Model[(Model Registry)]
  Model -->|OTA| Edge

Key insights

  • Edge filtering is where the ~1000x reduction in data volume happens. It comes from selection rather than codec compression: upload only what teaches the model something new.
  • Disagreement-triggered upload: keep events where the on-vehicle model differs from a larger shadow model on the same frame.
  • Auto-labeling runs heavier models offline, which is cheaper than human annotation at scale.
  • Retraining cadence must be balanced against deployment risk; new models go out via staged OTA rollout.
  • Metadata indexes (timestamp, location, weather, events) enable querying rare scenarios.
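The metadata index in the last bullet can be illustrated with a small relational sketch. This is a minimal example under stated assumptions: SQLite stands in for whatever metadata DB the pipeline uses, and the `events` schema, the `find_scenarios` helper, the sample rows, and the `s3://example-av-lake/...` URIs are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the metadata DB
conn.execute(
    """
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        vehicle_id TEXT NOT NULL,
        ts_utc     INTEGER NOT NULL,  -- Unix seconds
        lat        REAL,
        lon        REAL,
        weather    TEXT,
        event_type TEXT,
        object_uri TEXT NOT NULL      -- pointer to the blob in the data lake
    )
    """
)
# Composite index so rare-scenario lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_events_scenario ON events (weather, event_type, ts_utc)")

rows = [
    ("e1", "veh-001", 1_700_000_100, 37.77, -122.42, "snow", "pedestrian_crossing", "s3://example-av-lake/e1.bin"),
    ("e2", "veh-002", 1_700_000_200, 37.78, -122.41, "clear", "pedestrian_crossing", "s3://example-av-lake/e2.bin"),
    ("e3", "veh-003", 1_700_000_300, 37.79, -122.40, "snow", "cut_in", "s3://example-av-lake/e3.bin"),
    ("e4", "veh-001", 1_699_000_000, 37.77, -122.42, "snow", "pedestrian_crossing", "s3://example-av-lake/e4.bin"),
    ("e5", "veh-004", 1_700_000_050, 37.80, -122.39, "snow", "pedestrian_crossing", "s3://example-av-lake/e5.bin"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def find_scenarios(db: sqlite3.Connection, weather: str, event_type: str, since_ts: int) -> list[str]:
    """Return lake URIs for events matching a scenario, oldest first."""
    cursor = db.execute(
        "SELECT object_uri FROM events "
        "WHERE weather = ? AND event_type = ? AND ts_utc >= ? "
        "ORDER BY ts_utc",
        (weather, event_type, since_ts),
    )
    return [row[0] for row in cursor]


hits = find_scenarios(conn, "snow", "pedestrian_crossing", 1_700_000_000)
print(hits)
```

Only the small metadata rows are queried. The heavy sensor blobs stay in the lake and are fetched by URI once a scenario has been selected for labeling or training.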