Chapter 25 — Data Engineering

Data Engineering & Pipelines

Where data lives and how it flows: warehouses vs lakes, ETL vs ELT, batch vs streaming, orchestration, dbt, and what scales beyond pandas.

Analysts who understand the pipeline ship faster and break less. You don't have to build it all — but you must know where your data comes from, how fresh it is, and what breaks when it's late.
25.1 The modern data stack
data flow
Sources (apps, APIs, DBs, events)
   │  EXTRACT  (Fivetran / Airbyte / custom)
   ▼
Raw landing  ──► Warehouse / Lake (Snowflake, BigQuery, Redshift, Databricks)
   │  TRANSFORM  (dbt / SQL)
   ▼
Modeled tables (clean, tested, documented)
   │
   ├──► BI dashboards (Looker, Tableau, Power BI)
   ├──► ML feature store / training
   └──► reverse-ETL back into tools
25.2 ETL vs ELT
ETL (older)
Transform before loading. Rigid, transforms on a separate server. Good when storage is expensive or data must be cleaned pre-load.
ELT (modern default)
Load raw first, transform inside the warehouse with SQL/dbt. Cheap storage + powerful warehouses make this the standard. Raw data is always recoverable.
25.3 Warehouse vs lake vs lakehouse
StoreHoldsBest forExamples
WarehouseStructured tablesBI, SQL analyticsSnowflake, BigQuery
Data lakeRaw files (any type)Cheap mass storage, MLS3, GCS + Parquet
LakehouseFiles + table layerBoth worlds, big dataDatabricks/Delta, Iceberg
Store columnar files as Parquet, not CSV — typed, compressed, and 5–100× faster to scan because engines read only the columns they need.
25.4 Batch vs streaming
how fresh must it be?
Latency requirement?
│
├── Hours/daily is fine ─────────► Batch (scheduled jobs)
├── Minutes ─────────────────────► Micro-batch
└── Sub-second (fraud, alerts) ──► Streaming (Kafka, Flink, Spark Streaming)
Streaming is far more expensive to build and operate. Don't reach for it unless the business genuinely needs real-time. Most analytics is perfectly served by daily or hourly batch.
25.5 Orchestration — running pipelines reliably

An orchestrator schedules tasks, enforces dependencies (a DAG), retries failures, and alerts you. Without one you have brittle cron jobs and silent failures.

python
# Airflow-style DAG: extract -> transform -> validate -> publish
extract >> transform >> validate >> publish
# each task: retries, timeout, on-failure alert; downstream waits for upstream
ToolUse when
AirflowGeneral-purpose, mature, Python DAGs
dbtSQL transformations + tests + docs in the warehouse
Dagster / PrefectModern, asset-aware, better local dev
Spark / DaskData too big for one machine's RAM
25.6 When pandas stops scaling
data size
Dataset size vs RAM?
│
├── Fits comfortably ──────────► pandas
├── Larger than RAM, 1 machine ► Polars / DuckDB (out-of-core)
└── Too big for 1 machine ─────► Spark / Dask / BigQuery

Professional recommendation

Default stackELT into a warehouse + dbt
File formatParquet, partitioned
OrchestrateAirflow / Dagster DAG
Big single-boxDuckDB or Polars
Common mistakes to avoid
Quick cheatsheet
ELT -> load raw, transform in warehouse
.parquet -> columnar, typed, compressed
dbt run / dbt test -> transform + validate
DuckDB -> SQL on local files, out-of-core
DAG -> task dependencies + retries