Chapter 25 — Data Engineering
Data Engineering & Pipelines
Where data lives and how it flows: warehouses vs lakes, ETL vs ELT, batch vs streaming, orchestration, dbt, and what scales beyond pandas.
Analysts who understand the pipeline ship faster and break less. You don't have to build it all — but you must know where your data comes from, how fresh it is, and what breaks when it's late.
25.1 The modern data stack
data flow
Sources (apps, APIs, DBs, events)
│ EXTRACT (Fivetran / Airbyte / custom)
▼
Raw landing ──► Warehouse / Lake (Snowflake, BigQuery, Redshift, Databricks)
│ TRANSFORM (dbt / SQL)
▼
Modeled tables (clean, tested, documented)
│
├──► BI dashboards (Looker, Tableau, Power BI)
├──► ML feature store / training
└──► reverse-ETL back into tools25.2 ETL vs ELT
ETL (older)
Transform before loading. Rigid, transforms on a separate server. Good when storage is expensive or data must be cleaned pre-load.ELT (modern default)
Load raw first, transform inside the warehouse with SQL/dbt. Cheap storage + powerful warehouses make this the standard. Raw data is always recoverable.25.3 Warehouse vs lake vs lakehouse
| Store | Holds | Best for | Examples |
|---|---|---|---|
| Warehouse | Structured tables | BI, SQL analytics | Snowflake, BigQuery |
| Data lake | Raw files (any type) | Cheap mass storage, ML | S3, GCS + Parquet |
| Lakehouse | Files + table layer | Both worlds, big data | Databricks/Delta, Iceberg |
Store columnar files as Parquet, not CSV — typed, compressed, and 5–100× faster to scan because engines read only the columns they need.
25.4 Batch vs streaming
how fresh must it be?
Latency requirement? │ ├── Hours/daily is fine ─────────► Batch (scheduled jobs) ├── Minutes ─────────────────────► Micro-batch └── Sub-second (fraud, alerts) ──► Streaming (Kafka, Flink, Spark Streaming)
Streaming is far more expensive to build and operate. Don't reach for it unless the business genuinely needs real-time. Most analytics is perfectly served by daily or hourly batch.
25.5 Orchestration — running pipelines reliably
An orchestrator schedules tasks, enforces dependencies (a DAG), retries failures, and alerts you. Without one you have brittle cron jobs and silent failures.
python
# Airflow-style DAG: extract -> transform -> validate -> publish extract >> transform >> validate >> publish # each task: retries, timeout, on-failure alert; downstream waits for upstream
| Tool | Use when |
|---|---|
| Airflow | General-purpose, mature, Python DAGs |
| dbt | SQL transformations + tests + docs in the warehouse |
| Dagster / Prefect | Modern, asset-aware, better local dev |
| Spark / Dask | Data too big for one machine's RAM |
25.6 When pandas stops scaling
data size
Dataset size vs RAM? │ ├── Fits comfortably ──────────► pandas ├── Larger than RAM, 1 machine ► Polars / DuckDB (out-of-core) └── Too big for 1 machine ─────► Spark / Dask / BigQuery
Professional recommendation
Default stackELT into a warehouse + dbt
File formatParquet, partitioned
OrchestrateAirflow / Dagster DAG
Big single-boxDuckDB or Polars
Common mistakes to avoid
- Building streaming when daily batch would do — huge cost, little value
- Storing analytical data as CSV instead of Parquet
- Cron jobs with no retries, alerting, or dependency management
- Transforming in fragile notebooks instead of tested dbt models
- Loading a 50 GB table into pandas instead of querying the warehouse
Quick cheatsheet
ELT -> load raw, transform in warehouse.parquet -> columnar, typed, compresseddbt run / dbt test -> transform + validateDuckDB -> SQL on local files, out-of-coreDAG -> task dependencies + retries