Chapter 25 — Data Engineering

Data Engineering & Pipelines

Where data lives and how it flows: warehouses vs lakes, ETL vs ELT, batch vs streaming, orchestration, dbt, and what scales beyond pandas.

Analysts who understand the pipeline ship faster and break less. You don't have to build it all — but you must know where your data comes from, how fresh it is, and what breaks when it's late.

25.1 The modern data stack

data flow

Sources (apps, APIs, DBs, events)
   │  EXTRACT  (Fivetran / Airbyte / custom)
   ▼
Raw landing  ──► Warehouse / Lake (Snowflake, BigQuery, Redshift, Databricks)
   │  TRANSFORM  (dbt / SQL)
   ▼
Modeled tables (clean, tested, documented)
   │
   ├──► BI dashboards (Looker, Tableau, Power BI)
   ├──► ML feature store / training
   └──► reverse-ETL back into tools

25.2 ETL vs ELT

ETL (older)

Transform before loading. Rigid, transforms on a separate server. Good when storage is expensive or data must be cleaned pre-load.

ELT (modern default)

Load raw first, transform inside the warehouse with SQL/dbt. Cheap storage + powerful warehouses make this the standard. Raw data is always recoverable.

25.3 Warehouse vs lake vs lakehouse

Store	Holds	Best for	Examples
Warehouse	Structured tables	BI, SQL analytics	Snowflake, BigQuery
Data lake	Raw files (any type)	Cheap mass storage, ML	S3, GCS + Parquet
Lakehouse	Files + table layer	Both worlds, big data	Databricks/Delta, Iceberg

Store columnar files as Parquet, not CSV — typed, compressed, and 5–100× faster to scan because engines read only the columns they need.

25.4 Batch vs streaming

how fresh must it be?

Latency requirement?
│
├── Hours/daily is fine ─────────► Batch (scheduled jobs)
├── Minutes ─────────────────────► Micro-batch
└── Sub-second (fraud, alerts) ──► Streaming (Kafka, Flink, Spark Streaming)

Streaming is far more expensive to build and operate. Don't reach for it unless the business genuinely needs real-time. Most analytics is perfectly served by daily or hourly batch.

25.5 Orchestration — running pipelines reliably

An orchestrator schedules tasks, enforces dependencies (a DAG), retries failures, and alerts you. Without one you have brittle cron jobs and silent failures.

python

# Airflow-style DAG: extract -> transform -> validate -> publish
extract >> transform >> validate >> publish
# each task: retries, timeout, on-failure alert; downstream waits for upstream

Tool	Use when
Airflow	General-purpose, mature, Python DAGs
dbt	SQL transformations + tests + docs in the warehouse
Dagster / Prefect	Modern, asset-aware, better local dev
Spark / Dask	Data too big for one machine's RAM

25.6 When pandas stops scaling

data size

Dataset size vs RAM?
│
├── Fits comfortably ──────────► pandas
├── Larger than RAM, 1 machine ► Polars / DuckDB (out-of-core)
└── Too big for 1 machine ─────► Spark / Dask / BigQuery

Professional recommendation

Default stackELT into a warehouse + dbt

File formatParquet, partitioned

OrchestrateAirflow / Dagster DAG

Big single-boxDuckDB or Polars

Common mistakes to avoid

Building streaming when daily batch would do — huge cost, little value
Storing analytical data as CSV instead of Parquet
Cron jobs with no retries, alerting, or dependency management
Transforming in fragile notebooks instead of tested dbt models
Loading a 50 GB table into pandas instead of querying the warehouse

Quick cheatsheet

ELT -> load raw, transform in warehouse

.parquet -> columnar, typed, compressed

dbt run / dbt test -> transform + validate

DuckDB -> SQL on local files, out-of-core

DAG -> task dependencies + retries