Chapter 02 — Loading

Data Loading

How to load data from different sources — CSV, Excel, JSON, databases, and web APIs.

2.0 Source selection guide

Source type	Use when	Why use it	Skip / caution
CSV	Flat tabular data, easy sharing	Simple and universal format	Skip for very large repeated pipelines where Parquet is faster and smaller
Excel	Business files with multiple sheets/manual inputs	Good for stakeholder-ready files	Skip as a primary storage for big automated jobs
JSON/API	Nested web/app events	Captures rich hierarchical data	Skip direct analysis before flattening and validating schema drift
SQL	Production-grade relational data	Filter at source and keep logic reproducible	Skip full table pulls when only a subset is needed
Parquet	Large analytics pipelines	Columnar + compressed + fast read/write	Skip if your consumers only use Excel tools

System to use for repeatable analysis: keep raw data in files or database tables, run cleaning/transforms in Python (pandas), and save processed outputs in Parquet plus a final CSV/Excel for sharing. This gives speed for you and compatibility for others.

DataXForgeConvert formats without code: CSV → JSON · Excel → CSV · JSON ↔ YAML · XML → JSON · TSV → CSV.

2.1 Load CSV file

python

# Basic
df = pd.read_csv('data/raw/file.csv')

# With options
df = pd.read_csv(
    'data/raw/file.csv',
    sep=',',                        # delimiter (use '	' for TSV)
    encoding='utf-8',               # try 'latin-1' if errors occur
    parse_dates=['date_column'],     # auto-parse date columns
    index_col=None,                   # don't use any column as index
    na_values=['N/A', '--', '', 'null'],
    nrows=1000                        # load first 1000 rows only
)

2.2 Load Excel

python

# Single sheet
df = pd.read_excel('data/raw/file.xlsx', sheet_name='Sheet1')

# All sheets into a dictionary
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
df = all_sheets['Sales']      # access specific sheet
print(all_sheets.keys())      # see all sheet names

2.3 Load JSON / Web APIs

python

# JSON file
df = pd.read_json('data.json')

# From a REST API
import requests
response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)           # if data is a list of records
df = pd.DataFrame(data['results'])  # if nested

2.4 Load from SQL database

python

import sqlite3

# SQLite (local file)
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM sales WHERE year = 2024", conn)
conn.close()

# MySQL / PostgreSQL (via sqlalchemy)
import sqlalchemy
engine = sqlalchemy.create_engine('mysql+pymysql://user:pass@host/dbname')
df = pd.read_sql("SELECT * FROM orders LIMIT 10000", engine)

2.5 Save your data

python

# CSV (most universal)
df.to_csv('data/processed/clean.csv', index=False)

# Excel
df.to_excel('output.xlsx', sheet_name='Results', index=False)

# Parquet (fast, compressed — best for large files)
df.to_parquet('data/processed/clean.parquet', index=False)

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets