Chapter 02 — Loading

Data Loading

How to load data from different sources — CSV, Excel, JSON, databases, and web APIs.

2.0 Source selection guide
Source typeUse whenWhy use itSkip / caution
CSVFlat tabular data, easy sharingSimple and universal formatSkip for very large repeated pipelines where Parquet is faster and smaller
ExcelBusiness files with multiple sheets/manual inputsGood for stakeholder-ready filesSkip as a primary storage for big automated jobs
JSON/APINested web/app eventsCaptures rich hierarchical dataSkip direct analysis before flattening and validating schema drift
SQLProduction-grade relational dataFilter at source and keep logic reproducibleSkip full table pulls when only a subset is needed
ParquetLarge analytics pipelinesColumnar + compressed + fast read/writeSkip if your consumers only use Excel tools
System to use for repeatable analysis: keep raw data in files or database tables, run cleaning/transforms in Python (pandas), and save processed outputs in Parquet plus a final CSV/Excel for sharing. This gives speed for you and compatibility for others.
DataXForgeConvert formats without code: CSV → JSON · Excel → CSV · JSON ↔ YAML · XML → JSON · TSV → CSV.
2.1 Load CSV file
python
# Basic
df = pd.read_csv('data/raw/file.csv')

# With options
df = pd.read_csv(
    'data/raw/file.csv',
    sep=',',                        # delimiter (use '	' for TSV)
    encoding='utf-8',               # try 'latin-1' if errors occur
    parse_dates=['date_column'],     # auto-parse date columns
    index_col=None,                   # don't use any column as index
    na_values=['N/A', '--', '', 'null'],
    nrows=1000                        # load first 1000 rows only
)
2.2 Load Excel
python
# Single sheet
df = pd.read_excel('data/raw/file.xlsx', sheet_name='Sheet1')

# All sheets into a dictionary
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)
df = all_sheets['Sales']      # access specific sheet
print(all_sheets.keys())      # see all sheet names
2.3 Load JSON / Web APIs
python
# JSON file
df = pd.read_json('data.json')

# From a REST API
import requests
response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)           # if data is a list of records
df = pd.DataFrame(data['results'])  # if nested
2.4 Load from SQL database
python
import sqlite3

# SQLite (local file)
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM sales WHERE year = 2024", conn)
conn.close()

# MySQL / PostgreSQL (via sqlalchemy)
import sqlalchemy
engine = sqlalchemy.create_engine('mysql+pymysql://user:pass@host/dbname')
df = pd.read_sql("SELECT * FROM orders LIMIT 10000", engine)
2.5 Save your data
python
# CSV (most universal)
df.to_csv('data/processed/clean.csv', index=False)

# Excel
df.to_excel('output.xlsx', sheet_name='Results', index=False)

# Parquet (fast, compressed — best for large files)
df.to_parquet('data/processed/clean.parquet', index=False)
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets