Chapter 02 — Loading
Data Loading
How to load data from different sources — CSV, Excel, JSON, databases, and web APIs.
2.0 Source selection guide
| Source type | Use when | Why use it | Skip / caution |
|---|---|---|---|
| CSV | Flat tabular data, easy sharing | Simple and universal format | Skip for very large repeated pipelines where Parquet is faster and smaller |
| Excel | Business files with multiple sheets/manual inputs | Good for stakeholder-ready files | Skip as a primary storage for big automated jobs |
| JSON/API | Nested web/app events | Captures rich hierarchical data | Skip direct analysis before flattening and validating schema drift |
| SQL | Production-grade relational data | Filter at source and keep logic reproducible | Skip full table pulls when only a subset is needed |
| Parquet | Large analytics pipelines | Columnar + compressed + fast read/write | Skip if your consumers only use Excel tools |
System to use for repeatable analysis: keep raw data in files or database tables, run cleaning/transforms in Python (pandas), and save processed outputs in Parquet plus a final CSV/Excel for sharing. This gives speed for you and compatibility for others.
DataXForgeConvert formats without code: CSV → JSON · Excel → CSV · JSON ↔ YAML · XML → JSON · TSV → CSV.
2.1 Load CSV file
python
# Basic df = pd.read_csv('data/raw/file.csv') # With options df = pd.read_csv( 'data/raw/file.csv', sep=',', # delimiter (use ' ' for TSV) encoding='utf-8', # try 'latin-1' if errors occur parse_dates=['date_column'], # auto-parse date columns index_col=None, # don't use any column as index na_values=['N/A', '--', '', 'null'], nrows=1000 # load first 1000 rows only )
2.2 Load Excel
python
# Single sheet df = pd.read_excel('data/raw/file.xlsx', sheet_name='Sheet1') # All sheets into a dictionary all_sheets = pd.read_excel('file.xlsx', sheet_name=None) df = all_sheets['Sales'] # access specific sheet print(all_sheets.keys()) # see all sheet names
2.3 Load JSON / Web APIs
python
# JSON file df = pd.read_json('data.json') # From a REST API import requests response = requests.get('https://api.example.com/data') data = response.json() df = pd.DataFrame(data) # if data is a list of records df = pd.DataFrame(data['results']) # if nested
2.4 Load from SQL database
python
import sqlite3 # SQLite (local file) conn = sqlite3.connect('database.db') df = pd.read_sql_query("SELECT * FROM sales WHERE year = 2024", conn) conn.close() # MySQL / PostgreSQL (via sqlalchemy) import sqlalchemy engine = sqlalchemy.create_engine('mysql+pymysql://user:pass@host/dbname') df = pd.read_sql("SELECT * FROM orders LIMIT 10000", engine)
2.5 Save your data
python
# CSV (most universal) df.to_csv('data/processed/clean.csv', index=False) # Excel df.to_excel('output.xlsx', sheet_name='Results', index=False) # Parquet (fast, compressed — best for large files) df.to_parquet('data/processed/clean.parquet', index=False)
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
df.info() -> Structure and non-null countsdf.describe() -> Numeric summary statisticsdf.isnull().sum() -> Missing-value counts by columndf.groupby() -> Segmented aggregationpd.merge() -> Join multiple datasets