Chapter 01 — Setup

Environment & Libraries

Install and import all essential Python libraries for data analytics. Copy this template at the start of every project.

Clear Start for Every Analysis Project

This notebook is your step-by-step system. Start with context, inspect data quality, then clean, analyze, validate, and report. Follow the same order every time for reliable results.

Beginner friendly | End-to-end workflow | Reusable template

DataXForge: Prefer no-code? Every stage in this handbook also has a free in-browser tool. Start with Universal Data Lab: drop in any CSV, JSON, or Excel file to auto-detect, profile, clean, visualize, and export. It runs 100% client-side, so nothing is uploaded.

1.0 Quick start in 2 minutes

Step A

Write the business question in one sentence and define the KPI.

Step B

Load data and run shape, info, null, and duplicate checks first.

Step C

Use the scenario guide below to choose the right method quickly.

Step D

Finish with clear findings, limitations, and next actions.
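
The checks in Step B can be bundled into one small helper. This is a minimal sketch rather than part of the original template: the function name `first_look` and the sample data are hypothetical, and you would pass your own DataFrame instead.

```python
import pandas as pd


def first_look(df: pd.DataFrame) -> dict:
    """Collect the basic quality checks to run before any cleaning."""
    return {
        "shape": df.shape,                             # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),     # column -> dtype name
        "null_counts": df.isnull().sum().to_dict(),    # missing values per column
        "duplicate_rows": int(df.duplicated().sum()),  # fully repeated rows
    }


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 4],
        "region": ["North", "South", "South", None],
        "revenue": [120.0, 80.5, 80.5, None],
    }
)

report = first_look(sample)
print(report)
sample.info()  # prints dtypes and non-null counts
```

Returning a dictionary instead of only printing makes it easy to save the "before" state and compare it with the same checks after cleaning.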

1.1 How to use this notebook

Use this notebook as a repeatable analysis process. For every dataset, move in the same order: define the question, understand the source, inspect the raw data, clean carefully, explore patterns, test ideas, and report the result clearly. Do not jump to charts or modeling before you know what the data contains.

1.2 Analysis workflow

Define the question: Write down the exact problem you want to solve, the decision it supports, and the key metric you care about.
Understand the dataset: Check where the data came from, what each row represents, the date range, the unit of measure, and the main column meanings.
Inspect the raw data: Look at shape, columns, dtypes, missing values, duplicates, and unusual values before making changes.
Protect the original data: Create a raw backup and work on a copy so you can always compare your cleaned version with the source.
Clean with intention: Fix types, standardize text, handle missing values, remove duplicates, and decide how outliers should be treated.
Explore patterns: Use descriptive statistics, grouped summaries, and visualizations to understand trends, segments, and relationships.
Validate findings: Ask whether the results make sense, check edge cases, and confirm important conclusions with more than one method.
Summarize the story: Finish with clear insights, limitations, next steps, and the business recommendation that follows from the data.
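
The "protect" and "clean" steps of this workflow might look like the sketch below. The column names and values are hypothetical, and the cleaning choices (title-casing names, coercing unparseable values to missing) are assumptions you should adapt to your own data.

```python
import pandas as pd

# Hypothetical raw data, for illustration only.
raw = pd.DataFrame(
    {
        "customer": [" Alice ", "BOB", "bob", "Cara"],
        "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date"],
        "spend": ["100", "250", "250", None],
    }
)

df = raw.copy()  # work on a copy; `raw` stays untouched for comparison

# Standardize text: trim whitespace and normalize case.
df["customer"] = df["customer"].str.strip().str.title()

# Fix types: unparseable values become NaT / NaN instead of raising an error.
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")

# Remove exact duplicates that only became visible after standardizing text.
df = df.drop_duplicates().reset_index(drop=True)

print(f"rows before: {len(raw)}, rows after: {len(df)}")
print(df.dtypes)
```

Standardizing text before removing duplicates matters here: "BOB" and "bob" only count as the same customer once the case is normalized.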

1.3 Scenario guide

Scenario	Use this	Why
You are seeing the dataset for the first time	`df.shape`, `df.head()`, `df.info()`, `df.isnull().sum()`	These tell you what the data looks like, what types exist, and whether the data needs cleaning before analysis.
The data has missing values	Mean for balanced numeric data, median for skewed numeric data, mode or `Unknown` for categories	Different fill methods work better depending on the shape and meaning of the column.
You want to compare groups	Bar chart, boxplot, t-test, or ANOVA	These help you compare values across departments, regions, age groups, or categories.
You want to study trend over time	Line chart, rolling average, date features like month or weekday	Time-based data is easier to understand when you keep the order and show movement across dates.
The data is skewed or has outliers	Median, IQR clipping, log transform	These reduce the effect of extreme values and give a more stable summary.
You want to combine multiple files	`pd.merge()` for related tables, `pd.concat()` for stacking similar tables	Merge is for matching keys; concat is for adding more rows of the same structure.
You need a dashboard or report	Pivot table, groupby summary, export to Excel/CSV, clear chart labels	Stakeholders need summarized and readable outputs, not raw row-level data.
You want to predict a category	Logistic Regression, Random Forest Classifier, F1 score, ROC-AUC	Use classification methods when the target is a label like yes/no or churn/not churn.
You want to predict a number	Linear Regression, Random Forest Regressor, MAE, RMSE	Use regression when the target is continuous, such as revenue, price, or demand.
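
Two rows of the guide, missing values and outliers, are sketched below. The data is hypothetical, and the 1.5 × IQR fence is a common convention assumed here for illustration; the guide itself does not fix a multiplier.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one skewed numeric column and one categorical column.
df = pd.DataFrame(
    {
        "income": [30.0, 32.0, 35.0, np.nan, 31.0, 400.0],
        "segment": ["A", "B", None, "A", "A", "B"],
    }
)

# Skewed numeric column: fill with the median, which the 400 outlier barely moves.
df["income_filled"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with an explicit "Unknown" label.
df["segment_filled"] = df["segment"].fillna("Unknown")

# IQR clipping: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["income_filled"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["income_clipped"] = df["income_filled"].clip(lower=lower, upper=upper)

print(df)
```

Writing the results to new columns keeps the original values available, so you can report how many rows were filled or clipped.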

1.4 What to write before analysis

Record the source, the date you received the file, the expected grain of the data, the target variable, and the most important columns. Also write the business goal in one sentence. This gives you a clear reference when you later explain why you cleaned, grouped, filtered, or modeled the data in a certain way.

1.5 When to use common statistics

Statistic	When to use	Why use it	When to skip
Mean	Data is fairly symmetric and has no extreme outliers	Shows the average value clearly	Skip for skewed data, salary, income, house prices, or data with big outliers
Median	Data is skewed or contains outliers	Gives the middle value and is resistant to extreme values	Skip only if you need a pure arithmetic average for a balanced distribution
Mode	Need the most common category or value	Helps with categorical data and repeated values	Skip when every value is almost unique or the category has too many levels
Standard deviation	You want to measure spread around the mean	Shows how consistent or variable the data is	Skip if the mean is not a meaningful center because the distribution is heavily skewed
Percentiles / quantiles	You want thresholds, ranks, or segment cutoffs	Useful for top 10%, quartiles, and outlier boundaries	Skip if you only need a simple summary and not a split of the data
Correlation	You want to check whether two numeric variables move together	Quick way to identify related variables	Skip for causal claims, categorical-only data, or non-linear relationships without more testing
T-test / ANOVA	You want to compare group means	Tests whether differences are likely real or due to chance	Skip if the data is not numeric, groups are too small, or assumptions are badly violated
Chi-square	You want to test association between categorical variables	Checks whether categories are independent	Skip for continuous numeric variables
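
A few of these statistics are shown in the sketch below. All data is made up (the two groups are simulated with a fixed seed), and the t-test uses Welch's version as an assumption, because it does not require equal variances. The survey table is far too small for a trustworthy chi-square result and only demonstrates the call.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Skewed sample: the mean is pulled up by one extreme value, the median is not.
salaries = pd.Series([40, 42, 45, 47, 50, 300])
print("mean:", salaries.mean(), "median:", salaries.median())

# Two-sample t-test (Welch's version, which does not assume equal variances).
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=55, scale=5, size=40)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)
print("t-test p-value:", t_p)

# Chi-square test of independence on a contingency table of two categorical columns.
survey = pd.DataFrame(
    {
        "plan": ["basic", "basic", "pro", "pro", "basic", "pro", "basic", "pro"],
        "churned": ["yes", "no", "no", "no", "yes", "no", "yes", "yes"],
    }
)
table = pd.crosstab(survey["plan"], survey["churned"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
print("chi-square p-value:", chi_p, "dof:", dof)
```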

1.6 Install libraries

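Run this once in your terminal. To run it from the first Jupyter cell instead, prefix it with % (%pip install ...) so the packages are installed into the notebook's own environment.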

terminal

pip install pandas numpy matplotlib seaborn scipy scikit-learn plotly openpyxl
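
After installing, you can confirm what is actually available in the current environment. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip command.
PACKAGES = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "scipy", "scikit-learn", "plotly", "openpyxl",
]


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Recording these versions in your README makes it easier to reproduce the analysis later.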

1.7 Standard imports: copy into every project

python

# === CORE ===
import pandas as pd
import numpy as np

# === VISUALIZATION ===
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# === STATISTICS ===
from scipy import stats

# === MACHINE LEARNING ===
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# === DISPLAY SETTINGS ===
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set_theme(style="whitegrid")
# Jupyter/IPython magic: remove the next line when running as a plain .py script
%matplotlib inline

1.8 Project folder structure

recommended layout

my_project/
├── data/
│   ├── raw/          # original files — never modify!
│   ├── processed/    # cleaned data
│   └── output/       # results, exports
├── notebooks/        # your .ipynb files
├── reports/          # charts, PDFs
└── README.md

Rule: never overwrite raw data. Keep the files in data/raw/ untouched, and save every cleaned version to data/processed/.
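
The layout can be created with a few lines of standard-library Python. This is a sketch, not a required tool: the helper name `create_project` is hypothetical, and the demo runs in a temporary directory so nothing is written to your working folder.

```python
from pathlib import Path
import tempfile

# Folders from the recommended layout, relative to the project root.
SUBFOLDERS = ["data/raw", "data/processed", "data/output", "notebooks", "reports"]


def create_project(root: Path) -> Path:
    """Create the folder layout under `root` and an empty README.md."""
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch(exist_ok=True)
    return root


# Demo in a temporary directory; pass Path("my_project") for a real project.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```

Because `exist_ok=True` is set, running the helper again on an existing project does not raise an error or touch the files already inside it.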

Project Execution Template

Use this template before you start every project. It keeps analysis focused and interview-ready.

Field	Your note
Problem statement	What business issue are you solving?
Dataset	Name, source, timeframe, granularity
Objective	What decision should this analysis support?
KPI / metric	Primary success measure
Hypothesis	What do you expect and why?

Business question clearly defined
Stakeholder identified
Expected output decided (dashboard / report / model)

Why this helps: it makes your notebook reusable for portfolio and job interviews, not just one-off analysis.

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets
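
The last two entries, together with `pd.concat()` from the scenario guide, are sketched below on hypothetical tables. The column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical tables, for illustration only.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 20, 30],
        "amount": [50.0, 70.0, 20.0, 90.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": [10, 20], "region": ["North", "South"]}
)

# Join on the shared key; how="left" keeps every order, and
# indicator=True adds a `_merge` column showing which rows found a match.
merged = pd.merge(orders, customers, on="customer_id", how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
print("orders without a customer record:", unmatched)

# Segmented aggregation: total and average amount per region.
# dropna=False keeps the group of orders whose region is missing.
summary = (
    merged.groupby("region", dropna=False)["amount"]
    .agg(total="sum", average="mean")
    .reset_index()
)
print(summary)

# concat stacks tables that share the same columns (more rows, same structure).
more_orders = pd.DataFrame(
    {"order_id": [5], "customer_id": [20], "amount": [15.0]}
)
all_orders = pd.concat([orders, more_orders], ignore_index=True)
print(len(all_orders))
```

Checking the `_merge` column after a join is a quick way to catch keys that silently failed to match.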