Chapter 01 — Setup

Environment & Libraries

Install and import all essential Python libraries for data analytics. Copy this template at the start of every project.

Clear Start for Every Analysis Project

This notebook is your step-by-step system. Start with context, inspect data quality, then clean, analyze, validate, and report. Follow the same order every time for reliable results.

Beginner friendly · End-to-end workflow · Reusable template
1.0 Quick start in 2 minutes

Step A

Write the business question in one sentence and define the KPI.

Step B

Load data and run shape, info, null, and duplicate checks first.

Step C

Use the scenario guide below to choose the right method quickly.

Step D

Finish with clear findings, limitations, and next actions.
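The first-look checks in Step B can be sketched as below. This is a minimal example on a small made-up DataFrame (the column names and values are hypothetical); with a real project you would load your own file first, for example with pd.read_csv.

```python
import pandas as pd

# Hypothetical sample data standing in for your real file,
# e.g. df = pd.read_csv("data/raw/your_file.csv")
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "region": ["North", "South", "South", None],
    "revenue": [120.0, 85.5, 85.5, None],
})

print(df.shape)                     # (rows, columns)
df.info()                           # dtypes and non-null counts per column

null_counts = df.isnull().sum()     # missing values per column
dup_count = df.duplicated().sum()   # rows that exactly repeat an earlier row

print(null_counts)
print("duplicate rows:", dup_count)
```

Running these four checks before anything else tells you how much cleaning the dataset needs and whether the row count matches what you expected from the source.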

1.1 How to use this notebook
Use this notebook as a repeatable analysis process. For every dataset, move in the same order: define the question, understand the source, inspect the raw data, clean carefully, explore patterns, test ideas, and report the result clearly. Do not jump to charts or modeling before you know what the data contains.
1.2 Analysis workflow
  1. Define the question: Write down the exact problem you want to solve, the decision it supports, and the key metric you care about.
  2. Understand the dataset: Check where the data came from, what each row represents, the date range, the unit of measure, and the main column meanings.
  3. Inspect the raw data: Look at shape, columns, dtypes, missing values, duplicates, and unusual values before making changes.
  4. Protect the original data: Create a raw backup and work on a copy so you can always compare your cleaned version with the source.
  5. Clean with intention: Fix types, standardize text, handle missing values, remove duplicates, and decide how outliers should be treated.
  6. Explore patterns: Use descriptive statistics, grouped summaries, and visualizations to understand trends, segments, and relationships.
  7. Validate findings: Ask whether the results make sense, check edge cases, and confirm important conclusions with more than one method.
  8. Summarize the story: Finish with clear insights, limitations, next steps, and the business recommendation that follows from the data.
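Steps 4 and 5 of the workflow (protect the original, then clean with intention) might look like the sketch below. The DataFrame and its column names are invented for illustration, and the cleaning choices shown are one reasonable option, not the only one.

```python
import pandas as pd

# Hypothetical raw data with messy text, string-typed numbers, and a bad date
raw = pd.DataFrame({
    "customer": [" Alice", "BOB ", "bob", "Cara"],
    "signup": ["2024-01-05", "2024-02-10", "2024-02-10", "not a date"],
    "spend": ["100", "250", "250", "80"],
})

# Step 4: keep `raw` untouched and work on a copy
df = raw.copy()

# Step 5: standardize text, fix types, then remove duplicates
df["customer"] = df["customer"].str.strip().str.title()
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")  # invalid dates become NaT
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")     # invalid numbers become NaN
df = df.drop_duplicates().reset_index(drop=True)

# Compare the cleaned version with the source
print("rows before:", len(raw), "rows after:", len(df))
print("unparseable dates:", df["signup"].isna().sum())
```

Note that the two "Bob" rows only become exact duplicates after the text is standardized, which is why the order of the cleaning steps matters.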
1.3 Scenario guide
| Scenario | Use this | Why |
|---|---|---|
| You are seeing the dataset for the first time | df.shape, df.head(), df.info(), df.isnull().sum() | These tell you what the data looks like, what types exist, and whether the data needs cleaning before analysis. |
| The data has missing values | Mean for balanced numeric data, median for skewed numeric data, mode or Unknown for categories | Different fill methods work better depending on the shape and meaning of the column. |
| You want to compare groups | Bar chart, boxplot, t-test, or ANOVA | These help you compare values across departments, regions, age groups, or categories. |
| You want to study trend over time | Line chart, rolling average, date features like month or weekday | Time-based data is easier to understand when you keep the order and show movement across dates. |
| The data is skewed or has outliers | Median, IQR clipping, log transform | These reduce the effect of extreme values and give a more stable summary. |
| You want to combine multiple files | pd.merge() for related tables, pd.concat() for stacking similar tables | Merge is for matching keys; concat is for adding more rows of the same structure. |
| You need a dashboard or report | Pivot table, groupby summary, export to Excel/CSV, clear chart labels | Stakeholders need summarized and readable outputs, not raw row-level data. |
| You want to predict a category | Logistic Regression, Random Forest Classifier, F1 score, ROC-AUC | Use classification methods when the target is a label like yes/no or churn/not churn. |
| You want to predict a number | Linear Regression, Random Forest Regressor, MAE, RMSE | Use regression when the target is continuous, such as revenue, price, or demand. |
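Two of the scenarios above, missing values and outliers, can be illustrated together. The sketch below uses invented data and assumes the common 1.5 × IQR rule for clipping bounds; the multiplier is a convention you may want to adjust for your data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a skewed numeric column and a categorical column, both with gaps
df = pd.DataFrame({
    "income": [30.0, 32.0, 35.0, np.nan, 38.0, 40.0, 400.0],
    "segment": ["A", "B", "A", None, "A", "B", "B"],
})

# Skewed numeric column: fill with the median rather than the mean
df["income_filled"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with an explicit "Unknown" label
df["segment_filled"] = df["segment"].fillna("Unknown")

# IQR clipping: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["income_filled"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["income_clipped"] = df["income_filled"].clip(lower, upper)

print(df[["income", "income_filled", "income_clipped", "segment_filled"]])
```

Keeping the filled and clipped values in new columns, instead of overwriting the originals, makes it easy to check how much each step changed the data.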
1.4 What to write before analysis
Record the source, the date you received the file, the expected grain of the data, the target variable, and the most important columns. Also write the business goal in one sentence. This gives you a clear reference when you later explain why you cleaned, grouped, filtered, or modeled the data in a certain way.
1.5 When to use common statistics
| Statistic | When to use | Why use it | When to skip |
|---|---|---|---|
| Mean | Data is fairly symmetric and has no extreme outliers | Shows the average value clearly | Skip for skewed data, salary, income, house prices, or data with big outliers |
| Median | Data is skewed or contains outliers | Gives the middle value and is resistant to extreme values | Skip only if you need a pure arithmetic average for a balanced distribution |
| Mode | Need the most common category or value | Helps with categorical data and repeated values | Skip when every value is almost unique or the category has too many levels |
| Standard deviation | You want to measure spread around the mean | Shows how consistent or variable the data is | Skip if the mean is not a meaningful center because the distribution is heavily skewed |
| Percentiles / quantiles | You want thresholds, ranks, or segment cutoffs | Useful for top 10%, quartiles, and outlier boundaries | Skip if you only need a simple summary and not a split of the data |
| Correlation | You want to check whether two numeric variables move together | Quick way to identify related variables | Skip for causal claims, categorical-only data, or non-linear relationships without more testing |
| T-test / ANOVA | You want to compare group means | Tests whether differences are likely real or due to chance | Skip if the data is not numeric, groups are too small, or assumptions are badly violated |
| Chi-square | You want to test association between categorical variables | Checks whether categories are independent | Skip for continuous numeric variables |
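A few rows of the table above can be tried out directly. The sketch below uses small invented samples: it contrasts mean and median on a skewed series, runs a two-sample t-test (Welch's variant, chosen here so equal variances are not assumed), and runs a chi-square test of independence on a crosstab.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Mean vs median on skewed data: one large value pulls the mean upward
salaries = pd.Series([40, 42, 45, 47, 50, 300])
mean_salary = salaries.mean()
median_salary = salaries.median()
print(f"mean={mean_salary:.2f}, median={median_salary:.2f}")

# T-test: compare the means of two hypothetical groups
group_a = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2])
group_b = np.array([11.0, 11.3, 10.9, 11.2, 11.1, 10.8])
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")

# Chi-square: test association between two categorical columns
survey = pd.DataFrame({
    "plan": ["basic"] * 50 + ["premium"] * 50,
    "churned": ["yes"] * 30 + ["no"] * 20 + ["yes"] * 10 + ["no"] * 40,
})
table = pd.crosstab(survey["plan"], survey["churned"])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p_chi:.4f}, dof={dof}")
```

A small p-value only says the difference is unlikely under the no-difference assumption; it does not say the difference is large or important, so report the group means or rates alongside it.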
1.6 Install libraries
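Run this once in your terminal. In a Jupyter cell, use %pip install instead of pip install so the packages are installed into the environment your notebook kernel is running.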
terminal
pip install pandas numpy matplotlib seaborn scipy scikit-learn plotly openpyxl
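To confirm the install worked, a quick check like the one below can help. It is a sketch using only the standard library; the helper name missing_packages is made up for this example. Note that the pip name and the import name can differ (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Mapping of pip package name -> import name
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "plotly": "plotly",
    "openpyxl": "openpyxl",
}

def missing_packages(packages):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in packages.items()
        if find_spec(import_name) is None
    ]

missing = missing_packages(PACKAGES)
if missing:
    print("Missing, install with: pip install " + " ".join(missing))
else:
    print("All libraries are available.")
```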
1.7 Standard imports — copy every project
python
# === CORE ===
import pandas as pd
import numpy as np

# === VISUALIZATION ===
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# === STATISTICS ===
from scipy import stats

# === MACHINE LEARNING ===
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# === DISPLAY SETTINGS ===
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set_theme(style="whitegrid")
%matplotlib inline
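As a rough illustration of how the machine-learning imports fit together, the sketch below splits synthetic data, scales it, and scores a classifier. The data is randomly generated for the example, and LogisticRegression is an extra import not in the template above (it is one of the classifiers named in the scenario guide).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data: two clusters of 100 points each (illustration only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out 25% for testing; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"test accuracy: {accuracy:.2f}")
```

Fitting the scaler only on the training split avoids leaking information from the test set into the model.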
1.8 Project folder structure
recommended layout
my_project/
├── data/
│   ├── raw/          # original files — never modify!
│   ├── processed/    # cleaned data
│   └── output/       # results, exports
├── notebooks/        # your .ipynb files
├── reports/          # charts, PDFs
└── README.md
Rule: never overwrite raw data. Keep the files in data/raw/ untouched and do all cleaning on copies saved to data/processed/.
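If you prefer to create this layout from code, a small helper along these lines would do it. The function name create_project is made up for this sketch, and the demo runs in a temporary directory so it does not touch your real files.

```python
import tempfile
from pathlib import Path

def create_project(root):
    """Create the recommended folder layout under `root` and return its path."""
    root = Path(root)
    for sub in ["data/raw", "data/processed", "data/output", "notebooks", "reports"]:
        (root / sub).mkdir(parents=True, exist_ok=True)  # safe to re-run
    (root / "README.md").touch(exist_ok=True)
    return root

# Demo in a throwaway directory; pass a real path such as "my_project" in practice
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```

Because mkdir and touch are called with exist_ok=True, running the helper again on an existing project leaves current files in place.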
Project Execution Template
Use this template before you start every project. It keeps analysis focused and interview-ready.
| Field | Your note |
|---|---|
| Problem statement | What business issue are you solving? |
| Dataset | Name, source, timeframe, granularity |
| Objective | What decision should this analysis support? |
| KPI / metric | Primary success measure |
| Hypothesis | What do you expect and why? |
Why this helps: it makes your notebook reusable for portfolio and job interviews, not just one-off analysis.
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets
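The last two cheatsheet entries, along with pd.concat() from the scenario guide, can be seen together in a short sketch. All tables and column names below are hypothetical. The indicator=True option is used here to surface rows that found no match, which is a common check after a join.

```python
import pandas as pd

# Hypothetical related tables sharing the key `customer_id`
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [50.0, 70.0, 20.0, 90.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["North", "South"],
})

# merge: match rows on a key; a left join keeps every order
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = (merged["_merge"] == "left_only").sum()  # orders with no customer record
print("orders without a matching customer:", unmatched)

# groupby: segmented aggregation; dropna=False keeps the unmatched group visible
revenue_by_region = merged.groupby("region", dropna=False)["amount"].sum()
print(revenue_by_region)

# concat: stack tables that share the same columns
january = pd.DataFrame({"order_id": [1, 2], "amount": [50.0, 70.0]})
february = pd.DataFrame({"order_id": [3, 4], "amount": [20.0, 90.0]})
all_months = pd.concat([january, february], ignore_index=True)
print(all_months)
```

Checking the row count and unmatched keys right after a merge catches the most common join mistakes, such as duplicated keys inflating totals or missing keys silently dropping rows.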