Chapter 01 — Setup
Environment & Libraries
Install and import all essential Python libraries for data analytics. Copy this template at the start of every project.
Clear Start for Every Analysis Project
This notebook is your step-by-step system. Start with context, inspect data quality, then clean, analyze, validate, and report. Follow the same order every time for reliable results.
Beginner friendly
End-to-end workflow
Reusable template
DataXForge: Prefer no-code? Every stage in this handbook also has a free in-browser tool. Start with Universal Data Lab: drop any CSV, JSON, or Excel file to auto-detect, profile, clean, visualize, and export. It runs 100% client-side, so nothing is uploaded.
1.0 Quick start in 2 minutes
Step A
Write the business question in one sentence and define the KPI.
Step B
Load data and run shape, info, null, and duplicate checks first.
Step C
Use the scenario guide below to choose the right method quickly.
Step D
Finish with clear findings, limitations, and next actions.
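Step B can be sketched as below. This is a minimal example, assuming a small hypothetical DataFrame in place of your own file (in practice you would load it with `pd.read_csv` or similar); the column names `order_id`, `region`, and `revenue` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for your loaded file
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "region": ["North", "South", "South", None, "East"],
    "revenue": [120.0, 85.5, 85.5, np.nan, 240.0],
})

print(df.shape)                              # (rows, columns)
df.info()                                    # dtypes and non-null counts per column
null_counts = df.isnull().sum()              # missing values per column
duplicate_rows = int(df.duplicated().sum())  # fully identical rows after the first

print(null_counts)
print(f"Duplicate rows: {duplicate_rows}")
```

Running these four checks first tells you whether cleaning is needed before any chart or model is worth building.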
1.1 How to use this notebook
Use this notebook as a repeatable analysis process. For every dataset, move in the same order: define the question, understand the source, inspect the raw data, clean carefully, explore patterns, test ideas, and report the result clearly. Do not jump to charts or modeling before you know what the data contains.
1.2 Analysis workflow
- Define the question: Write down the exact problem you want to solve, the decision it supports, and the key metric you care about.
- Understand the dataset: Check where the data came from, what each row represents, the date range, the unit of measure, and the main column meanings.
- Inspect the raw data: Look at shape, columns, dtypes, missing values, duplicates, and unusual values before making changes.
- Protect the original data: Create a raw backup and work on a copy so you can always compare your cleaned version with the source.
- Clean with intention: Fix types, standardize text, handle missing values, remove duplicates, and decide how outliers should be treated.
- Explore patterns: Use descriptive statistics, grouped summaries, and visualizations to understand trends, segments, and relationships.
- Validate findings: Ask whether the results make sense, check edge cases, and confirm important conclusions with more than one method.
- Summarize the story: Finish with clear insights, limitations, next steps, and the business recommendation that follows from the data.
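The "protect" and "clean" steps in the workflow above might look like the following sketch. The data and the column names (`customer`, `signup_date`, `spend`) are hypothetical, and filling with the median is just one possible choice, not a rule for every column.

```python
import pandas as pd

# Hypothetical raw data with messy text, string-typed numbers, and a duplicate
raw = pd.DataFrame({
    "customer": [" alice ", "BOB", "BOB", "carol"],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15"],
    "spend": ["100", "250", "250", None],
})

df = raw.copy()  # work on a copy so raw stays available for comparison

# Fix types
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")

# Standardize text
df["customer"] = df["customer"].str.strip().str.title()

# Remove exact duplicate rows, then fill the missing spend with the median
df = df.drop_duplicates().reset_index(drop=True)
df["spend"] = df["spend"].fillna(df["spend"].median())

print(f"raw rows: {len(raw)}, cleaned rows: {len(df)}")
print(df)
```

Because `raw` is never modified, you can always compare row counts and values before and after cleaning.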
1.3 Scenario guide
| Scenario | Use this | Why |
|---|---|---|
| You are seeing the dataset for the first time | df.shape, df.head(), df.info(), df.isnull().sum() | These tell you what the data looks like, what types exist, and whether the data needs cleaning before analysis. |
| The data has missing values | Mean for roughly symmetric numeric data, median for skewed numeric data, mode or an "Unknown" label for categories | The right fill method depends on the shape and meaning of the column. |
| You want to compare groups | Bar chart, boxplot, t-test, or ANOVA | These help you compare values across departments, regions, age groups, or categories. |
| You want to study trend over time | Line chart, rolling average, date features like month or weekday | Time-based data is easier to understand when you keep the order and show movement across dates. |
| The data is skewed or has outliers | Median, IQR clipping, log transform | These reduce the effect of extreme values and give a more stable summary. |
| You want to combine multiple files | pd.merge() for related tables, pd.concat() for stacking similar tables | Merge is for matching keys; concat is for adding more rows of the same structure. |
| You need a dashboard or report | Pivot table, groupby summary, export to Excel/CSV, clear chart labels | Stakeholders need summarized and readable outputs, not raw row-level data. |
| You want to predict a category | Logistic Regression, Random Forest Classifier, F1 score, ROC-AUC | Use classification methods when the target is a label like yes/no or churn/not churn. |
| You want to predict a number | Linear Regression, Random Forest Regressor, MAE, RMSE | Use regression when the target is continuous, such as revenue, price, or demand. |
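Three rows of the scenario guide (missing values, outliers, combining files) can be illustrated together. This is a sketch on hypothetical data; the column names, the `stores` lookup table, and the 1.5 × IQR fence multiplier are assumptions for illustration, and the multiplier is a common convention rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing value per column and one extreme value
sales = pd.DataFrame({
    "store_id": [1, 2, 3, 4, 5, 6],
    "segment": ["A", "B", None, "A", "B", "A"],
    "revenue": [100.0, 110.0, np.nan, 95.0, 105.0, 900.0],
})

# Missing values: median for the skewed numeric column, "Unknown" for the category
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].median())
sales["segment"] = sales["segment"].fillna("Unknown")

# Outliers: clip values to the IQR fences instead of dropping rows
q1, q3 = sales["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
sales["revenue_clipped"] = sales["revenue"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Combining files: merge matches rows by key, concat stacks rows of the same structure
stores = pd.DataFrame({"store_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})
merged = pd.merge(sales, stores, on="store_id", how="left")

more_sales = pd.DataFrame({"store_id": [7], "segment": ["B"], "revenue": [98.0]})
stacked = pd.concat([sales, more_sales], ignore_index=True)

print(merged)
print(stacked)
```

A left merge keeps every sales row even when no matching store exists, which makes unmatched keys visible as missing values rather than silently dropping them.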
1.4 What to write before analysis
Record the source, the date you received the file, the expected grain of the data, the target variable, and the most important columns. Also write the business goal in one sentence. This gives you a clear reference when you later explain why you cleaned, grouped, filtered, or modeled the data in a certain way.
1.5 When to use common statistics
| Statistic | When to use | Why use it | When to skip |
|---|---|---|---|
| Mean | Data is fairly symmetric and has no extreme outliers | Shows the average value clearly | Skip for skewed data (such as salary, income, or house prices) or data with large outliers |
| Median | Data is skewed or contains outliers | Gives the middle value and is resistant to extreme values | Skip only if you need a pure arithmetic average for a balanced distribution |
| Mode | Need the most common category or value | Helps with categorical data and repeated values | Skip when every value is almost unique or the category has too many levels |
| Standard deviation | You want to measure spread around the mean | Shows how consistent or variable the data is | Skip if the mean is not a meaningful center because the distribution is heavily skewed |
| Percentiles / quantiles | You want thresholds, ranks, or segment cutoffs | Useful for top 10%, quartiles, and outlier boundaries | Skip if you only need a simple summary and not a split of the data |
| Correlation | You want to check whether two numeric variables move together | Quick way to identify related variables | Skip for causal claims, categorical-only data, or non-linear relationships without more testing |
| T-test / ANOVA | You want to compare group means | Tests whether differences are likely real or due to chance | Skip if the data is not numeric, groups are too small, or assumptions are badly violated |
| Chi-square | You want to test association between categorical variables | Checks whether categories are independent | Skip for continuous numeric variables |
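A few rows of the table above, sketched with SciPy. All numbers here are made up for illustration. Welch's t-test (`equal_var=False`) is used as one reasonable default when group variances may differ; that choice is an assumption of this sketch, not a recommendation from the table.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Mean vs. median: one extreme value pulls the mean but not the median
salaries = pd.Series([30, 32, 35, 36, 38, 40, 250])  # hypothetical, in thousands
mean_salary = salaries.mean()
median_salary = salaries.median()
print(f"mean={mean_salary:.1f}, median={median_salary:.1f}")

# T-test: compare the means of two hypothetical numeric groups
group_a = [12.1, 11.8, 12.5, 12.0, 11.9]
group_b = [13.0, 13.4, 12.9, 13.2, 13.1]
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")

# Chi-square: test association in a hypothetical 2x2 table of counts
# rows = plan type, columns = churned / stayed
observed = np.array([[30, 10],
                     [15, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={chi_p:.4f}, dof={dof}")
```

A small p-value suggests the difference is unlikely under the null hypothesis; it does not by itself show that the effect is large or causal.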
1.6 Install libraries
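Run this once in your terminal. In a Jupyter cell, use %pip install instead of pip install so the packages go into the environment the notebook kernel is using.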
terminal
pip install pandas numpy matplotlib seaborn scipy scikit-learn plotly openpyxl
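To confirm the install worked, you can list the installed versions. This sketch uses only the standard library (`importlib.metadata`); the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip (note: scikit-learn, not sklearn)
packages = ["pandas", "numpy", "matplotlib", "seaborn",
            "scipy", "scikit-learn", "plotly", "openpyxl"]

def installed_versions(names):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(packages).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Recording these versions in your README makes the analysis easier to reproduce later.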
1.7 Standard imports: copy into every project
python
# === CORE ===
import pandas as pd
import numpy as np

# === VISUALIZATION ===
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# === STATISTICS ===
from scipy import stats

# === MACHINE LEARNING ===
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# === DISPLAY SETTINGS ===
pd.set_option('display.max_columns', None)   # show all columns
pd.set_option('display.max_rows', 100)       # show up to 100 rows
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # 3 decimal places
sns.set_theme(style="whitegrid")

# Jupyter/IPython only: remove this line in a plain .py script
%matplotlib inline
1.8 Project folder structure
recommended layout
my_project/
├── data/
│   ├── raw/          # original files, never modify!
│   ├── processed/    # cleaned data
│   └── output/       # results, exports
├── notebooks/        # your .ipynb files
├── reports/          # charts, PDFs
└── README.md
Rule: never overwrite raw data. Keep the original files untouched and work only on copies.
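The recommended layout can be created with a few lines of `pathlib`. This is a minimal sketch; the helper name `create_project` and the README placeholder text are assumptions, not part of the template.

```python
from pathlib import Path

SUBFOLDERS = ["data/raw", "data/processed", "data/output", "notebooks", "reports"]

def create_project(root):
    """Create the recommended folder layout under root and return its Path."""
    root = Path(root)
    for sub in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing project
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("# Project notes\n", encoding="utf-8")
    return root

if __name__ == "__main__":
    project = create_project("my_project")
    print(f"Created layout under {project.resolve()}")
```

Existing folders and files are left alone, so running it again on an established project does not overwrite anything.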
Project Execution Template
Use this template before you start every project. It keeps analysis focused and interview-ready.
| Field | Your note |
|---|---|
| Problem statement | What business issue are you solving? |
| Dataset | Name, source, timeframe, granularity |
| Objective | What decision should this analysis support? |
| KPI / metric | Primary success measure |
| Hypothesis | What do you expect and why? |
- Business question clearly defined
- Stakeholder identified
- Expected output decided (dashboard / report / model)
Why this helps: it makes your notebook reusable for portfolio and job interviews, not just one-off analysis.
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets
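The last two cheatsheet entries are often used together: join a lookup table, then aggregate by segment. A small sketch on hypothetical `orders` and `customers` tables follows; the `validate="many_to_one"` argument is an optional safeguard added here, which raises an error if `customer_id` is not unique in the right-hand table.

```python
import pandas as pd

# Hypothetical tables: many orders per customer, one row per customer
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [50.0, 20.0, 30.0, 70.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Join: attach each order's region; fail loudly if customers has duplicate keys
joined = pd.merge(orders, customers, on="customer_id", how="left",
                  validate="many_to_one")

# Segmented aggregation: total, average, and number of orders per region
by_region = joined.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(by_region)
```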