Data Analysis with Pandas: A Practical Beginner's Guide
What is Pandas and Why is it Essential?
Pandas is an open-source Python library that provides high-performance, expressive data structures for working with structured (tabular) data. Wes McKinney began developing it in 2008 at AQR Capital Management, a quantitative investment firm where analysing financial time-series data was critical, and it has since become the backbone of data analysis in Python across many industries.
Data analysts, data scientists, financial modellers, research scientists, and machine learning engineers all rely on Pandas every day. If you regularly work with spreadsheets, CSV files, SQL query results, Excel reports, or any kind of tabular data, Pandas will transform how you work — replacing hours of manual Excel manipulation with a few lines of clean, repeatable, shareable Python code.
The core data structure is the DataFrame — think of it as an intelligent, programmable spreadsheet with labelled rows and columns. You get hundreds of built-in methods for filtering, transforming, aggregating, and visualising data. The companion Series structure represents a single column of data.
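To make that relationship concrete, here is a minimal sketch (the names and numbers are illustrative): each column of a DataFrame is itself a Series.

```python
import pandas as pd

# A Series is a single labelled column of data
s = pd.Series([88, 72, 95], name="score")

# A DataFrame is a table whose columns are Series
df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "score": s})

print(type(df["score"]))  # selecting one column gives back a Series
print(df["score"].mean())
```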
Installation and Importing
pip install pandas matplotlib openpyxl
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Always check versions when debugging library issues
print(pd.__version__)
Creating DataFrames
# Create from a Python dictionary
students = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eshan"],
"score": [88, 72, 95, 81, 67],
"grade": ["B", "C", "A", "B", "D"],
"passed": [True, True, True, True, False],
"city": ["Delhi", "Mumbai", "Pune", "Bangalore", "Chennai"]
})
print(students)
print(students.dtypes) # check column data types
Loading Real Data from Files
In practice, you will almost always load data from external files rather than creating DataFrames manually:
# Load a CSV file — most common format
df = pd.read_csv("sales_data.csv")
# Load a specific sheet from Excel
df_excel = pd.read_excel("quarterly_report.xlsx", sheet_name="Q1_2026")
# Load from a URL (no download needed!)
url = "https://raw.githubusercontent.com/datasets/sample.csv"
df_web = pd.read_csv(url)
# Save a DataFrame back to CSV
df.to_csv("cleaned_data.csv", index=False) # index=False avoids extra column
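read_csv accepts many useful options beyond the file path. A small sketch using an in-memory CSV (the column names and the "N/A" marker here are made up for illustration):

```python
import io
import pandas as pd

# Simulated CSV with a non-standard missing-value marker and a date column
raw = "name,joined,score\nAlice,2024-01-15,88\nBob,2024-02-03,N/A\n"

df = pd.read_csv(
    io.StringIO(raw),        # a file path or URL works the same way
    na_values=["N/A"],       # treat "N/A" as missing
    parse_dates=["joined"],  # parse this column as datetime
)
print(df.dtypes)
print(df["score"].isnull().sum())  # one missing value detected
```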
Exploring Your Data — Always Do This First
Before any analysis, spend a few minutes fully understanding the shape, types, and quality of your data. Skipping this step leads to incorrect analyses and wasted hours later:
df.shape # (rows, columns) — e.g., (1000, 12)
df.head(10) # first 10 rows
df.tail(5) # last 5 rows
df.info() # column names, types, non-null counts
df.describe() # count, mean, std, min, 25%, 50%, 75%, max
df.columns.tolist() # list of all column names
df.duplicated().sum() # count duplicate rows
df.isnull().sum() # count null values per column
df["score"].value_counts() # frequency of each unique value
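Run against the small students DataFrame from earlier, a couple of these checks look like this:

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eshan"],
    "score": [88, 72, 95, 81, 67],
    "grade": ["B", "C", "A", "B", "D"],
})

print(students.shape)                    # (5, 3) — 5 rows, 3 columns
print(students["grade"].value_counts())  # "B" appears twice
```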
Selecting and Filtering Data
# Select a single column (returns a Series)
scores = df["score"]
# Select multiple columns (returns a DataFrame)
subset = df[["name", "score", "grade"]]
# Filter rows: basic condition
high_scorers = df[df["score"] >= 90]
# Filter: multiple conditions (use & for AND, | for OR)
passing_b_students = df[(df["score"] >= 80) & (df["grade"] == "B")]
# Filter rows using .isin() for multiple values
top_cities = df[df["city"].isin(["Delhi", "Mumbai", "Bangalore"])]
# Filter with string methods
delhi_students = df[df["city"].str.startswith("Del")]
# Select rows by position (positions 5 through 9 — the end is exclusive)
middle_rows = df.iloc[5:10]
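A point that trips up many beginners: `.iloc` slices by position and excludes the end of the slice, while `.loc` slices by label and includes it. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 72, 95]}, index=["a", "b", "c"])

# .iloc is positional (end exclusive); .loc is label-based (end inclusive)
print(df.iloc[0:2])     # rows at positions 0 and 1
print(df.loc["a":"b"])  # rows labelled "a" through "b", inclusive
```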
Cleaning Messy Data
Real-world data is almost never clean — cleaning and preparing it is commonly estimated to take the majority of an analyst's time on a project. Pandas gives you the tools to handle every common quality issue:
# Find columns with missing values
print(df.isnull().sum())
# Drop rows where ANY value is null
df_clean = df.dropna()
# Drop rows only where specific column is null
df_clean = df.dropna(subset=["score"])
# Fill null values with mean/median/mode/specific value
df["score"] = df["score"].fillna(df["score"].mean())
df["grade"] = df["grade"].fillna("Unknown")
# Remove duplicate rows
df = df.drop_duplicates()
# Drop specific duplicate rows based on a subset of columns
df = df.drop_duplicates(subset=["name", "city"])
# Rename columns with confusing names
df = df.rename(columns={"sc": "score", "nm": "name", "yr": "year"})
# Fix data types
df["score"] = df["score"].astype(int)
df["date"] = pd.to_datetime(df["date"]) # parse date strings
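Once a column is parsed with pd.to_datetime, the .dt accessor unlocks its date components (the dates below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2026-01-15", "2026-03-02"]})
df["date"] = pd.to_datetime(df["date"])

# The .dt accessor exposes components of each datetime value
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month_name()
print(df)
```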
Adding New Columns and Transforming Data
# Create a calculated column
df["score_percent"] = (df["score"] / df["score"].max() * 100).round(2)  # score as % of top score
# Apply a custom function using .apply()
def classify(score):
if score >= 90: return "Distinction"
if score >= 75: return "Merit"
if score >= 60: return "Pass"
return "Fail"
df["result"] = df["score"].apply(classify)
# Lambda for simple transformations
lo, hi = df["score"].min(), df["score"].max()  # compute once, not per row
df["score_scaled"] = df["score"].apply(lambda x: (x - lo) / (hi - lo))
# String transformations
df["name_upper"] = df["name"].str.upper()
df["city_clean"] = df["city"].str.strip().str.title()
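.apply calls your Python function once per row, which can be slow on large frames. For banded classification like classify above, one vectorized alternative is pd.cut — sketched here with the same thresholds:

```python
import pandas as pd

df = pd.DataFrame({"score": [95, 80, 65, 40]})

# pd.cut assigns every score to a bin in one vectorized pass;
# right=False makes bins left-inclusive: [60, 75) -> "Pass", etc.
df["result"] = pd.cut(
    df["score"],
    bins=[float("-inf"), 60, 75, 90, float("inf")],
    labels=["Fail", "Pass", "Merit", "Distinction"],
    right=False,
)
print(df["result"].tolist())
```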
Grouping and Aggregation
GroupBy is one of the most powerful features in Pandas — it lets you split data into groups, apply a function to each group, and combine the results:
# Average score by grade
avg_by_grade = df.groupby("grade")["score"].mean()
print(avg_by_grade)
# Multiple aggregations simultaneously
summary = df.groupby("grade").agg(
student_count = ("name", "count"),
average_score = ("score", "mean"),
highest_score = ("score", "max"),
lowest_score = ("score", "min")
).round(1)
print(summary)
# Group by multiple columns
city_grade = df.groupby(["city", "grade"])["score"].mean().unstack()
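To see what .unstack() does, here is a tiny self-contained example: the inner group level (grade) pivots into columns, giving one row per city:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "grade": ["A", "B", "A", "B"],
    "score": [95, 85, 90, 80],
})

# Group means come back with a two-level index; .unstack() moves
# the inner level (grade) into the columns
table = df.groupby(["city", "grade"])["score"].mean().unstack()
print(table)
```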
Quick Visualisation
# Histogram of score distribution
df["score"].hist(bins=15, color="#4f8ef7", edgecolor="white", figsize=(10,5))
plt.title("Score Distribution Across All Students")
plt.xlabel("Score"); plt.ylabel("Frequency")
plt.tight_layout(); plt.show()
# Bar chart — average score by grade
avg_by_grade.sort_index().plot(kind="bar", color="#a259ff", figsize=(8,5))
plt.title("Average Score by Grade")
plt.xticks(rotation=0)
plt.tight_layout(); plt.show()
# Scatter plot (assumes the data has an "attendance" column)
df.plot.scatter(x="score", y="attendance", alpha=0.6, color="#39ff14")
plt.title("Score vs Attendance")
plt.show()