Exploratory Data Analysis with Python: An Overview

by Selwyn Davidraj Posted on October 12, 2025

This blog provides a practical overview of Exploratory Data Analysis (EDA) using Python, highlighting how to validate and understand your dataset through sanity checks and statistical exploration. It covers key techniques like univariate, bivariate, and multivariate analysis to uncover trends and relationships in data. You’ll also learn essential steps for handling missing values and detecting or treating outliers to ensure data quality before modeling.

Data Cleaning in Python

Data science projects start with understanding and preparing your data. In this guide, we explore high-level Exploratory Data Analysis (EDA) and data cleaning techniques using Python’s pandas, NumPy, and Matplotlib libraries. We’ll use a sample real estate dataset to demonstrate each concept. You can follow along easily in Google Colab or Jupyter Notebook.

1. What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is the critical first step in any data science workflow. It helps you:

Understand the dataset’s structure and content.
Identify trends, patterns, and relationships.
Detect anomalies like missing or outlying values.

EDA ensures that your dataset is clean and suitable for downstream analysis or machine learning.

2. Sample Dataset Overview

Let’s simulate a real estate dataset with random data for demonstration purposes.

import pandas as pd
import numpy as np

# Sample dataset
np.random.seed(42)
data = pd.DataFrame({
    'price': np.random.randint(300000, 3000000, 100),
    'suburb': np.random.choice(['Melbourne', 'Sydney', 'Brisbane', 'Perth'], 100),
    'rooms': np.random.randint(1, 6, 100),
    'propertyType': np.random.choice(['House', 'Unit', 'Townhouse'], 100),
    'realEstateAgent': np.random.choice(['Agent A', 'Agent B', 'Agent C'], 100),
    'saleDate': pd.date_range(start='2023-01-01', periods=100, freq='D'),
    'distanceToCBD': np.random.uniform(2, 35, 100),
    'buildingArea': np.random.uniform(50, 400, 100)
})

# Introduce some missing values
data.loc[5:10, 'buildingArea'] = np.nan

print(data.head())
print(data.info())

3. Key Steps in EDA

🔹 Step 1: Sanity Checks

Validate dataset integrity and ensure no duplicated or malformed data.

# Check duplicates
duplicates = data.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

# Check datatypes
print(data.dtypes)

🔹 Step 2: Univariate Analysis

Examine statistical properties and value distributions for each feature.

# Summary statistics
print(data['price'].describe())

# Histogram of property prices
import matplotlib.pyplot as plt
plt.figure(figsize=(7,4))
plt.hist(data['price'], bins=15, color='skyblue', edgecolor='black')
plt.title('Distribution of Property Prices')
plt.xlabel('Price (AUD)')
plt.ylabel('Frequency')
plt.show()

🔹 Step 3: Bivariate Analysis

Study relationships between pairs of variables — for instance, price vs. number of rooms.

plt.figure(figsize=(6,4))
plt.scatter(data['rooms'], data['price'], alpha=0.7, color='orange')
plt.title('Price vs. Number of Rooms')
plt.xlabel('Rooms')
plt.ylabel('Price (AUD)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

🔹 Step 4: Handling Missing Values

Missing or null values can distort analysis and model accuracy.

# Count missing values
print(data.isnull().sum())

# Impute missing values (mean replacement for simplicity)
data['buildingArea'].fillna(data['buildingArea'].mean(), inplace=True)

🔹 Step 5: Detecting Outliers

Outliers may represent genuine extremes or incorrect entries.

# Using IQR to detect price outliers
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['price'] < Q1 - 1.5*IQR) | (data['price'] > Q3 + 1.5*IQR)]
print(f"Outliers detected: {outliers.shape[0]}")

# Remove outliers if necessary
filtered_data = data[(data['price'] >= Q1 - 1.5*IQR) & (data['price'] <= Q3 + 1.5*IQR)]

4. Data Cleaning and Preparation

After exploration, refine the dataset for consistency and accuracy.

# Convert data types
data['buildingArea'] = pd.to_numeric(data['buildingArea'], errors='coerce')

# Drop duplicates
data.drop_duplicates(inplace=True)

# Handle any remaining missing values
data.dropna(inplace=True)

Clean data ensures your future analyses and models are both accurate and reliable.

5. Summary Statistics and Categorical Analysis

Numerical Summary

print(filtered_data.describe())

Categorical Overview

print(filtered_data['suburb'].value_counts())
print(filtered_data['propertyType'].value_counts())
print(filtered_data['realEstateAgent'].value_counts())

6. Insights and Observations

From the above exploratory steps, we can observe:

Average property prices vary widely between suburbs.
Number of rooms shows positive correlation with price.
Outliers may indicate luxury homes or incorrect data entries.
Missing data in numerical columns can be imputed using mean/median.

🧠 Conclusion

Exploratory Data Analysis (EDA) and data cleaning form the foundation of any reliable data science workflow.

By applying these techniques, you can:

Ensure data integrity.
Identify meaningful patterns.
Prepare clean, analysis-ready datasets.

In this real estate case study, we covered how to:

Run sanity checks.
Explore distributions and relationships.
Handle missing and outlier data.
Prepare structured, reliable datasets for predictive modeling.

Mastering EDA empowers data scientists to transform raw data into actionable insights — a crucial step toward successful machine learning outcomes.

Visualization techniques for Univariate analysis

Histogram

What is a Histogram?

A histogram is a graph that displays the frequency (or count) of data points in consecutive intervals called bins. It’s one of the fundamental tools for understanding the distribution of numerical variables—like car prices, test scores, or ages.

Creating a Histogram with Seaborn

Seaborn is a high-level Python visualization library based on Matplotlib, designed for drawing attractive and informative statistical graphics.

Example Code

import seaborn as sns
import matplotlib.pyplot as plt

# Example data
car_prices = [10000, 12000, 15000, 14000, 13000, 11000, 16000, 19000, 17000, 15000]

# Create histogram
sns.histplot(car_prices)

plt.xlabel('Car Price')
plt.ylabel('Frequency')
plt.title('Distribution of Car Prices')
plt.show()

This produces a histogram showing the distribution of car prices in the dataset.

Histogram

Customizing the Histogram

You can easily modify the appearance of your histogram by customizing:

Axis Labels and Limits
Bar Colors
Number of Bins

Example Code

sns.histplot(car_prices, bins=5, color='skyblue')
plt.xlabel('Car Price (USD)')
plt.ylabel('Number of Cars')
plt.title('Customized Car Price Distribution')
plt.xlim(10000, 20000)
plt.ylim(0, 4)
plt.show()

Histogram with custom color, axis labels, and limits

Histogram - Kernel Density Estimate (KDE) curve

When you add the parameter kde=True in Seaborn’s histplot(), it overlays a Kernel Density Estimate (KDE) curve on top of the histogram.

This KDE curve is a smoothed representation of the data distribution — it estimates the probability density function (PDF) of the variable, helping you visualize where the data is concentrated without relying solely on discrete histogram bins.

Example Code

import seaborn as sns
import matplotlib.pyplot as plt

# Example dataset: House prices (in $1,000s)
house_prices = [120, 135, 150, 160, 165, 170, 175, 180, 185, 190, 195,
                200, 205, 210, 215, 220, 230, 250, 260, 275]

# Create histogram with KDE overlay
sns.histplot(house_prices, bins=8, kde=True, color='skyblue')

plt.xlabel('House Price ($1000s)')
plt.ylabel('Frequency')
plt.title('Distribution of House Prices with KDE Curve')
plt.show()

Histogram with KDE curve

Comparing Categories with Multiple Histograms

You can overlay histograms by category using Seaborn’s hue argument to see how distributions differ across groups.

Example Code (Overlayed Histogram)

sns.histplot(data=car_data, x='price', hue='body_style', element='step', stat='density', common_norm=False)
plt.title('Car Prices by Body Style')
plt.show()

Explanation:

hue='body_style' separates car prices by category.
element='step' draws outlined histograms for clarity.
stat='density' normalizes each category to compare shapes rather than counts.
common_norm=False ensures each category is scaled independently.

Using FacetGrid for Side-by-Side Comparison

You can also use FacetGrid to create small multiples—side-by-side plots for each category.

Example Code

g = sns.FacetGrid(car_data, col='body_style')
g.map_dataframe(sns.histplot, x='price')
g.set_titles("{col_name} cars")
plt.show()

Choosing Bin Width and Number: The IQR Rule of Thumb

Choosing the right number of bins (or their width) is important:

Too few bins: Oversimplifies and hides patterns
Too many bins: Overcomplicates and introduces noise

A useful rule of thumb, especially with larger datasets, incorporates the interquartile range (IQR).

Steps

1) Find the IQR of your data (difference between 75th and 25th percentiles).

import numpy as np

IQR = np.percentile(car_prices, 75) - np.percentile(car_prices, 25)

2) Calculate bin width:

bin_width = 2 * IQR / (len(car_prices) ** (1/3))

3) Determine the number of bins:

num_bins = round((max(car_prices) - min(car_prices)) / bin_width)

4) Use it in Seaborn:

sns.histplot(car_prices, bins=num_bins)
plt.show()

You can experiment with different bin sizes. The best visualization for reports or presentations may require tweaking these parameters.

Key Takeaways

Seaborn’s histplot provides an efficient way to visualize the distribution of numerical data.
Customization (bins, color, labels) improves data clarity for your audience.
Use the IQR-based rule of thumb for a data-driven approach to choosing bin sizes—but don’t hesitate to experiment.
The choice of bins/granularity can affect interpretability; select based on your analytic goals.

Boxplot

A boxplot (or box-and-whisker plot) is a graphical method for displaying the distribution of a dataset. It highlights the median, the spread of data (via quartiles), and points out potential outliers.

Components of a Boxplot

Median: Line inside the box representing the dataset’s middle value.
First (Q1) & Third (Q3) Quartile: Edges of the box; Q1 (25th percentile), Q3 (75th percentile).
Interquartile Range (IQR): IQR = Q3 − Q1; range of the middle 50% of the data.
Whiskers: Lines extending from the box to the min/max values within 1.5 × IQR from Q1/Q3.
Outliers: Points beyond 1.5 × IQR, shown as individual dots.

Example illustration:
Boxplot

How to Create Boxplots in Python

We use Seaborn, a Python library, to generate boxplots.
Here’s an example using a sample DataFrame with the curb_weight variable from an automobile dataset:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
curb_weights = [2000, 2100, 2050, 1500, 3000, 1300, 2200, 2500, 1100, 2700, 2150, 2900,
                2300, 2400, 1250, 2600, 2350, 2450, 1550, 2650]
df = pd.DataFrame({'curb_weight': curb_weights})

# Create the boxplot
sns.boxplot(y='curb_weight', data=df, color='skyblue')
plt.title('Boxplot of Curb Weight')
plt.ylabel('Curb Weight')
plt.show()

This code produces a boxplot visualizing the curb_weight distribution, highlighting the median, quartiles, and any outliers.

Boxplot

Interpreting a Boxplot

Central Tendency: The median line inside the box shows where most data points lie.
Spread/Variability: The width of the box (IQR) indicates how spread out the data is.
Skewness:
- Right/Positive Skew: Right whisker is longer; median near Q1.
- Left/Negative Skew: Left whisker is longer; median near Q3.
Outliers: Shown as dots outside the whiskers—may indicate measurement errors or special cases.

Example:
If the median is close to Q1 and the right whisker is longer, the data is right-skewed (common in prices or incomes).

Comparing Multiple Boxplots

Boxplots enable comparison across categories.
Example: Compare car prices by body_style.

# Assuming df has 'body_style' and 'price'
sns.boxplot(x='body_style', y='price', data=df, palette='pastel')
plt.title('Vehicle Price Distribution by Body Style')
plt.xlabel('Body Style')
plt.ylabel('Price')
plt.show()

Limitations of Boxplots

No Insight into Modality: Boxplots don’t show if data has multiple peaks.
Doesn’t Show Distribution Shape: You can’t see the “shape” (e.g., normal, skewed) beyond quartiles/outliers.

Histograms: Seeing Shape & Frequency

A histogram shows the frequency of data points within specified intervals (bins). It’s ideal for spotting patterns such as skewness, spread, and multiple modes (peaks/valleys).

Example in Python

plt.hist(df['curb_weight'], bins=7, color='salmon', edgecolor='black')
plt.title('Histogram of Curb Weight')
plt.xlabel('Curb Weight')
plt.ylabel('Frequency')
plt.show()

This reveals how values are distributed across the range.

Boxplots vs. Histograms

Feature	Boxplot	Histogram
Shows Median	✅	❌
Quartiles	✅	❌
Outliers	✅	❌
Distribution Shape	Limited	✅
Frequency	❌	✅

Tip: Use boxplots for a quick statistical summary and histograms for detailed shape exploration.

Five-Number Summary & Outlier Detection

Five-Number Summary: Minimum, Q1, Median (Q2), Q3, Maximum
Boxplots depict these five key statistics visually.
IQR & Outliers: Outliers are data points beyond 1.5 × IQR above Q3 or below Q1.

These may need investigation or separate treatment during data preprocessing.

Key Takeaways

Boxplots are great for comparing distributions, spotting outliers, and summarizing spread.
Histograms provide deeper insights into data shape and modality.
Using both gives a complete statistical and visual understanding of your data—essential for data scientists and analysts.

Bar Graph

Bar graphs show counts or measurements for different categories. Each bar represents a category; its height shows the frequency or count. Use for: Comparing the number of items in each category. Before we get into Bar graphs, lets undestand Categorical vs Numerical datatypes

Categorical Data: Groups or types, such as “sedan,” “convertible,” or “hatchback.”
Numerical Data: Measured quantities like “price” or “count.”

Knowing this distinction helps you choose the correct visualization type.

Why Not Use a Histogram?

Histograms are for numerical data, showing how values are distributed across intervals (bins).
Bar graphs are for categorical data, showing counts of each group.

Example: Counting Car Body Styles

import pandas as pd

# Sample data
cars = [
    {'body_style':'sedan', 'fuel_type':'gasoline'},
    {'body_style':'convertible', 'fuel_type':'gasoline'},
    {'body_style':'sedan', 'fuel_type':'diesel'},
    {'body_style':'hatchback', 'fuel_type':'gasoline'},
]
df = pd.DataFrame(cars)

Basic Bar Graph: Count of Each Body Style

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='body_style')
plt.title('Count of Cars by Body Style')
plt.xlabel('Body Style')
plt.ylabel('Number of Cars')
plt.show()

countplot() is Seaborn’s built-in function for categorical bar graphs.

Sample Output: A simple bar graph showing car counts by body style.

Bargraph

Adding Depth: Using `hue` for Subcategories

You can visualize another categorical variable (like fuel type) within each body style.

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='body_style', hue='fuel_type')
plt.title('Body Style by Fuel Type')
plt.xlabel('Body Style')
plt.ylabel('Count')
plt.show()

This produces side-by-side bars for, say, gasoline vs. diesel cars within each category.

Bargraph

Customizing Your Graphs

You can make your plots clearer and more presentation-ready using customization options.

Example: Enhanced Bar Graph

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='body_style', hue='fuel_type')
plt.title('Distribution of Cars by Body Style and Fuel Type')
plt.xlabel('Body Style')
plt.ylabel('Count')
plt.xticks(rotation=20)
plt.ylim(0, 5)
plt.show()

Bargraph

Common Customizations:

Figure Size: plt.figure(figsize=(width, height))
Rotate Labels: plt.xticks(rotation=30) for readability
Axis Limits: plt.ylim(0, 15)
Titles/Labels: Use plt.title(), plt.xlabel(), plt.ylabel()

Visual Summary Table

Chart Type	Use Case	Data Type	Example Use
Bar Graph	Compare category frequencies	Categorical	Car body types
Histogram	Show value distribution in intervals	Numerical	Car prices, ages
Count Plot	Bar graph for count of each category	Categorical	Car make/model

Key Takeaways

Use bar graphs/count plots for categorical data; use histograms for numerical data.
Seaborn makes it easy to plot, compare, and customize your visualizations.
Customizing graphs improves clarity and impact, making your visuals more effective.

Line Plot

A line plot graphically represents data points connected by straight lines. It’s best suited for time series data, allowing you to quickly detect trends, cycles (like seasonality), and variability.

Why Use Line Plots?

Visualize time series data to track changes over time
Spot seasonality — repeating patterns (e.g., more airline passengers in summer)
Analyze trends — long-term upward or downward movements
Plot multiple lines to compare categories
Show confidence intervals to visualize variability or certainty in averages

Sample Data: Monthly Airline Passengers

Here’s a small artificial dataset mimicking historical airline passenger data to demonstrate line plot concepts.

# Sample 'flights' style dataset (Month-Year-Passengers)
import pandas as pd

data = {
    'year':  [1949]*6 + [1950]*6,
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']*2,
    'passengers': [112, 118, 132, 129, 121, 135,   # 1949
                   145, 150, 178, 163, 172, 185]   # 1950
}
df = pd.DataFrame(data)
df.head()

Basic Line Plot with Seaborn

Seaborn makes it easy to generate line plots, including confidence intervals by default.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,4))
sns.lineplot(data=df, x='month', y='passengers')
plt.title('Monthly Airline Passengers (Sample Data)')
plt.ylabel('Number of Passengers')
plt.xlabel('Month')
plt.show()

LinePlot

Visualizing Seasonality and Trends

Seasonality: Repeating patterns (e.g., more passengers in specific months)
Trend: Persistent increase/decrease across years

plt.figure(figsize=(10,5))
sns.lineplot(data=df, x='month', y='passengers', hue='year', marker='o')
plt.title('Passenger Counts: Seasonality & Trend by Year')
plt.ylabel('Number of Passengers')
plt.legend(title='Year')
plt.show()

Interpretation:
The pattern repeats each year (seasonality), but 1950 shows higher values overall (trend).

LinePlot

Customizing Line Plots

Seaborn supports multiple customization features to make multi-category plots clear.

plt.figure(figsize=(10,5))
sns.lineplot(
    data=df, x='month', y='passengers', 
    hue='year', style='year', markers=True, dashes=False
)
plt.title('Customized Line Plot: Hue and Style by Year')
plt.ylabel('Number of Passengers')
plt.show()

Parameters:

hue: Groups by categorical variable (year)
style: Differentiates line styles per group
markers=True: Adds data point markers

Confidence Intervals

Seaborn’s line plot can display confidence intervals—a shaded region around each line that shows variability or reliability.

# Simulate more categories for demonstration
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df2 has columns: month, passengers, region  (one row per month×region)
# Create 30 replicated measurements with noise for each month×region
reps = []
for i in range(30):
    tmp = df2.copy()
    tmp["passengers"] = tmp["passengers"] + np.random.normal(0, 6, size=len(tmp))
    tmp["rep"] = i
    reps.append(tmp)
df_rep = pd.concat(reps, ignore_index=True)

plt.figure(figsize=(10,5))
sns.lineplot(
    data=df_rep, x="month", y="passengers", hue="region",
    estimator="mean", errorbar=("ci", 95)  # seaborn ≥0.12
)
plt.title("Line Plot with 95% CI by Region")
plt.ylabel("Number of Passengers")
plt.show()

The shaded band represents the 95% confidence interval, showing the range where most points likely fall.

LinePlot

Real-world Example: Functional MRI Data

Line plots are commonly used in neuroscience to show average signals over time across brain regions.

# Simulate average signal in two brain regions
df_fmri = pd.DataFrame({
    'timepoint': list(range(10)) + list(range(10)),
    'signal': [0.2, 0.3, 0.6, 0.8, 1.0, 1.2, 1.1, 0.9, 0.7, 0.5] +
              [0.1, 0.15, 0.3, 0.4, 0.35, 0.38, 0.32, 0.3, 0.18, 0.14],
    'region': ['Anterior']*10 + ['Posterior']*10
})

plt.figure(figsize=(8,4))
sns.lineplot(data=df_fmri, x='timepoint', y='signal', hue='region', marker='o')
plt.title('Functional MRI Signal Over Time')
plt.ylabel('Average Signal')
plt.xlabel('Time Point')
plt.show()

This visualization compares two brain regions over time, making signal trends and timing differences easy to spot.

LinePlot

Key Takeaways

Line plots are essential for time series visualization.
They reveal seasonal effects, long-term trends, and comparative patterns.
Seaborn provides intuitive options for customization and confidence intervals.
Widely used in fields like finance, healthcare, and neuroscience.

Univariate Analysis - Real Estate Data Analysis

In this section, we’ll put togther what we learned so far w.r.t univariate analysis and walk through an exploratory analysis of a real estate dataset—identifying outliers and trends for distance to CBD, land size, building area, price, number of rooms, and regional distribution.

Sample Data

Let’s create a small, representative synthetic dataset:

import pandas as pd

data = {
    "Distance": [5, 12, 18, 22, 30, 7, 60],  # km from CBD
    "Landsize": [400, 650, 1200, 55000, 620, 200, 70000], # sq meters
    "BuildingArea": [150, 250, 900, 300, 170, 1000, 80],  # sq meters
    "Price": [700000, 850000, 1250000, 1800000, 3200000, 600000, 510000], # AUD
    "Rooms": [3, 4, 2, 3, 8, 2, 7],
    "Regionname": [
        "Northern Metropolitan",
        "Southern Metropolitan",
        "Western Metropolitan",
        "Southern Metropolitan",
        "Eastern Metropolitan",
        "Western Metropolitan",
        "Northern Metropolitan"
    ]
}
df = pd.DataFrame(data)
df

Distance Analysis

How far are properties from the Central Business District (CBD)?

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,3))
sns.histplot(df['Distance'], bins=6, kde=False)
plt.title("Distribution of Distances from CBD (km)")
plt.xlabel("Distance (km)")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(4,3))
sns.boxplot(y=df['Distance'])
plt.title("Boxplot of Distance from CBD (km)")
plt.show()

Insights:

Most houses are under 20 km from CBD.
Properties >25 km are outliers but not necessarily errors.

Univariate Case Study

Landsize Analysis

Convert land size to thousands of square meters for better scaling.

df['Landsize_k'] = df['Landsize'] / 1000

plt.figure(figsize=(6,3))
sns.histplot(df['Landsize_k'], bins=7, kde=False)
plt.title("Distribution of Landsize (Thousands sq m)")
plt.xlabel("Landsize (1000 sq m)")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(4,3))
sns.boxplot(y=df['Landsize_k'])
plt.title("Boxplot of Landsize (1000 sq m)")
plt.show()

Note: Houses with landsize above 60,000 sq m are extremely rare and considered outliers.

Univariate Case Study

Building Area Analysis

plt.figure(figsize=(6,3))
sns.histplot(df["BuildingArea"], bins=6, kde=False)
plt.title("Building Area Distribution (sq m)")
plt.xlabel("Building Area (sq m)")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(4,3))
sns.boxplot(y=df["BuildingArea"])
plt.title("Boxplot: Building Area (sq m)")
plt.show()

Findings:

Most homes fall between 100–300 sq m.
Outliers (e.g., 900–1000 sq m) may represent luxury homes or data anomalies.

Price Analysis

plt.figure(figsize=(6,3))
sns.histplot(df["Price"], bins=6, kde=False)
plt.title("Distribution of Prices (AUD)")
plt.xlabel("Price (AUD)")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(4,3))
sns.boxplot(y=df["Price"])
plt.title("Boxplot: Prices (AUD)")
plt.show()

Takeaway:

Most properties are between $600K–$2M.
Prices above $2.5M are outliers worth investigating.

Room Count Analysis

plt.figure(figsize=(6,3))
sns.histplot(df["Rooms"], bins=6, discrete=True)
plt.title("Number of Rooms Distribution")
plt.xlabel("Rooms")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(4,3))
sns.boxplot(y=df["Rooms"])
plt.title("Boxplot: Number of Rooms")
plt.show()

Observation:

Median = 3 rooms.
Properties with 7–8 rooms are statistical outliers.

Region Analysis

plt.figure(figsize=(7,3))
sns.countplot(x=df["Regionname"])
plt.title("Properties by Region")
plt.xlabel("Region")
plt.ylabel("Count")
plt.xticks(rotation=30)
plt.show()

Points:

Western and Southern Metropolitan regions dominate this sample.
Region-level analysis reveals market activity and geographical spread.

Univariate Case Study

Conclusion

By systematically exploring distance from CBD, land size, building area, price, rooms, and region, we can:

Spot trends (e.g., city-centric housing markets)
Identify outliers (huge mansions, high prices)
Begin cleaning and refining data for predictive modeling

Visualization techniques for Bivariate and Multivariage analysis

Scatter Plot

A line plot compares trends over time or ordered categories. A scatter plot shows relationships (and possible correlations) between two variables.

Creating Sample Data

Let’s work with a sample automobile dataset including variables like engine size, horsepower, curb weight, bore, stroke, and fuel type.

import pandas as pd

# Sample automobile dataset
data = {
    'engine_size': [98, 120, 140, 160, 180, 200],
    'horsepower': [70, 88, 95, 110, 130, 145],
    'curb_weight': [2200, 2500, 2800, 3000, 3200, 3500],
    'bore': [3.0, 3.1, 3.2, 3.3, 3.5, 3.7],
    'stroke': [3.2, 3.2, 3.3, 3.4, 3.5, 3.5],
    'fuel_type': ['gasoline', 'diesel', 'gasoline', 'diesel', 'gasoline', 'diesel']
}

df = pd.DataFrame(data)
df

Line Plot: Comparing Two Numerical Variables

While often used for time series, a line plot can also compare two numerical columns side-by-side.

import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.plot(df['engine_size'], df['horsepower'], marker='o')
plt.title('Engine Size vs Horsepower')
plt.xlabel('Engine Size')
plt.ylabel('Horsepower')
plt.grid(True)
plt.show()

Scatter plot

Scatter Plot: Visualizing Relationships

Scatter plots display pairs of values as dots, making patterns and relationships easy to spot.

plt.figure(figsize=(8,5))
plt.scatter(df['engine_size'], df['horsepower'])
plt.title('Engine Size vs Horsepower (Scatter Plot)')
plt.xlabel('Engine Size')
plt.ylabel('Horsepower')
plt.grid(True)
plt.show()

Scatter plot

Enhanced Scatter Plots: Using Color and Markers

Visuals can be made more insightful by using different colors or markers for categories like fuel type.

import seaborn as sns

plt.figure(figsize=(8,5))
sns.scatterplot(
    data=df,
    x='engine_size',
    y='horsepower',
    hue='fuel_type',
    style='fuel_type',
    s=100
)
plt.title('Engine Size vs Horsepower by Fuel Type')
plt.xlabel('Engine Size')
plt.ylabel('Horsepower')
plt.grid(True)
plt.show()

Scatter plot

Exploring Correlations

Scatter plots also reveal correlation:

Positive correlation: As one variable increases, so does the other.
Negative correlation: As one increases, the other decreases.
No correlation: No clear pattern.

Curb Weight vs Engine Size

plt.figure(figsize=(8,5))
plt.scatter(df['curb_weight'], df['engine_size'])
plt.title('Curb Weight vs Engine Size')
plt.xlabel('Curb Weight')
plt.ylabel('Engine Size')
plt.grid(True)
plt.show()

Observation: Typically shows strong positive correlation in cars.

Bore vs Engine Size

plt.figure(figsize=(8,5))
plt.scatter(df['bore'], df['engine_size'])
plt.title('Bore vs Engine Size')
plt.xlabel('Bore')
plt.ylabel('Engine Size')
plt.grid(True)
plt.show()

Observation: May not show a clear trend—an example of little or no correlation.

Quantifying Correlation

Let’s calculate correlation coefficients to quantify relationships.

corr_matrix = df[['engine_size', 'horsepower', 'curb_weight', 'bore']].corr()
corr_matrix

Scatter plot

Interpretation:

Values close to 1 → strong positive correlation
Values close to -1 → strong negative correlation
Values near 0 → little to no correlation

Key Takeaways

Line plots show trends between two numerical variables.
Scatter plots are ideal for visualizing relationships and correlation.
Color and marker types add category info without clutter.
Not all variable pairs are correlated—visual inspection + correlation coefficients give the full picture.

Pair Plot

A pair plot (also known as a scatterplot matrix) is a visualization that shows pairwise relationships between multiple numerical variables in a dataset. Pair plots combine scatter plots (for each variable pair) and histograms (on the diagonal)

Multivariate Data Visualization

Multivariate data visualization is about representing multiple numerical and categorical values from a dataset to understand their interplay. The main goal is to reveal patterns and correlations that can drive insightful analysis and decision-making.

Key Plot Types Explained

One-Dimensional Data

Description: Visualization of a single numerical variable.
Common Plot: Histogram — shows the distribution of that variable.

Categorical Data

Description: Visualization of a single variable with categorical values (e.g., car type).
Plots: Bar charts or count plots.

Two Numerical Values

Description: Assessing the relationship between two numeric variables.
Plot: Scatter Plot (often with or without regression line, e.g., LM plot).

Multiple Numerical Values

Challenge: Visualizing all pairwise relationships between many variables.
Solution: Pair plots combine scatter plots (for each variable pair) and histograms (on the diagonal).

Sample Dataset

We simulate a simple dataset with numerical and categorical variables.

import pandas as pd
import numpy as np

# Set reproducibility
np.random.seed(0)

# Create a DataFrame
data = pd.DataFrame({
    'Horsepower': np.random.normal(130, 30, 50),  # Simulated car horsepower
    'Weight': np.random.normal(3000, 400, 50),    # Simulated car weight in pounds
    'MPG': np.random.normal(25, 5, 50),           # Miles per gallon
    'Doors': np.random.choice(['Two', 'Four'], 50) # Car door type (categorical)
})

data.head()

Basic Pair Plot

Visualize all binary relationships among numerical variables.

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(data)
plt.suptitle('Basic Pair Plot: Relationships & Distributions', y=1.02)
plt.show()

Pair plot

What to Notice:

Scatter plots (off-diagonal): Relationships between numerical variable pairs.
Histograms (diagonal): Distribution of each individual variable.

Pair Plot with Hue (Categorical Coloring)

Distinguish categories in scatter plots using color.

sns.pairplot(data, hue='Doors')
plt.suptitle('Pair Plot with Hue: Comparing Two- vs. Four-Door Cars', y=1.02)
plt.show()

Pair plot

Interpretation Tip:
By adding hue='Doors', you can quickly spot group differences—e.g., if two-door or four-door cars cluster differently in terms of weight or MPG.

Lower Triangular Matrix for Simplicity

Displaying only the lower triangle reduces visual clutter (since pair plots are symmetric).

sns.pairplot(data, hue='Doors', corner=True)
plt.suptitle('Optimized Pair Plot (Lower Triangle Only)', y=1.02)
plt.show()

Pair plot

Interpreting the Visualizations

Scatter Plots (Off-diagonal): Show how two variables move together; trends may indicate correlation (linear or otherwise).
Histograms (Diagonal): Reveal spread, skew, or outliers in individual variables.
Color (Hue): Highlights distribution/pattern of different categories, making it easier to discern group-based differences.
Lower Triangular Plot: Simplifies interpretation by removing duplicated information.

Summary Table

Technique	Use-Case	Visualization Type	Sample Code Snippet
Histogram	Distribution of a single variable	Numeric variable	`sns.histplot(data['MPG'])`
Scatter Plot	Relationship between two variables	Two numeric variables	`sns.scatterplot(...)`
Pair Plot	All pairwise relationships	Multiple numeric variables	`sns.pairplot(data)`
Pair Plot with Hue	Categorical subgroup analysis	Numeric + categorical	`sns.pairplot(data, hue=...)`
Lower Triangular Pair Plot	Reduce clutter, focus on unique pairs	Multiple, simplified	`sns.pairplot(..., corner=True)`

Key Takeaways

Pair plots are a versatile, powerful tool for exploratory data analysis (EDA), enabling you to grasp both individual distributions and inter-variable relationships at a glance.

Leveraging features like hue for categorical distinction and corner=True for clarity, pair plots can simplify your workflow and boost your data interpretation skills.

Heat Maps

What is a Correlation Matrix?

A correlation matrix is a table that shows correlation coefficients between variables. Each cell in the table displays the strength and direction of the relationship between two variables. This matrix forms the foundation for creating heat maps.

Creating a Correlation Matrix in Python

Let’s use a small, handcrafted dataset to demonstrate:

import pandas as pd

# Sample data
data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 80000, 82000, 90000],
    'Experience': [1, 4, 15, 20, 29],
    'Satisfaction': [3.2, 3.8, 4.6, 4.4, 4.9]
}

df = pd.DataFrame(data)

# Compute correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

Output (example):

	Age	Salary	Experience	Satisfaction
Age	1.00	0.95	0.98	0.88
Salary	0.95	1.00	0.94	0.89
Experience	0.98	0.94	1.00	0.85
Satisfaction	0.88	0.89	0.85	1.00

Visualizing with a Seaborn Heat Map

The Seaborn library allows us to create attractive heat maps easily.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(7,5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heat Map')
plt.show()

Heat map

A color map, where shades indicate the strength and direction of correlation.

Interpreting the Heat Map

Color Scale: Light colors represent higher correlations (closer to +1 or -1), while darker ones indicate weaker correlations (closer to 0).
Annotation: With annot=True, each cell directly shows the correlation coefficient.
Diagonal: Each variable’s correlation with itself is always 1.0.

Customizing Colors

You can adjust the color scheme via the cmap argument to match your theme or highlight certain ranges.

sns.heatmap(corr_matrix, annot=True, cmap='YlGnBu')
plt.title('Correlation Matrix Heat Map (Custom Colors)')
plt.show()

Heat map

Popular colormaps include: 'viridis', 'YlGnBu', 'magma', 'coolwarm'.

Why Use Heat Maps?

🎯 Pattern Recognition: Quickly spot strong correlations (positive or negative).
⚙️ Outlier Detection: Identify unexpected relationships or anomalies.
🧠 Feature Engineering: Detect redundant features for model optimization.

Complete Example: End-to-End in Google Colab

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Create sample data
data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [50000, 60000, 80000, 82000, 90000],
    'Experience': [1, 4, 15, 20, 29],
    'Satisfaction': [3.2, 3.8, 4.6, 4.4, 4.9]
}
df = pd.DataFrame(data)

# Step 2: Compute correlation matrix
corr_matrix = df.corr()
print('Correlation matrix:')
print(corr_matrix)

# Step 3: Plot heat map with annotations and custom color
plt.figure(figsize=(7,5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heat Map')
plt.show()

Key Takeaways

Heat maps are a vital tool in a data scientist’s toolbox for visualizing how dataset features interact.
By following these steps, you can generate clear, informative heat maps for any dataset—helping you uncover trends and make smarter data-driven decisions.

Previous article Next article

Exploratory Data Analysis with Python: An Overview

Table of contents

Data Cleaning in Python

1. What is Exploratory Data Analysis (EDA)?

2. Sample Dataset Overview

3. Key Steps in EDA

🔹 Step 1: Sanity Checks

🔹 Step 2: Univariate Analysis

🔹 Step 3: Bivariate Analysis

🔹 Step 4: Handling Missing Values

🔹 Step 5: Detecting Outliers

4. Data Cleaning and Preparation

5. Summary Statistics and Categorical Analysis

Numerical Summary

Categorical Overview

6. Insights and Observations

🧠 Conclusion

Visualization techniques for Univariate analysis

Histogram

What is a Histogram?

Creating a Histogram with Seaborn

Example Code

Customizing the Histogram

Example Code

Histogram - Kernel Density Estimate (KDE) curve

Example Code

Comparing Categories with Multiple Histograms

Example Code (Overlayed Histogram)

Using FacetGrid for Side-by-Side Comparison

Example Code

Choosing Bin Width and Number: The IQR Rule of Thumb

Steps

Key Takeaways

Boxplot

Components of a Boxplot

How to Create Boxplots in Python

Interpreting a Boxplot

Comparing Multiple Boxplots

Limitations of Boxplots

Histograms: Seeing Shape & Frequency

Example in Python

Boxplots vs. Histograms

Five-Number Summary & Outlier Detection

Key Takeaways

Bar Graph

Why Not Use a Histogram?

Example: Counting Car Body Styles

Basic Bar Graph: Count of Each Body Style

Adding Depth: Using hue for Subcategories

Customizing Your Graphs

Example: Enhanced Bar Graph

Visual Summary Table

Key Takeaways

Line Plot

Why Use Line Plots?

Sample Data: Monthly Airline Passengers

Basic Line Plot with Seaborn

Visualizing Seasonality and Trends

Customizing Line Plots

Confidence Intervals

Real-world Example: Functional MRI Data

Key Takeaways

Univariate Analysis - Real Estate Data Analysis

Sample Data

Distance Analysis

Landsize Analysis

Building Area Analysis

Price Analysis

Room Count Analysis

Region Analysis

Conclusion

Visualization techniques for Bivariate and Multivariage analysis

Scatter Plot

Creating Sample Data

Line Plot: Comparing Two Numerical Variables

Scatter Plot: Visualizing Relationships

Enhanced Scatter Plots: Using Color and Markers

Exploring Correlations

Curb Weight vs Engine Size

Bore vs Engine Size

Adding Depth: Using `hue` for Subcategories