Foundations of Machine Learning: Inside Python’s Pandas Library

by Selwyn Davidraj     Posted on October 01, 2025

In this blog, we will explore Pandas, a powerful Python library designed for data manipulation and analysis. If you’re diving into data science or want to automate data-related tasks, Pandas will be your go-to toolkit.


What is Pandas?

Pandas is a powerful, open-source Python library that provides flexible and expressive data structures to work with structured (tabular) data — much like working with spreadsheets, but more programmatic and scalable.

✨ Key Features

  • Two main data structures: Series (1D) and DataFrame (2D)
  • Easy data reading and writing
  • Quick summary statistics and visualization
  • Seamless integration with NumPy

Importing Pandas

To start using Pandas, import it at the top of your script or notebook.
The standard import pattern is:

import pandas as pd

This allows you to use the concise pd prefix when calling Pandas functions.
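The “seamless integration with NumPy” mentioned above is easy to see in a quick sketch: NumPy functions work element-wise on a Series and preserve its labels (the values and names here are just illustrative).

```python
import numpy as np
import pandas as pd

# Build a Series directly from a NumPy array
arr = np.array([1.0, 4.0, 9.0])
s = pd.Series(arr, index=['a', 'b', 'c'])

# NumPy ufuncs operate element-wise on a Series and keep its labels
roots = np.sqrt(s)
print(roots)          # a: 1.0, b: 2.0, c: 3.0

# Recover the underlying NumPy array when needed
print(s.to_numpy())   # [1. 4. 9.]
```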

Pandas Data Structures: Series and DataFrames

  • Series: A one-dimensional labeled array.
  • DataFrame: A two-dimensional table (like an Excel sheet or SQL table).

Creating a Pandas Series

Let’s create a simple list of drug prices, then convert it into a Series.

import pandas as pd

# Sample drug prices in USD
drug_prices = [13.5, 24.8, 7.99, 4.25]
drug_series = pd.Series(drug_prices, index=['Aspirin', 'Paracetamol', 'Ibuprofen', 'Cetirizine'])

print(drug_series)

Output:

Aspirin       13.50
Paracetamol   24.80
Ibuprofen      7.99
Cetirizine     4.25
dtype: float64

Manipulating Pandas Series

You can easily perform arithmetic on Series objects:

# Increase all drug prices by $2.50
updated_prices = drug_series + 2.50
print(updated_prices)

Output:

Aspirin       16.00
Paracetamol   27.30
Ibuprofen     10.49
Cetirizine     6.75
dtype: float64

Or update data directly:

# Replace data with a new set
new_prices = [15.0, 28.5, 8.39, 5.10]
drug_series[:] = new_prices
print(drug_series)

Creating Pandas DataFrames

A DataFrame is like a spreadsheet or SQL table. Let’s create one with fictional student grades:

# Sample student data
student_data = {
    "Name": ["Asha", "Ben", "Carla", "Dan"],
    "Grade": [88, 92, 77, 85]
}

students_df = pd.DataFrame(student_data)
print(students_df)

Output:

    Name  Grade
0   Asha     88
1    Ben     92
2  Carla     77
3    Dan     85

Indexing and Labeling in Pandas

You can assign custom indices to DataFrames and easily access specific data.

# Assign new index labels
students_df.index = ['Student 1', 'Student 2', 'Student 3', 'Student 4']
print(students_df)

# Select the 'Grade' column
print(students_df['Grade'])

# Access row by label
print(students_df.loc['Student 2'])

Adding and Updating Values

You can easily add new columns or modify existing ones.

# Add a new column indicating pass/fail
students_df['Pass'] = students_df['Grade'] >= 80
print(students_df)

# Update a specific cell
students_df.at['Student 3', 'Grade'] = 80
print(students_df)

Working with Different Data Sources

Pandas can create DataFrames from various sources — dictionaries, lists, or NumPy arrays.

From a Dictionary

# Sample energy data
energy_data = {
    "City": ["New York", "London", "Tokyo"],
    "Consumption_kWh": [2500, 1800, 2700]
}
energy_df = pd.DataFrame(energy_data)
print(energy_df)

From a NumPy Array

import numpy as np

# Random dataset: 3 rows × 2 columns
array_data = np.random.rand(3, 2)
array_df = pd.DataFrame(array_data, columns=['Col1', 'Col2'])
print(array_df)

Accessing and Modifying Series

Creating a Sample Dataset

We’ll define operators and revenues as lists, then use them to create a Pandas Series.

# Sample data: Mobile operators and their 2020 revenues (in billions USD)
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]

# Creating a Pandas Series
revenue_series = pd.Series(data=revenues, index=operators)
print(revenue_series)

Output:

AT&T       171
Verizon    128
T-Mobile    68
Sprint      33
dtype: int64

Indexing and Accessing Values

Pandas Series can be indexed in two ways:

  • By numeric position (like a list)
  • By label (using the operator’s name)

Access by position:

# Accessing by position (use .iloc for positional access)
print(revenue_series.iloc[0])  # Output: 171 (AT&T)

Access by label:

# Accessing by label
print(revenue_series['T-Mobile'])  # Output: 68

Slicing the Series

Pandas supports slicing, similar to Python lists or NumPy arrays.

By numeric index (end-exclusive):

# Slicing by position (index 1 up to but not including 3)
print(revenue_series.iloc[1:3])

Output:

Verizon    128
T-Mobile    68
dtype: int64

By label (end-inclusive):

# Slicing by label (from 'Verizon' to 'T-Mobile', inclusive)
print(revenue_series['Verizon':'T-Mobile'])

Output:

Verizon    128
T-Mobile    68
dtype: int64

Key Point:

  • Numeric slicing [start:end] is end-exclusive
  • Label slicing ['label1':'label2'] is end-inclusive

Accessing Multiple Items

You can select multiple non-contiguous entries by passing a list of indices or labels.

By position:

# Multiple entries by position
print(revenue_series.iloc[[0, 2]])

Output:

AT&T       171
T-Mobile    68
dtype: int64

By label:

# Multiple entries by label
print(revenue_series[['AT&T', 'Sprint']])

Output:

AT&T     171
Sprint    33
dtype: int64

Full Sample Code (Copy/Paste for Google Colab)

import pandas as pd

# Sample data
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]

# Create Series
revenue_series = pd.Series(data=revenues, index=operators)

# Display Series
print("Full Series:")
print(revenue_series)

# Access by position
print("\nAT&T revenue (by position):")
print(revenue_series.iloc[0])

# Access by label
print("\nT-Mobile revenue (by label):")
print(revenue_series['T-Mobile'])

# Slicing by numeric position
print("\nSlice by position [1:3]:")
print(revenue_series.iloc[1:3])

# Slicing by label
print("\nSlice by label ['Verizon':'T-Mobile']:")
print(revenue_series['Verizon':'T-Mobile'])

# Access multiple entries by position
print("\nMultiple entries by position [0, 2]:")
print(revenue_series.iloc[[0, 2]])

# Access multiple entries by label
print("\nMultiple entries by label ['AT&T', 'Sprint']:")
print(revenue_series[['AT&T', 'Sprint']])

Accessing DataFrames

Creating a Sample DataFrame

Let’s create a DataFrame representing simple customer transactions.

import pandas as pd

# Sample dataset as a dictionary
data = {
    'CustomerID': [101, 102, 103, 104],
    'Location': ['New York', 'Mumbai', 'Berlin', 'Tokyo'],
    'Gender': ['F', 'M', 'M', 'F'],
    'StoreType': ['Retail', 'Online', 'Retail', 'Online'],
    'Quantity': [2, 5, 3, 1],
    'Bill': [40.75, 120.00, 63.00, 20.50]
}

# Creating DataFrame
df = pd.DataFrame(data)
df

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill
0         101  New York      F    Retail         2   40.75
1         102    Mumbai      M    Online         5  120.00
2         103    Berlin      M    Retail         3   63.00
3         104     Tokyo      F    Online         1   20.50

Accessing Data in DataFrames

Access Columns

You can access a column using its name inside square brackets.

# Accessing the 'Location' column
print(df['Location'])

Output:

0    New York
1      Mumbai
2      Berlin
3       Tokyo
Name: Location, dtype: object

Access Rows by Index

You can use .iloc[] to access rows by their numeric index.

# Access the third row (index 2)
print(df.iloc[2])

Output:

CustomerID      103
Location      Berlin
Gender             M
StoreType      Retail
Quantity           3
Bill            63.0
Name: 2, dtype: object

Access Specific Values

Use .at[row, column] for fast scalar access.

# Access the 'Bill' for the second customer (index 1)
print(df.at[1, 'Bill'])

Output:

120.0

Slicing and Advanced Indexing

You can slice rows or reverse them, similar to Python lists.

# Get the first 3 rows
print(df[:3])

# Get every other row
print(df[::2])

# Reverse the DataFrame rows
print(df[::-1])

Modifying DataFrames

Adding a New Column

Let’s add a new column called Discount.

df['Discount'] = [5, 15, 0, 2]
print(df)

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill  Discount
0         101  New York      F    Retail         2   40.75         5
1         102    Mumbai      M    Online         5  120.00        15
2         103    Berlin      M    Retail         3   63.00         0
3         104     Tokyo      F    Online         1   20.50         2

Updating Values

You can update values conditionally using .loc[].

# Give a 10% discount to all 'Online' purchases
df.loc[df['StoreType'] == 'Online', 'Discount'] = 10
print(df)

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill  Discount
0         101  New York      F    Retail         2   40.75         5
1         102    Mumbai      M    Online         5  120.00        10
2         103    Berlin      M    Retail         3   63.00         0
3         104     Tokyo      F    Online         1   20.50        10

Comparing DataFrames and NumPy Arrays

Feature                  NumPy Array            pandas DataFrame
Data types per column    Same type per array    Can differ per column
Access                   By index only          By index or column name
Flexibility              Numeric focus          Works well with structured, real-world data
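The per-column dtype difference in the table above shows up immediately in code. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for every element,
# so mixing numbers and strings coerces everything to strings...
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype)    # a string dtype such as <U21

# ...whereas a DataFrame keeps one dtype per column
df = pd.DataFrame({'num': [1, 2], 'letter': ['a', 'b']})
print(df.dtypes)    # num: int64, letter: object
```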

Accessing and Modifying DataFrames (loc and iloc)

Creating a Sample DataFrame

Let’s start by creating a simple DataFrame to work with:

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Paris', 'London', 'Berlin'],
    'Score': [85, 90, 78, 88]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Score
0    Alice   24  New York     85
1      Bob   27     Paris     90
2  Charlie   22    London     78
3    David   32    Berlin     88

Accessing Rows and Columns

Accessing Columns

# Accessing a column by name
print(df['Name'])

Output:

0     Alice
1       Bob
2   Charlie
3     David
Name: Name, dtype: object

Accessing Rows with Square Brackets

# Access a row by label or position
print(df.loc[1])    # Access by index label (row 1)
print(df.iloc[2])   # Access by integer position (row 2)

Using loc and iloc

  • .loc[] uses labels (row and column names).
  • .iloc[] uses integer indices (row and column positions).

Accessing Specific Entries

# Access entry using loc [row label, column name]
print(df.loc[0, 'City'])   # Output: New York

# Access entry using iloc [row index, column index]
print(df.iloc[2, 3])       # Output: 78

Accessing Multiple Rows/Columns

# Access multiple rows and columns
print(df.loc[[1, 3], ['Name', 'Score']])
print(df.iloc[[0, 2], [1, 2]])

Modifying Entries in DataFrame

You can update entries using .loc[] and .iloc[] combined with assignment.

# Modify a single value
df.loc[1, 'Score'] = 92

# Modify multiple values
df.loc[[0, 2], 'Score'] = [87, 80]

# Using iloc
df.iloc[3, 1] = 33   # Updates David's age
print(df)

Updated DataFrame:

      Name  Age      City  Score
0    Alice   24  New York     87
1      Bob   27     Paris     92
2  Charlie   22    London     80
3    David   33    Berlin     88

Visualizing Changes

Let’s see the score distribution as a simple bar chart using matplotlib.

import matplotlib.pyplot as plt

plt.bar(df['Name'], df['Score'], color='skyblue')
plt.xlabel('Name')
plt.ylabel('Score')
plt.title('Scores of Individuals')
plt.show()

This visualization helps confirm the updates made to our data.

Summary Table

Operation                       Code Example                        Description
Access a column                 df['Name']                          Get column as Series
Access a row by label           df.loc[1]                           Returns a Series
Access a row by position        df.iloc[1]                          Returns a Series
Access a specific entry         df.loc[2, 'City']                   Get Charlie’s city
Access multiple rows/columns    df.loc[[0, 2], ['Name', 'Score']]   Subset of rows/columns
Modify an entry with loc/iloc   df.loc[1, 'Score'] = 92             Updates Bob’s score

Some Takeaways

  • Use square brackets to quickly access rows or columns.
  • Use .loc[] for labels and .iloc[] for indices.
  • Both methods can select or modify entries efficiently.
  • Mastering .loc and .iloc is essential for interactive data exploration and data manipulation in Pandas.
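One more pattern worth knowing: .loc also accepts a boolean condition for the rows together with a list of column names. A self-contained sketch using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Score': [85, 90, 78, 88]
})

# Rows where Score is at least 85, restricted to two columns
top = df.loc[df['Score'] >= 85, ['Name', 'Score']]
print(top)
```

This returns only Alice, Bob, and David, with just the Name and Score columns.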

Accessing and Modifying DataFrames (condition-based indexing)

Creating a Sample DataFrame

Let’s use a small DataFrame representing store purchases for easy experimentation.

import pandas as pd

data = {
    'customer_id': [1, 2, 3, 4, 5],
    'product': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'quantity': [2, 1, 5, 3, 1],
    'price': [0.99, 0.59, 2.99, 1.49, 4.99]
}
df = pd.DataFrame(data)
df

Output:

   customer_id     product  quantity  price
0            1       apple         2   0.99
1            2      banana         1   0.59
2            3      cherry         5   2.99
3            4        date         3   1.49
4            5  elderberry         1   4.99

Accessing & Modifying DataFrame Values

Access by Index and Labels

  • .loc: Access by label (row and column names).
  • .iloc: Access by position (row and column numbers).

# Using .loc (label)
print(df.loc[2, 'product'])  # Output: cherry

# Using .iloc (position)
print(df.iloc[0, 2])  # Output: 2 (quantity of first row)

Logical Indexing in DataFrames

Logical indexing lets you filter rows by specific conditions.

# Find all rows where quantity is greater than 1
filtered = df[df['quantity'] > 1]
print(filtered)

Output:

   customer_id product  quantity  price
0            1   apple         2   0.99
2            3  cherry         5   2.99
3            4    date         3   1.49

💡 Key Insight: Logical indexing is crucial for filtering and selecting subsets of data efficiently.
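Conditions can also be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses. A sketch using the same purchase data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'quantity': [2, 1, 5, 3, 1],
    'price': [0.99, 0.59, 2.99, 1.49, 4.99]
})

# quantity above 1 AND price under $2: each condition in its own parentheses
cheap_bulk = df[(df['quantity'] > 1) & (df['price'] < 2.00)]
print(cheap_bulk)
```

Only the apple and date rows satisfy both conditions.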

➕ Adding Columns

Creating new columns in a DataFrame is simple and intuitive.

# Add a 'rating' column
df['rating'] = [4.5, 4.0, 4.2, 4.7, 4.8]
print(df)

Output:

   customer_id     product  quantity  price  rating
0            1       apple         2   0.99     4.5
1            2      banana         1   0.59     4.0
2            3      cherry         5   2.99     4.2
3            4        date         3   1.49     4.7
4            5  elderberry         1   4.99     4.8

➖ Dropping Columns

Use .drop() to remove columns. By default, it returns a new DataFrame.

# Drop the 'customer_id' column
df_dropped = df.drop('customer_id', axis=1)
print(df_dropped)

# Drop in place
df.drop('customer_id', axis=1, inplace=True)
print(df)

Dropping Rows

You can remove rows by their index number.

# Drop the row at index 1
df_dropped_row = df.drop(1, axis=0)
print(df_dropped_row)

Resetting Indices

After dropping rows, the index might be non-sequential. Reset it using .reset_index().

# Reset the index after dropping a row
df_reset = df.drop(1).reset_index(drop=True)
print(df_reset)

Copying DataFrames

To modify data without affecting the original, use .copy().

# Create a copy of the DataFrame
df_copy = df.copy()
# Changes to df_copy do NOT affect df
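To see why .copy() matters, compare it with a plain assignment, which only creates a second name for the same object (a small sketch with illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

alias = df            # plain assignment: a second name for the SAME object
snapshot = df.copy()  # an independent copy

df.loc[0, 'x'] = 99
print(alias.loc[0, 'x'])     # 99, the alias sees the change
print(snapshot.loc[0, 'x'])  # 1, the copy is unaffected
```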

Recap

  • Access DataFrame values using .loc and .iloc.
  • Filter rows efficiently with logical indexing.
  • Add new columns easily with direct assignment.
  • Drop columns/rows with .drop(); use inplace=True to modify the original DataFrame.
  • Reset indices when needed using .reset_index().
  • Copy DataFrames safely using .copy() to prevent unintended changes.


Combining DataFrames

In data science, you’ll often work with data spread across multiple sources. Combining these datasets efficiently is crucial for comprehensive analysis. Pandas offers several flexible methods for merging, joining, and concatenating DataFrames.

Sample Data

We’ll use simple artificial datasets for demonstration.

import pandas as pd

# Sample customer data
customers = pd.DataFrame({
    'customerID': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Carol', 'David']
})

# Sample purchase data
purchases = pd.DataFrame({
    'customerID': [2, 4, 4, 5],
    'item': ['Book', 'Pen', 'Pencil', 'Notebook']
})

# Another DataFrame with different columns
rewards = pd.DataFrame({
    'customerID': [1, 2, 3, 5],
    'points': [100, 150, 200, 50]
})

print("Customers:\n", customers)
print("\nPurchases:\n", purchases)
print("\nRewards:\n", rewards)

Concatenation of DataFrames

Concatenation stacks DataFrames by rows (axis=0) or by columns (axis=1).

Concatenating by Rows

Useful when DataFrames share similar columns.

result_row = pd.concat([customers, rewards], ignore_index=True)
print(result_row)

Result:

   customerID   name  points
0           1  Alice     NaN
1           2    Bob     NaN
2           3  Carol     NaN
3           4  David     NaN
4           1    NaN   100.0
5           2    NaN   150.0
6           3    NaN   200.0
7           5    NaN    50.0

💡 Note: Columns not shared between DataFrames are filled with NaN.

Concatenating by Columns

result_col = pd.concat([customers, rewards], axis=1)
print(result_col)

⚠️ Be careful: Row alignment can cause mismatched data if lengths differ.

Merging DataFrames

Merging aligns DataFrames on common column values (like SQL joins), controlled by the on and how parameters.

Inner Merge

Keeps only rows with matching values.

merged_inner = pd.merge(customers, purchases, on='customerID', how='inner')
print(merged_inner)

Result:

   customerID   name    item
0           2    Bob    Book
1           4  David     Pen
2           4  David  Pencil

Outer Merge

Includes all rows from both DataFrames; fills missing with NaN.

merged_outer = pd.merge(customers, purchases, on='customerID', how='outer')
print(merged_outer)

Result:

   customerID   name      item
0           1  Alice       NaN
1           2    Bob      Book
2           3  Carol       NaN
3           4  David       Pen
4           4  David    Pencil
5           5    NaN  Notebook

Left and Right Merges

  • Left Join: Keep all rows from customers, bring matching ones from purchases.
  • Right Join: Keep all rows from purchases, bring matching ones from customers.

merged_left = pd.merge(customers, purchases, on='customerID', how='left')
print("Left Merge:\n", merged_left)

merged_right = pd.merge(customers, purchases, on='customerID', how='right')
print("Right Merge:\n", merged_right)

Handling Column Name Conflicts

If both DataFrames share non-key column names, Pandas adds suffixes.

df1 = pd.DataFrame({'ID': [1, 2], 'score': [90, 85]})
df2 = pd.DataFrame({'ID': [1, 2], 'score': [80, 88]})

pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))

Result:

   ID  score_left  score_right
0   1          90           80
1   2          85           88

Joining DataFrames (Index-Based)

.join() merges DataFrames using their index — handy when data is already indexed.

# Set index before joining
customers_indexed = customers.set_index('customerID')
rewards_indexed = rewards.set_index('customerID')

joined_df = customers_indexed.join(rewards_indexed, how='outer')
print(joined_df)

Result:

             name  points
customerID
1           Alice   100.0
2             Bob   150.0
3           Carol   200.0
4           David     NaN
5             NaN    50.0

Key Takeaways

  • Concatenation → Stack DataFrames (vertically or horizontally).
  • Merging → Combine on column keys (like SQL joins).
  • Joining → Merge on index, ideal for already indexed data.
  • Suffixes help when merging overlapping column names.
  • ✅ Always inspect alignment and NaN values after merging!
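One convenient way to do that last check is merge’s indicator parameter, which adds a _merge column recording where each row came from:

```python
import pandas as pd

customers = pd.DataFrame({'customerID': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Carol', 'David']})
purchases = pd.DataFrame({'customerID': [2, 4, 4, 5],
                          'item': ['Book', 'Pen', 'Pencil', 'Notebook']})

# indicator=True labels each row 'both', 'left_only', or 'right_only'
checked = pd.merge(customers, purchases, on='customerID',
                   how='outer', indicator=True)
print(checked[['customerID', '_merge']])
```

Filtering on _merge (for example, `checked[checked['_merge'] == 'left_only']`) quickly reveals customers with no purchases.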


Saving and Loading Datasets

Sample Dataset

Let’s create a small CSV and Excel dataset to use for demonstrations.

Sample Data: products.csv

ProductID,ProductName,Category,Price,Stock
1001,Widget A,Widgets,25.5,120
1002,Widget B,Widgets,35.0,60
1003,Gadget X,Gadgets,99.99,25
1004,Doohickey Y,Doohickeys,7.95,200
1005,Gizmo Z,Gizmos,49.5,0

💡 You can easily create this file in Colab or locally using any text editor.

Creating and Viewing DataFrames

Let’s start by manually creating a DataFrame to understand the basics.

import pandas as pd

# Create a small DataFrame
data = {
    'ProductID': [1001, 1002, 1003],
    'ProductName': ['Widget A', 'Widget B', 'Gadget X'],
    'Category': ['Widgets', 'Widgets', 'Gadgets'],
    'Price': [25.5, 35.0, 99.99],
    'Stock': [120, 60, 25]
}

df = pd.DataFrame(data)
print(df)

Output:

   ProductID ProductName Category  Price  Stock
0       1001    Widget A  Widgets  25.50    120
1       1002    Widget B  Widgets  35.00     60
2       1003    Gadget X  Gadgets  99.99     25

Loading Data from CSV and Excel

For larger datasets, it’s more efficient to load data from files.

Load from CSV

# Assuming 'products.csv' is uploaded in your working directory
df = pd.read_csv('products.csv')
print(df.head())

Load from Excel

# Create and save an Excel file (requires the openpyxl package)
df.to_excel('products.xlsx', index=False)

# Read it back
df_excel = pd.read_excel('products.xlsx')
print(df_excel.head())

Google Colab: Working with Drive

If you’re using Google Colab, mount your Drive to access files easily.

from google.colab import drive
drive.mount('/content/drive')

# Update the path to your file in Drive
file_path = '/content/drive/My Drive/data/products.csv'
df = pd.read_csv(file_path)

Key Points:

  • Mount Google Drive for file access.
  • Adjust file paths according to your Drive’s folder structure.

Exploring DataFrames

Once loaded, you can explore and inspect your data using Pandas’ built-in methods.

# View first few rows
print(df.head())

# View info about columns and data types
print(df.info())

# Display summary statistics
print(df.describe())

Saving DataFrames

After data manipulation or analysis, save your DataFrame back to a file.

Save as CSV

df.to_csv('products_updated.csv', index=False)

Save as Excel

df.to_excel('products_updated.xlsx', index=False)

💡 Tip: Always set index=False to prevent Pandas from saving index values as a separate column.
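To see what the tip is guarding against: saving with the index (the default) and reading the file back produces a spurious 'Unnamed: 0' column. A quick sketch using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Saving WITH the index (the default) adds an extra column on reload
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
cols_with_index = pd.read_csv(buf).columns.tolist()
print(cols_with_index)   # ['Unnamed: 0', 'a', 'b']

# With index=False the round trip is clean
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
cols_clean = pd.read_csv(buf).columns.tolist()
print(cols_clean)        # ['a', 'b']
```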

Key Takeaways

  • Create DataFrames manually or by loading data from CSV/Excel.
  • Explore your data with head(), info(), and describe().
  • Save results back to CSV or Excel for further use.
  • ✅ When using Google Colab, remember to mount Google Drive for file access.


Pandas Functions

Setup and Sample Data

Let’s create a small DataFrame for our analysis.

import pandas as pd

# Sample data
data = {
    'stock': ['Apple', 'SNI', 'TJX', 'ZTS'] * 4,
    'date': ['2023-01-01'] * 4 + ['2023-01-02'] * 4 + ['2023-01-03'] * 4 + ['2023-01-04'] * 4,
    'price': [175.00, 45.60, 74.59, 28.40, 178.25, 46.10, 74.59, 29.00,
              179.26, 46.10, 69.08, 30.00, 150.00, 46.10, 69.08, 32.00]
}

df = pd.DataFrame(data)
df

Sample Output (first four rows):

   stock        date   price
0  Apple  2023-01-01  175.00
1    SNI  2023-01-01   45.60
2    TJX  2023-01-01   74.59
3    ZTS  2023-01-01   28.40

Viewing DataFrame Rows

Quickly inspect your dataset using head() and tail():

# First five rows
df.head()

# Last five rows
df.tail()

Understanding Data Structure

Use Pandas utilities to get a summary of the dataset.

print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
df.info()

Explanation:

  • shape: Returns number of rows and columns
  • columns: Lists column names
  • info(): Displays data types and non-null counts

Data Types in Pandas

Understanding column data types helps guide your analysis.

print(df.dtypes)

Expect text columns (object) for stock names and dates, and numerical (float64) for prices.

Extracting Columns & Performing Calculations

You can select a column using bracket notation (returns a Series).

prices = df['price']
print(prices.head())

Statistical Measures

Pandas makes it easy to compute descriptive statistics:

print("Min price:", prices.min())
print("Max price:", prices.max())
print("Mean price:", prices.mean())
print("Median price:", prices.median())
print("Mode price:", prices.mode().tolist())

Unique Values and Value Counts

Explore distinct values and their frequencies.

# Unique stock names
print(df['stock'].unique())

# Frequency of each stock
print(df['stock'].value_counts())

# Proportional frequency (percentage)
print(df['stock'].value_counts(normalize=True))

Handling Multiple Modes

Sometimes a column has more than one most frequent value.

modes = df['price'].mode()
print("Mode(s):", modes.tolist())

Saving & Loading DataFrames

Easily export and reload your data with Pandas.

# Save to CSV
df.to_csv('sample_stocks.csv', index=False)

# Load the saved file
df_loaded = pd.read_csv('sample_stocks.csv')
print(df_loaded.head())

Key Takeaways

Using Pandas, you can efficiently analyze and summarize tabular data.

✅ Create and preview DataFrames
✅ Inspect structure and data types
✅ Extract and compute statistics
✅ Explore unique values and frequencies
✅ Save and reload datasets

Pandas Functions (groupby)

Sample Dataset

Let’s create a small fictional stock price dataset.

import pandas as pd

data = {
    'Stock': ['AAPL', 'AAPL', 'AAPL', 'SNI', 'SNI', 'TJX', 'TJX', 'TJX', 'ZTS', 'ZTS'],
    'Price': [110, 108, 109, 72, 70, 65, 66, 67, 44, 46]
}

df = pd.DataFrame(data)
df

Grouping Data and Calculating Mean & Median

Grouping is key to analyzing subsets of your data.
Let’s calculate the mean and median stock prices for each stock.

# Group by 'Stock' and calculate the mean price
mean_prices = df.groupby('Stock')['Price'].mean()
print("Mean Prices:\n", mean_prices)

# Group by 'Stock' and calculate the median price
median_prices = df.groupby('Stock')['Price'].median()
print("Median Prices:\n", median_prices)

Sample Output:

Mean Prices:
Stock
AAPL    109.0
SNI      71.0
TJX      66.0
ZTS      45.0
Name: Price, dtype: float64

Median Prices:
Stock
AAPL    109.0
SNI      71.0
TJX      66.0
ZTS      45.0
Name: Price, dtype: float64

Writing and Applying Custom Functions

You can define your own transformation functions and use the .apply() method to apply them to DataFrame columns.

# Custom function to increase prices by 10%
def increase_by_10_percent(x):
    return x * 1.10

# Apply the function to 'Price' and create a new column
df['Increased_Price'] = df['Price'].apply(increase_by_10_percent)
df

Output Preview:

  Stock  Price  Increased_Price
0  AAPL    110            121.0
1  AAPL    108            118.8
2  AAPL    109            119.9
3   SNI     72             79.2

Sorting DataFrames

Sorting helps you rank or order your data — for example, finding the highest prices.

# Sort by 'Increased_Price' in descending order
df_sorted = df.sort_values(by='Increased_Price', ascending=False)
df_sorted

Output Preview:

  Stock  Price  Increased_Price
0  AAPL    110            121.0
2  AAPL    109            119.9
...
8   ZTS     44             48.4

Key Takeaways

✅ Use groupby() to aggregate data
✅ Calculate mean and median by group
✅ Apply custom functions using .apply()
✅ Create new columns with transformed data
✅ Sort DataFrames based on computed columns
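Going one step further than mean() and median(): a single .agg() call can compute several statistics per group at once. A sketch with the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({
    'Stock': ['AAPL', 'AAPL', 'SNI', 'SNI'],
    'Price': [110, 108, 72, 70]
})

# One .agg() call computes several statistics per group
summary = df.groupby('Stock')['Price'].agg(['mean', 'min', 'max'])
print(summary)
```

The result is a DataFrame with one row per stock and one column per statistic.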


Pandas Functions (datetime)

Working with date and time information is essential in many data science projects. The pandas library provides powerful tools for datetime manipulation.

  • Converting string columns to datetime
  • Handling different date formats
  • Extracting date components (year, month, day)
  • Performing date comparisons and calculations

Sample Data

Let’s start by creating a small CSV file mimicking a real-world dataset:

# save this as sample_dates.csv
id,date_purchased
1,01/12/2023
2,15/02/2023
3,23/07/2022
4,28/12/2020
5,03/03/2021

This CSV has two columns: id and date_purchased, where date_purchased is a string in dd/mm/yyyy format.

Load and Inspect Data

import pandas as pd

# Create a DataFrame directly
data = {
    'id': [1, 2, 3, 4, 5],
    'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)
print(df)

Converting Strings to Datetime

By default, pandas treats date columns as strings, preventing date-specific operations.
Use pd.to_datetime() to fix this:

# Convert the 'date_purchased' column to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)
print(df.dtypes)

💡 Note:

  • dayfirst=True tells pandas that the date format is DD/MM/YYYY, common outside the US.

Handling Different Date Formats

Data from multiple regions may use different formats. Passing an explicit format string removes any ambiguity.

# US format: MM/DD/YYYY
us_dates = pd.Series(['12/01/2023', '02/15/2023'])
pd.to_datetime(us_dates, format='%m/%d/%Y')

# European format: DD/MM/YYYY
eu_dates = pd.Series(['01/12/2023', '15/02/2023'])
pd.to_datetime(eu_dates, format='%d/%m/%Y')

Extracting Year, Month, and Day

Once converted, datetime columns let you extract useful components:

df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day
print(df)

Creating New Columns from Datetime

Now, year, month, and day columns can help with time-based analysis.

import matplotlib.pyplot as plt

# Visualizing purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

This produces a bar chart summarizing purchases per year.

Comparing Dates & Datetime Arithmetic

You can filter and compute date-based metrics easily.

# Compare: Which purchases are after Jan 1, 2022?
after_2022 = df[df['date_purchased'] > '2022-01-01']
print(after_2022)

# Calculate days since purchase from today
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df[['id', 'date_purchased', 'days_since_purchase']])

Key Takeaways

  • ✅ Always convert date columns to pandas datetime type.
  • ✅ Specify the date format using parameters like dayfirst to avoid misinterpretation.
  • ✅ Easily extract year, month, day for deeper analysis.
  • ✅ Perform date comparisons and calculations effortlessly.

💻 Full Example Notebook

You can copy this entire block into Google Colab and run it end-to-end.

import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)

# Convert to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)

# Extract components
df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day

# Visualize purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

# Datetime arithmetic
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df)