Foundations of Machine Learning: Inside Python’s Pandas Library

by Selwyn Davidraj     Posted on October 01, 2025

In this blog, we will explore Pandas, a powerful Python library designed for data manipulation and analysis. If you’re diving into data science or want to automate data-related tasks, Pandas will be your go-to toolkit.


What is Pandas?

Pandas is a powerful, open-source Python library that provides flexible and expressive data structures to work with structured (tabular) data — much like working with spreadsheets, but more programmatic and scalable.

✨ Key Features

  • Two main data structures: Series (1D) and DataFrame (2D)
  • Easy data reading and writing
  • Quick summary statistics and visualization
  • Seamless integration with NumPy

Importing Pandas

To start using Pandas, import it at the top of your script or notebook.
The standard import pattern is:

import pandas as pd

This allows you to use the concise pd prefix when calling Pandas functions.
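The “seamless integration with NumPy” mentioned above is easy to see in a quick sketch: NumPy functions work element-wise on a Series and preserve its labels (the values and names here are just illustrative).

```python
import numpy as np
import pandas as pd

# Build a Series directly from a NumPy array
arr = np.array([1.0, 4.0, 9.0])
s = pd.Series(arr, index=['a', 'b', 'c'])

# NumPy ufuncs operate element-wise on a Series and keep its labels
roots = np.sqrt(s)
print(roots)          # a: 1.0, b: 2.0, c: 3.0

# Recover the underlying NumPy array when needed
print(s.to_numpy())   # [1. 4. 9.]
```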

Pandas Data Structures: Series and DataFrames

  • Series: A one-dimensional labeled array.
  • DataFrame: A two-dimensional table (like an Excel sheet or SQL table).

Creating a Pandas Series

Let’s create a simple list of drug prices, then convert it into a Series.

import pandas as pd

# Sample drug prices in USD
drug_prices = [13.5, 24.8, 7.99, 4.25]
drug_series = pd.Series(drug_prices, index=['Aspirin', 'Paracetamol', 'Ibuprofen', 'Cetirizine'])

print(drug_series)

Output:

Aspirin       13.50
Paracetamol   24.80
Ibuprofen      7.99
Cetirizine     4.25
dtype: float64

Manipulating Pandas Series

You can easily perform arithmetic on Series objects:

# Increase all drug prices by $2.50
updated_prices = drug_series + 2.50
print(updated_prices)

Output:

Aspirin       16.00
Paracetamol   27.30
Ibuprofen     10.49
Cetirizine     6.75
dtype: float64

Or update data directly:

# Replace data with a new set
new_prices = [15.0, 28.5, 8.39, 5.10]
drug_series[:] = new_prices
print(drug_series)

Creating Pandas DataFrames

A DataFrame is like a spreadsheet or SQL table. Let’s create one with fictional student grades:

# Sample student data
student_data = {
    "Name": ["Asha", "Ben", "Carla", "Dan"],
    "Grade": [88, 92, 77, 85]
}

students_df = pd.DataFrame(student_data)
print(students_df)

Output:

    Name  Grade
0   Asha     88
1    Ben     92
2  Carla     77
3    Dan     85

Indexing and Labeling in Pandas

You can assign custom indices to DataFrames and easily access specific data.

# Assign new index labels
students_df.index = ['Student 1', 'Student 2', 'Student 3', 'Student 4']
print(students_df)

# Select the 'Grade' column
print(students_df['Grade'])

# Access row by label
print(students_df.loc['Student 2'])

Adding and Updating Values

You can easily add new columns or modify existing ones.

# Add a new column indicating pass/fail
students_df['Pass'] = students_df['Grade'] >= 80
print(students_df)

# Update a specific cell
students_df.at['Student 3', 'Grade'] = 80
print(students_df)

Working with Different Data Sources

Pandas can create DataFrames from various sources — dictionaries, lists, or NumPy arrays.

From a Dictionary

# Sample energy data
energy_data = {
    "City": ["New York", "London", "Tokyo"],
    "Consumption_kWh": [2500, 1800, 2700]
}
energy_df = pd.DataFrame(energy_data)
print(energy_df)

From a NumPy Array

import numpy as np

# Random dataset: 3 rows × 2 columns
array_data = np.random.rand(3, 2)
array_df = pd.DataFrame(array_data, columns=['Col1', 'Col2'])
print(array_df)

Accessing and Modifying Series

Creating a Sample Dataset

We’ll define operators and revenues as lists, then use them to create a Pandas Series.

# Sample data: Mobile operators and their 2020 revenues (in billions USD)
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]

# Creating a Pandas Series
revenue_series = pd.Series(data=revenues, index=operators)
print(revenue_series)

Output:

AT&T       171
Verizon    128
T-Mobile    68
Sprint      33
dtype: int64

Indexing and Accessing Values

Pandas Series can be indexed in two ways:

  • By numeric position (like a list)
  • By label (using the operator’s name)

Access by position:

# Accessing by position (use .iloc for positional access)
print(revenue_series.iloc[0])  # Output: 171 (AT&T)

Access by label:

# Accessing by label
print(revenue_series['T-Mobile'])  # Output: 68

Slicing the Series

Pandas supports slicing, similar to Python lists or NumPy arrays.

By numeric index (end-exclusive):

# Slicing by position (index 1 up to but not including 3)
print(revenue_series.iloc[1:3])

Output:

Verizon    128
T-Mobile    68
dtype: int64

By label (end-inclusive):

# Slicing by label (from 'Verizon' to 'T-Mobile', inclusive)
print(revenue_series['Verizon':'T-Mobile'])

Output:

Verizon    128
T-Mobile    68
dtype: int64

Key Point:

  • Numeric slicing [start:end] is end-exclusive
  • Label slicing ['label1':'label2'] is end-inclusive

Accessing Multiple Items

You can select multiple non-contiguous entries by passing a list of indices or labels.

By position:

# Multiple entries by position
print(revenue_series.iloc[[0, 2]])

Output:

AT&T       171
T-Mobile    68
dtype: int64

By label:

# Multiple entries by label
print(revenue_series[['AT&T', 'Sprint']])

Output:

AT&T     171
Sprint    33
dtype: int64

Full Sample Code (Copy/Paste for Google Colab)

import pandas as pd

# Sample data
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]

# Create Series
revenue_series = pd.Series(data=revenues, index=operators)

# Display Series
print("Full Series:")
print(revenue_series)

# Access by position
print("\nAT&T revenue (by position):")
print(revenue_series.iloc[0])

# Access by label
print("\nT-Mobile revenue (by label):")
print(revenue_series['T-Mobile'])

# Slicing by numeric position
print("\nSlice by position [1:3]:")
print(revenue_series.iloc[1:3])

# Slicing by label
print("\nSlice by label ['Verizon':'T-Mobile']:")
print(revenue_series['Verizon':'T-Mobile'])

# Access multiple entries by position
print("\nMultiple entries by position [0, 2]:")
print(revenue_series.iloc[[0, 2]])

# Access multiple entries by label
print("\nMultiple entries by label ['AT&T', 'Sprint']:")
print(revenue_series[['AT&T', 'Sprint']])

Accessing DataFrames

Creating a Sample DataFrame

Let’s create a DataFrame representing simple customer transactions.

import pandas as pd

# Sample dataset as a dictionary
data = {
    'CustomerID': [101, 102, 103, 104],
    'Location': ['New York', 'Mumbai', 'Berlin', 'Tokyo'],
    'Gender': ['F', 'M', 'M', 'F'],
    'StoreType': ['Retail', 'Online', 'Retail', 'Online'],
    'Quantity': [2, 5, 3, 1],
    'Bill': [40.75, 120.00, 63.00, 20.50]
}

# Creating DataFrame
df = pd.DataFrame(data)
df

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill
0         101  New York      F    Retail         2   40.75
1         102    Mumbai      M    Online         5  120.00
2         103    Berlin      M    Retail         3   63.00
3         104     Tokyo      F    Online         1   20.50

Accessing Data in DataFrames

Access Columns

You can access a column using its name inside square brackets.

# Accessing the 'Location' column
print(df['Location'])

Output:

0    New York
1      Mumbai
2      Berlin
3       Tokyo
Name: Location, dtype: object

Access Rows by Index

You can use .iloc[] to access rows by their numeric index.

# Access the third row (index 2)
print(df.iloc[2])

Output:

CustomerID      103
Location      Berlin
Gender             M
StoreType      Retail
Quantity           3
Bill            63.0
Name: 2, dtype: object

Access Specific Values

Use .at[row, column] for fast scalar access.

# Access the 'Bill' for the second customer (index 1)
print(df.at[1, 'Bill'])

Output:

120.0

Slicing and Advanced Indexing

You can slice rows or reverse them, similar to Python lists.

# Get the first 3 rows
print(df[:3])

# Get every other row
print(df[::2])

# Reverse the DataFrame rows
print(df[::-1])

Modifying DataFrames

Adding a New Column

Let’s add a new column called Discount.

df['Discount'] = [5, 15, 0, 2]
print(df)

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill  Discount
0         101  New York      F    Retail         2   40.75         5
1         102    Mumbai      M    Online         5  120.00        15
2         103    Berlin      M    Retail         3   63.00         0
3         104     Tokyo      F    Online         1   20.50         2

Updating Values

You can update values conditionally using .loc[].

# Give a 10% discount to all 'Online' purchases
df.loc[df['StoreType'] == 'Online', 'Discount'] = 10
print(df)

Output:

   CustomerID  Location Gender StoreType  Quantity    Bill  Discount
0         101  New York      F    Retail         2   40.75         5
1         102    Mumbai      M    Online         5  120.00        10
2         103    Berlin      M    Retail         3   63.00         0
3         104     Tokyo      F    Online         1   20.50        10

Comparing DataFrames and NumPy Arrays

Feature                  NumPy Array            pandas DataFrame
Data types per column    Same type per array    Can differ per column
Access                   By index only          By index or column name
Flexibility              Numeric focus          Works well with structured, real-world data
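The per-column dtype difference in the table above shows up immediately in code. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for every element,
# so mixing numbers and strings coerces everything to strings...
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype)    # a string dtype such as <U21

# ...whereas a DataFrame keeps one dtype per column
df = pd.DataFrame({'num': [1, 2], 'letter': ['a', 'b']})
print(df.dtypes)    # num: int64, letter: object
```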

Accessing and Modifying DataFrames (loc and iloc)

Creating a Sample DataFrame

Let’s start by creating a simple DataFrame to work with:

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Paris', 'London', 'Berlin'],
    'Score': [85, 90, 78, 88]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Score
0    Alice   24  New York     85
1      Bob   27     Paris     90
2  Charlie   22    London     78
3    David   32    Berlin     88

Accessing Rows and Columns

Accessing Columns

# Accessing a column by name
print(df['Name'])

Output:

0     Alice
1       Bob
2   Charlie
3     David
Name: Name, dtype: object

Accessing Rows with Square Brackets

# Access a row by label or position
print(df.loc[1])    # Access by index label (row 1)
print(df.iloc[2])   # Access by integer position (row 2)

Using loc and iloc

  • .loc[] uses labels (row and column names).
  • .iloc[] uses integer indices (row and column positions).

Accessing Specific Entries

# Access entry using loc [row label, column name]
print(df.loc[0, 'City'])   # Output: New York

# Access entry using iloc [row index, column index]
print(df.iloc[2, 3])       # Output: 78

Accessing Multiple Rows/Columns

# Access multiple rows and columns
print(df.loc[[1, 3], ['Name', 'Score']])
print(df.iloc[[0, 2], [1, 2]])

Modifying Entries in DataFrame

You can update entries using .loc[] and .iloc[] combined with assignment.

# Modify a single value
df.loc[1, 'Score'] = 92

# Modify multiple values
df.loc[[0, 2], 'Score'] = [87, 80]

# Using iloc
df.iloc[3, 1] = 33   # Updates David's age
print(df)

Updated DataFrame:

      Name  Age      City  Score
0    Alice   24  New York     87
1      Bob   27     Paris     92
2  Charlie   22    London     80
3    David   33    Berlin     88

Visualizing Changes

Let’s see the score distribution as a simple bar chart using matplotlib.

import matplotlib.pyplot as plt

plt.bar(df['Name'], df['Score'], color='skyblue')
plt.xlabel('Name')
plt.ylabel('Score')
plt.title('Scores of Individuals')
plt.show()

This visualization helps confirm the updates made to our data.

Summary Table

Operation                       Code Example                        Description
Access a column                 df['Name']                          Get column as Series
Access a row by label           df.loc[1]                           Returns a Series
Access a row by position        df.iloc[1]                          Returns a Series
Access a specific entry         df.loc[2, 'City']                   Get Charlie’s city
Access multiple rows/columns    df.loc[[0, 2], ['Name', 'Score']]   Subset of rows/columns
Modify an entry with loc/iloc   df.loc[1, 'Score'] = 92             Updates Bob’s score

Some Takeaways

  • Use square brackets to quickly access rows or columns.
  • Use .loc[] for labels and .iloc[] for indices.
  • Both methods can select or modify entries efficiently.
  • Mastering .loc and .iloc is essential for interactive data exploration and data manipulation in Pandas.
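One more pattern worth knowing: .loc also accepts a boolean condition for the rows together with a list of column names. A self-contained sketch using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Score': [85, 90, 78, 88]
})

# Rows where Score is at least 85, restricted to two columns
top = df.loc[df['Score'] >= 85, ['Name', 'Score']]
print(top)
```

This returns only Alice, Bob, and David, with just the Name and Score columns.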

Accessing and Modifying DataFrames (condition-based indexing)

Creating a Sample DataFrame

Let’s use a small DataFrame representing store purchases for easy experimentation.

import pandas as pd

data = {
    'customer_id': [1, 2, 3, 4, 5],
    'product': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'quantity': [2, 1, 5, 3, 1],
    'price': [0.99, 0.59, 2.99, 1.49, 4.99]
}
df = pd.DataFrame(data)
df

Output:

   customer_id     product  quantity  price
0            1       apple         2   0.99
1            2      banana         1   0.59
2            3      cherry         5   2.99
3            4        date         3   1.49
4            5  elderberry         1   4.99

Accessing & Modifying DataFrame Values

Access by Index and Labels

  • .loc: Access by label (row and column names).
  • .iloc: Access by position (row and column numbers).

# Using .loc (label)
print(df.loc[2, 'product'])  # Output: cherry

# Using .iloc (position)
print(df.iloc[0, 2])  # Output: 2 (quantity of first row)

Logical Indexing in DataFrames

Logical indexing lets you filter rows by specific conditions.

# Find all rows where quantity is greater than 1
filtered = df[df['quantity'] > 1]
print(filtered)

Output:

   customer_id product  quantity  price
0            1   apple         2   0.99
2            3  cherry         5   2.99
3            4    date         3   1.49

💡 Key Insight: Logical indexing is crucial for filtering and selecting subsets of data efficiently.
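Conditions can also be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses. A sketch using the same purchase data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
    'quantity': [2, 1, 5, 3, 1],
    'price': [0.99, 0.59, 2.99, 1.49, 4.99]
})

# quantity above 1 AND price under $2: each condition in its own parentheses
cheap_bulk = df[(df['quantity'] > 1) & (df['price'] < 2.00)]
print(cheap_bulk)
```

Only the apple and date rows satisfy both conditions.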

➕ Adding Columns

Creating new columns in a DataFrame is simple and intuitive.

# Add a 'rating' column
df['rating'] = [4.5, 4.0, 4.2, 4.7, 4.8]
print(df)

Output:

   customer_id     product  quantity  price  rating
0            1       apple         2   0.99     4.5
1            2      banana         1   0.59     4.0
2            3      cherry         5   2.99     4.2
3            4        date         3   1.49     4.7
4            5  elderberry         1   4.99     4.8

➖ Dropping Columns

Use .drop() to remove columns. By default, it returns a new DataFrame.

# Drop the 'customer_id' column
df_dropped = df.drop('customer_id', axis=1)
print(df_dropped)

# Drop in place
df.drop('customer_id', axis=1, inplace=True)
print(df)

Dropping Rows

You can remove rows by their index number.

# Drop the row at index 1
df_dropped_row = df.drop(1, axis=0)
print(df_dropped_row)

Resetting Indices

After dropping rows, the index might be non-sequential. Reset it using .reset_index().

# Reset the index after dropping a row
df_reset = df.drop(1).reset_index(drop=True)
print(df_reset)

Copying DataFrames

To modify data without affecting the original, use .copy().

# Create a copy of the DataFrame
df_copy = df.copy()
# Changes to df_copy do NOT affect df
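To see why .copy() matters, compare it with a plain assignment, which only creates a second name for the same object (a small sketch with illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

alias = df            # plain assignment: a second name for the SAME object
snapshot = df.copy()  # an independent copy

df.loc[0, 'x'] = 99
print(alias.loc[0, 'x'])     # 99, the alias sees the change
print(snapshot.loc[0, 'x'])  # 1, the copy is unaffected
```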

Recap

  • Access DataFrame values using .loc and .iloc.
  • Filter rows efficiently with logical indexing.
  • Add new columns easily with direct assignment.
  • Drop columns/rows with .drop(); use inplace=True to modify the original DataFrame.
  • Reset indices when needed using .reset_index().
  • Copy DataFrames safely using .copy() to prevent unintended changes.


Combining DataFrames

In data science, you’ll often work with data spread across multiple sources. Combining these datasets efficiently is crucial for comprehensive analysis. Pandas offers several flexible methods for merging, joining, and concatenating DataFrames.

Sample Data

We’ll use simple artificial datasets for demonstration.

import pandas as pd

# Sample customer data
customers = pd.DataFrame({
    'customerID': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Carol', 'David']
})

# Sample purchase data
purchases = pd.DataFrame({
    'customerID': [2, 4, 4, 5],
    'item': ['Book', 'Pen', 'Pencil', 'Notebook']
})

# Another DataFrame with different columns
rewards = pd.DataFrame({
    'customerID': [1, 2, 3, 5],
    'points': [100, 150, 200, 50]
})

print("Customers:\n", customers)
print("\nPurchases:\n", purchases)
print("\nRewards:\n", rewards)

Concatenation of DataFrames

Concatenation stacks DataFrames by rows (axis=0) or by columns (axis=1).

Concatenating by Rows

Useful when DataFrames share similar columns.

result_row = pd.concat([customers, rewards], ignore_index=True)
print(result_row)

Result:

   customerID   name  points
0           1  Alice     NaN
1           2    Bob     NaN
2           3  Carol     NaN
3           4  David     NaN
4           1    NaN   100.0
5           2    NaN   150.0
6           3    NaN   200.0
7           5    NaN    50.0

💡 Note: Columns not shared between DataFrames are filled with NaN.

Concatenating by Columns

result_col = pd.concat([customers, rewards], axis=1)
print(result_col)

⚠️ Be careful: Row alignment can cause mismatched data if lengths differ.

Merging DataFrames

Merging aligns DataFrames on common column values (like SQL joins), controlled by the on and how parameters.

Inner Merge

Keeps only rows with matching values.

merged_inner = pd.merge(customers, purchases, on='customerID', how='inner')
print(merged_inner)

Result:

   customerID   name    item
0           2    Bob    Book
1           4  David     Pen
2           4  David  Pencil

Outer Merge

Includes all rows from both DataFrames; fills missing with NaN.

merged_outer = pd.merge(customers, purchases, on='customerID', how='outer')
print(merged_outer)

Result:

   customerID   name      item
0           1  Alice       NaN
1           2    Bob      Book
2           3  Carol       NaN
3           4  David       Pen
4           4  David    Pencil
5           5    NaN  Notebook

Left and Right Merges

  • Left Join: Keep all rows from customers, bring matching ones from purchases.
  • Right Join: Keep all rows from purchases, bring matching ones from customers.

merged_left = pd.merge(customers, purchases, on='customerID', how='left')
print("Left Merge:\n", merged_left)

merged_right = pd.merge(customers, purchases, on='customerID', how='right')
print("Right Merge:\n", merged_right)

Handling Column Name Conflicts

If both DataFrames share non-key column names, Pandas adds suffixes.

df1 = pd.DataFrame({'ID': [1, 2], 'score': [90, 85]})
df2 = pd.DataFrame({'ID': [1, 2], 'score': [80, 88]})

pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))

Result:

   ID  score_left  score_right
0   1          90           80
1   2          85           88

Joining DataFrames (Index-Based)

.join() merges DataFrames using their index — handy when data is already indexed.

# Set index before joining
customers_indexed = customers.set_index('customerID')
rewards_indexed = rewards.set_index('customerID')

joined_df = customers_indexed.join(rewards_indexed, how='outer')
print(joined_df)

Result:

             name  points
customerID
1           Alice   100.0
2             Bob   150.0
3           Carol   200.0
4           David     NaN
5             NaN    50.0

Key Takeaways

  • Concatenation → Stack DataFrames (vertically or horizontally).
  • Merging → Combine on column keys (like SQL joins).
  • Joining → Merge on index, ideal for already indexed data.
  • Suffixes help when merging overlapping column names.
  • ✅ Always inspect alignment and NaN values after merging!
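One convenient way to do that last check is merge’s indicator parameter, which adds a _merge column recording where each row came from:

```python
import pandas as pd

customers = pd.DataFrame({'customerID': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Carol', 'David']})
purchases = pd.DataFrame({'customerID': [2, 4, 4, 5],
                          'item': ['Book', 'Pen', 'Pencil', 'Notebook']})

# indicator=True labels each row 'both', 'left_only', or 'right_only'
checked = pd.merge(customers, purchases, on='customerID',
                   how='outer', indicator=True)
print(checked[['customerID', '_merge']])
```

Filtering on _merge (for example, `checked[checked['_merge'] == 'left_only']`) quickly reveals customers with no purchases.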


Saving and Loading Datasets

Sample Dataset

Let’s create a small CSV and Excel dataset to use for demonstrations.

Sample Data: products.csv

ProductID,ProductName,Category,Price,Stock
1001,Widget A,Widgets,25.5,120
1002,Widget B,Widgets,35.0,60
1003,Gadget X,Gadgets,99.99,25
1004,Doohickey Y,Doohickeys,7.95,200
1005,Gizmo Z,Gizmos,49.5,0

💡 You can easily create this file in Colab or locally using any text editor.

Creating and Viewing DataFrames

Let’s start by manually creating a DataFrame to understand the basics.

import pandas as pd

# Create a small DataFrame
data = {
    'ProductID': [1001, 1002, 1003],
    'ProductName': ['Widget A', 'Widget B', 'Gadget X'],
    'Category': ['Widgets', 'Widgets', 'Gadgets'],
    'Price': [25.5, 35.0, 99.99],
    'Stock': [120, 60, 25]
}

df = pd.DataFrame(data)
print(df)

Output:

   ProductID ProductName Category  Price  Stock
0       1001    Widget A  Widgets  25.50    120
1       1002    Widget B  Widgets  35.00     60
2       1003    Gadget X  Gadgets  99.99     25

Loading Data from CSV and Excel

For larger datasets, it’s more efficient to load data from files.

Load from CSV

# Assuming 'products.csv' is uploaded in your working directory
df = pd.read_csv('products.csv')
print(df.head())

Load from Excel

# Create and save an Excel file (requires the openpyxl package)
df.to_excel('products.xlsx', index=False)

# Read it back
df_excel = pd.read_excel('products.xlsx')
print(df_excel.head())

Google Colab: Working with Drive

If you’re using Google Colab, mount your Drive to access files easily.

from google.colab import drive
drive.mount('/content/drive')

# Update the path to your file in Drive
file_path = '/content/drive/My Drive/data/products.csv'
df = pd.read_csv(file_path)

Key Points:

  • Mount Google Drive for file access.
  • Adjust file paths according to your Drive’s folder structure.

Exploring DataFrames

Once loaded, you can explore and inspect your data using Pandas’ built-in methods.

# View first few rows
print(df.head())

# View info about columns and data types
print(df.info())

# Display summary statistics
print(df.describe())

Saving DataFrames

After data manipulation or analysis, save your DataFrame back to a file.

Save as CSV

df.to_csv('products_updated.csv', index=False)

Save as Excel

df.to_excel('products_updated.xlsx', index=False)

💡 Tip: Always set index=False to prevent Pandas from saving index values as a separate column.
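To see what the tip is guarding against: saving with the index (the default) and reading the file back produces a spurious 'Unnamed: 0' column. A quick sketch using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Saving WITH the index (the default) adds an extra column on reload
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
cols_with_index = pd.read_csv(buf).columns.tolist()
print(cols_with_index)   # ['Unnamed: 0', 'a', 'b']

# With index=False the round trip is clean
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
cols_clean = pd.read_csv(buf).columns.tolist()
print(cols_clean)        # ['a', 'b']
```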

Key Takeaways

  • Create DataFrames manually or by loading data from CSV/Excel.
  • Explore your data with head(), info(), and describe().
  • Save results back to CSV or Excel for further use.
  • ✅ When using Google Colab, remember to mount Google Drive for file access.


Pandas Functions

Setup and Sample Data

Let’s create a small DataFrame for our analysis.

import pandas as pd

# Sample data
data = {
    'stock': ['Apple', 'SNI', 'TJX', 'ZTS'] * 4,
    'date': ['2023-01-01'] * 4 + ['2023-01-02'] * 4 + ['2023-01-03'] * 4 + ['2023-01-04'] * 4,
    'price': [175.00, 45.60, 74.59, 28.40, 178.25, 46.10, 74.59, 29.00,
              179.26, 46.10, 69.08, 30.00, 150.00, 46.10, 69.08, 32.00]
}

df = pd.DataFrame(data)
df

Sample Output (first four rows):

   stock        date   price
0  Apple  2023-01-01  175.00
1    SNI  2023-01-01   45.60
2    TJX  2023-01-01   74.59
3    ZTS  2023-01-01   28.40

Viewing DataFrame Rows

Quickly inspect your dataset using head() and tail():

# First five rows
df.head()

# Last five rows
df.tail()

Understanding Data Structure

Use Pandas utilities to get a summary of the dataset.

print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
df.info()

Explanation:

  • shape: Returns number of rows and columns
  • columns: Lists column names
  • info(): Displays data types and non-null counts

Data Types in Pandas

Understanding column data types helps guide your analysis.

print(df.dtypes)

Expect text columns (object) for stock names and dates, and numerical (float64) for prices.

Extracting Columns & Performing Calculations

You can select a column using bracket notation (returns a Series).

prices = df['price']
print(prices.head())

Statistical Measures

Pandas makes it easy to compute descriptive statistics:

print("Min price:", prices.min())
print("Max price:", prices.max())
print("Mean price:", prices.mean())
print("Median price:", prices.median())
print("Mode price:", prices.mode().tolist())

Unique Values and Value Counts

Explore distinct values and their frequencies.

# Unique stock names
print(df['stock'].unique())

# Frequency of each stock
print(df['stock'].value_counts())

# Proportional frequency (percentage)
print(df['stock'].value_counts(normalize=True))

Handling Multiple Modes

Sometimes a column has more than one most frequent value.

modes = df['price'].mode()
print("Mode(s):", modes.tolist())

Saving & Loading DataFrames

Easily export and reload your data with Pandas.

# Save to CSV
df.to_csv('sample_stocks.csv', index=False)

# Load the saved file
df_loaded = pd.read_csv('sample_stocks.csv')
print(df_loaded.head())

Key Takeaways

Using Pandas, you can efficiently analyze and summarize tabular data.

✅ Create and preview DataFrames
✅ Inspect structure and data types
✅ Extract and compute statistics
✅ Explore unique values and frequencies
✅ Save and reload datasets

Pandas Functions (groupby)

Sample Dataset

Let’s create a small fictional stock price dataset.

import pandas as pd

data = {
    'Stock': ['AAPL', 'AAPL', 'AAPL', 'SNI', 'SNI', 'TJX', 'TJX', 'TJX', 'ZTS', 'ZTS'],
    'Price': [110, 108, 109, 72, 70, 65, 66, 67, 44, 46]
}

df = pd.DataFrame(data)
df

Grouping Data and Calculating Mean & Median

Grouping is key to analyzing subsets of your data.
Let’s calculate the mean and median stock prices for each stock.

# Group by 'Stock' and calculate the mean price
mean_prices = df.groupby('Stock')['Price'].mean()
print("Mean Prices:\n", mean_prices)

# Group by 'Stock' and calculate the median price
median_prices = df.groupby('Stock')['Price'].median()
print("Median Prices:\n", median_prices)

Sample Output:

Mean Prices:
Stock
AAPL    109.0
SNI      71.0
TJX      66.0
ZTS      45.0
Name: Price, dtype: float64

Median Prices:
Stock
AAPL    109.0
SNI      71.0
TJX      66.0
ZTS      45.0
Name: Price, dtype: float64

Writing and Applying Custom Functions

You can define your own transformation functions and use the .apply() method to apply them to DataFrame columns.

# Custom function to increase prices by 10%
def increase_by_10_percent(x):
    return x * 1.10

# Apply the function to 'Price' and create a new column
df['Increased_Price'] = df['Price'].apply(increase_by_10_percent)
df

Output Preview:

  Stock  Price  Increased_Price
0  AAPL    110            121.0
1  AAPL    108            118.8
2  AAPL    109            119.9
3   SNI     72             79.2

Sorting DataFrames

Sorting helps you rank or order your data — for example, finding the highest prices.

# Sort by 'Increased_Price' in descending order
df_sorted = df.sort_values(by='Increased_Price', ascending=False)
df_sorted

Output Preview:

  Stock  Price  Increased_Price
0  AAPL    110            121.0
2  AAPL    109            119.9
...
8   ZTS     44             48.4

Key Takeaways

✅ Use groupby() to aggregate data
✅ Calculate mean and median by group
✅ Apply custom functions using .apply()
✅ Create new columns with transformed data
✅ Sort DataFrames based on computed columns
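Going one step further than mean() and median(): a single .agg() call can compute several statistics per group at once. A sketch with the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({
    'Stock': ['AAPL', 'AAPL', 'SNI', 'SNI'],
    'Price': [110, 108, 72, 70]
})

# One .agg() call computes several statistics per group
summary = df.groupby('Stock')['Price'].agg(['mean', 'min', 'max'])
print(summary)
```

The result is a DataFrame with one row per stock and one column per statistic.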


Pandas Functions (datetime)

Working with date and time information is essential in many data science projects. The pandas library provides powerful tools for datetime manipulation.

  • Converting string columns to datetime
  • Handling different date formats
  • Extracting date components (year, month, day)
  • Performing date comparisons and calculations

Sample Data

Let’s start by creating a small CSV file mimicking a real-world dataset:

# save this as sample_dates.csv
id,date_purchased
1,01/12/2023
2,15/02/2023
3,23/07/2022
4,28/12/2020
5,03/03/2021

This CSV has two columns: id and date_purchased, where date_purchased is a string in dd/mm/yyyy format.

Load and Inspect Data

import pandas as pd

# Create a DataFrame directly
data = {
    'id': [1, 2, 3, 4, 5],
    'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)
print(df)

Converting Strings to Datetime

By default, pandas treats date columns as strings, preventing date-specific operations.
Use pd.to_datetime() to fix this:

# Convert the 'date_purchased' column to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)
print(df.dtypes)

💡 Note:

  • dayfirst=True tells pandas that the date format is DD/MM/YYYY, common outside the US.

Handling Different Date Formats

Data from multiple regions may use different formats. Passing an explicit format string removes any ambiguity.

# US format: MM/DD/YYYY
us_dates = pd.Series(['12/01/2023', '02/15/2023'])
pd.to_datetime(us_dates, format='%m/%d/%Y')

# European format: DD/MM/YYYY
eu_dates = pd.Series(['01/12/2023', '15/02/2023'])
pd.to_datetime(eu_dates, format='%d/%m/%Y')

Extracting Year, Month, and Day

Once converted, datetime columns let you extract useful components:

df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day
print(df)

Creating New Columns from Datetime

Now, year, month, and day columns can help with time-based analysis.

import matplotlib.pyplot as plt

# Visualizing purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

This produces a bar chart summarizing purchases per year.

Comparing Dates & Datetime Arithmetic

You can filter and compute date-based metrics easily.

# Compare: Which purchases are after Jan 1, 2022?
after_2022 = df[df['date_purchased'] > '2022-01-01']
print(after_2022)

# Calculate days since purchase from today
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df[['id', 'date_purchased', 'days_since_purchase']])

Key Takeaways

  • ✅ Always convert date columns to pandas datetime type.
  • ✅ Specify the date format using parameters like dayfirst to avoid misinterpretation.
  • ✅ Easily extract year, month, day for deeper analysis.
  • ✅ Perform date comparisons and calculations effortlessly.

💻 Full Example Notebook

You can copy this entire block into Google Colab and run it end-to-end.

import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    'id': [1, 2, 3, 4, 5],
    'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)

# Convert to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)

# Extract components
df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day

# Visualize purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

# Datetime arithmetic
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df)