Foundations of Machine Learning: Inside Python’s Pandas Library
by Selwyn Davidraj Posted on October 01, 2025
In this blog, we will explore Pandas, a powerful Python library designed for data manipulation and analysis. If you’re diving into data science or want to automate data-related tasks, Pandas will be your go-to toolkit.
📘 Table of Contents
- Series and DataFrames
- Accessing and Modifying Series
- Accessing DataFrames
- Accessing and Modifying DataFrames (loc and iloc)
- Accessing and Modifying DataFrames (condition-based indexing)
- Combining DataFrames
- Saving and Loading Datasets
- Pandas Functions
- Pandas Functions (groupby)
- Pandas Functions (datetime)
Series and DataFrames
What is Pandas?
Pandas is a powerful, open-source Python library that provides flexible and expressive data structures to work with structured (tabular) data — much like working with spreadsheets, but more programmatic and scalable.
✨ Key Features
- Two main data structures: Series (1D) and DataFrame (2D)
- Easy data reading and writing
- Quick summary statistics and visualization
- Seamless integration with NumPy
Importing Pandas
To start using Pandas, import it at the top of your script or notebook.
The standard import pattern is:
import pandas as pd
This allows you to use the concise pd prefix when calling Pandas functions.
Pandas Data Structures: Series and DataFrames
- Series: A one-dimensional labeled array.
- DataFrame: A two-dimensional table (like an Excel sheet or SQL table).
Creating a Pandas Series
Let’s create a simple list of drug prices, then convert it into a Series.
import pandas as pd
# Sample drug prices in USD
drug_prices = [13.5, 24.8, 7.99, 4.25]
drug_series = pd.Series(drug_prices, index=['Aspirin', 'Paracetamol', 'Ibuprofen', 'Cetirizine'])
print(drug_series)
Output:
Aspirin 13.50
Paracetamol 24.80
Ibuprofen 7.99
Cetirizine 4.25
dtype: float64
Manipulating Pandas Series
You can easily perform arithmetic on Series objects:
# Increase all drug prices by $2.50
updated_prices = drug_series + 2.50
print(updated_prices)
Output:
Aspirin 16.00
Paracetamol 27.30
Ibuprofen 10.49
Cetirizine 6.75
dtype: float64
Or update data directly:
# Replace data with a new set
new_prices = [15.0, 28.5, 8.39, 5.10]
drug_series[:] = new_prices
print(drug_series)
Creating Pandas DataFrames
A DataFrame is like a spreadsheet or SQL table. Let’s create one with fictional student grades:
# Sample student data
student_data = {
"Name": ["Asha", "Ben", "Carla", "Dan"],
"Grade": [88, 92, 77, 85]
}
students_df = pd.DataFrame(student_data)
print(students_df)
Output:
Name Grade
0 Asha 88
1 Ben 92
2 Carla 77
3 Dan 85
Indexing and Labeling in Pandas
You can assign custom indices to DataFrames and easily access specific data.
# Assign new index labels
students_df.index = ['Student 1', 'Student 2', 'Student 3', 'Student 4']
print(students_df)
# Select the 'Grade' column
print(students_df['Grade'])
# Access row by label
print(students_df.loc['Student 2'])
Adding and Updating Values
You can easily add new columns or modify existing ones.
# Add a new column indicating pass/fail
students_df['Pass'] = students_df['Grade'] >= 80
print(students_df)
# Update a specific cell
students_df.at['Student 3', 'Grade'] = 80
print(students_df)
Working with Different Data Sources
Pandas can create DataFrames from various sources — dictionaries, lists, or NumPy arrays.
From a Dictionary
# Sample energy data
energy_data = {
"City": ["New York", "London", "Tokyo"],
"Consumption_kWh": [2500, 1800, 2700]
}
energy_df = pd.DataFrame(energy_data)
print(energy_df)
From a NumPy Array
import numpy as np
# Random dataset: 3 rows × 2 columns
array_data = np.random.rand(3, 2)
array_df = pd.DataFrame(array_data, columns=['Col1', 'Col2'])
print(array_df)
Accessing and Modifying Series
Creating a Sample Dataset
We’ll define operators and revenues as lists, then use them to create a Pandas Series.
# Sample data: Mobile operators and their 2020 revenues (in billions USD)
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]
# Creating a Pandas Series
revenue_series = pd.Series(data=revenues, index=operators)
print(revenue_series)
Output:
AT&T 171
Verizon 128
T-Mobile 68
Sprint 33
dtype: int64
Indexing and Accessing Values
Pandas Series can be indexed in two ways:
- By numeric position (like a list)
- By label (using the operator’s name)
Access by position:
# Accessing by position
print(revenue_series[0]) # Output: 171 (AT&T)
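💡 Note: In pandas 2.0 and later, integer access with square brackets on a labeled Series emits a FutureWarning, because the bracket could mean either a label or a position. The unambiguous way to access by position is .iloc:
# Position-based access, unambiguous in all pandas versions
print(revenue_series.iloc[0])  # Output: 171 (AT&T)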
Access by label:
# Accessing by label
print(revenue_series['T-Mobile']) # Output: 68
Slicing the Series
Pandas supports slicing, similar to Python lists or NumPy arrays.
By numeric index (end-exclusive):
# Slicing by position (index 1 up to but not including 3)
print(revenue_series[1:3])
Output:
Verizon 128
T-Mobile 68
dtype: int64
By label (end-inclusive):
# Slicing by label (from 'Verizon' to 'T-Mobile', inclusive)
print(revenue_series['Verizon':'T-Mobile'])
Output:
Verizon 128
T-Mobile 68
dtype: int64
Key Point:
- Numeric slicing [start:end] → end-exclusive
- Label slicing ['label1':'label2'] → end-inclusive
Accessing Multiple Items
You can select multiple non-contiguous entries by passing a list of indices or labels.
By position:
# Multiple entries by positions
print(revenue_series[[0, 2]])
Output:
AT&T 171
T-Mobile 68
dtype: int64
By label:
# Multiple entries by label
print(revenue_series[['AT&T', 'Sprint']])
Output:
AT&T 171
Sprint 33
dtype: int64
Full Sample Code (Copy/Paste for Google Colab)
import pandas as pd
# Sample data
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Sprint']
revenues = [171, 128, 68, 33]
# Create Series
revenue_series = pd.Series(data=revenues, index=operators)
# Display Series
print("Full Series:")
print(revenue_series)
# Access by position
print("\nAT&T revenue (by position):")
print(revenue_series[0])
# Access by label
print("\nT-Mobile revenue (by label):")
print(revenue_series['T-Mobile'])
# Slicing by numeric index
print("\nSlice by numeric index [1:3]:")
print(revenue_series[1:3])
# Slicing by label
print("\nSlice by label ['Verizon':'T-Mobile']:")
print(revenue_series['Verizon':'T-Mobile'])
# Access multiple entries by position
print("\nMultiple entries by position [0, 2]:")
print(revenue_series[[0, 2]])
# Access multiple entries by label
print("\nMultiple entries by label ['AT&T', 'Sprint']:")
print(revenue_series[['AT&T', 'Sprint']])
Accessing DataFrames
Creating a Sample DataFrame
Let’s create a DataFrame representing simple customer transactions.
import pandas as pd
# Sample dataset as a dictionary
data = {
'CustomerID': [101, 102, 103, 104],
'Location': ['New York', 'Mumbai', 'Berlin', 'Tokyo'],
'Gender': ['F', 'M', 'M', 'F'],
'StoreType': ['Retail', 'Online', 'Retail', 'Online'],
'Quantity': [2, 5, 3, 1],
'Bill': [40.75, 120.00, 63.00, 20.50]
}
# Creating DataFrame
df = pd.DataFrame(data)
df
Output:
| CustomerID | Location | Gender | StoreType | Quantity | Bill |
|---|---|---|---|---|---|
| 101 | New York | F | Retail | 2 | 40.75 |
| 102 | Mumbai | M | Online | 5 | 120.00 |
| 103 | Berlin | M | Retail | 3 | 63.00 |
| 104 | Tokyo | F | Online | 1 | 20.50 |
Accessing Data in DataFrames
Access Columns
You can access a column using its name inside square brackets.
# Accessing the 'Location' column
print(df['Location'])
Output:
0 New York
1 Mumbai
2 Berlin
3 Tokyo
Name: Location, dtype: object
Access Rows by Index
You can use .iloc[] to access rows by their numeric index.
# Access the third row (index 2)
print(df.iloc[2])
Output:
CustomerID 103
Location Berlin
Gender M
StoreType Retail
Quantity 3
Bill 63.0
Name: 2, dtype: object
Access Specific Values
Use .at[row, column] for fast scalar access.
# Access the 'Bill' for the second customer (index 1)
print(df.at[1, 'Bill'])
Output:
120.0
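💡 .at also has a positional counterpart, .iat, which takes integer row and column positions instead of labels. A small sketch (the column position assumes the column order shown above):
# Same value by position: row 1, column 5 ('Bill')
print(df.iat[1, 5])  # Output: 120.0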
Slicing and Advanced Indexing
You can slice rows or reverse them, similar to Python lists.
# Get the first 3 rows
print(df[:3])
# Get every other row
print(df[::2])
# Reverse the DataFrame rows
print(df[::-1])
Modifying DataFrames
Adding a New Column
Let’s add a new column called Discount.
df['Discount'] = [5, 15, 0, 2]
print(df)
Output:
| CustomerID | Location | Gender | StoreType | Quantity | Bill | Discount |
|---|---|---|---|---|---|---|
| 101 | New York | F | Retail | 2 | 40.75 | 5 |
| 102 | Mumbai | M | Online | 5 | 120.00 | 15 |
| 103 | Berlin | M | Retail | 3 | 63.00 | 0 |
| 104 | Tokyo | F | Online | 1 | 20.50 | 2 |
Updating Values
You can update values conditionally using .loc[].
# Give a 10% discount to all 'Online' purchases
df.loc[df['StoreType'] == 'Online', 'Discount'] = 10
print(df)
Output:
| CustomerID | Location | Gender | StoreType | Quantity | Bill | Discount |
|---|---|---|---|---|---|---|
| 101 | New York | F | Retail | 2 | 40.75 | 5 |
| 102 | Mumbai | M | Online | 5 | 120.00 | 10 |
| 103 | Berlin | M | Retail | 3 | 63.00 | 0 |
| 104 | Tokyo | F | Online | 1 | 20.50 | 10 |
Comparing DataFrames and NumPy Arrays
| Feature | NumPy Array | pandas DataFrame |
|---|---|---|
| Data types per column | Same type per array | Can differ per column |
| Access | By index only | By index or column name |
| Flexibility | Numeric focus | Works well with structured, real-world data |
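A minimal sketch that makes the dtype difference in the table concrete: NumPy coerces mixed data to a single dtype, while a DataFrame keeps one dtype per column (the names arr and mixed_df are just for illustration).
import numpy as np
import pandas as pd
# NumPy forces one dtype for the whole array (here, the numbers become strings)
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype)  # a Unicode string dtype such as <U21
# A DataFrame stores a separate dtype per column
mixed_df = pd.DataFrame({'num': [1, 2], 'letter': ['a', 'b']})
print(mixed_df.dtypes)  # num: int64, letter: object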
Accessing and Modifying DataFrames (loc and iloc)
Creating a Sample DataFrame
Let’s start by creating a simple DataFrame to work with:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Paris', 'London', 'Berlin'],
'Score': [85, 90, 78, 88]
}
df = pd.DataFrame(data)
print(df)
Output:
| | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 24 | New York | 85 |
| 1 | Bob | 27 | Paris | 90 |
| 2 | Charlie | 22 | London | 78 |
| 3 | David | 32 | Berlin | 88 |
Accessing Rows and Columns
Accessing Columns
# Accessing a column by name
print(df['Name'])
Output:
0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object
Accessing Rows with Square Brackets
# Access a row by label or position
print(df.loc[1]) # Access by index label (row 1)
print(df.iloc[2]) # Access by integer position (row 2)
Using loc and iloc
- .loc[] uses labels (row and column names).
- .iloc[] uses integer indices (row and column positions).
Accessing Specific Entries
# Access entry using loc [row label, column name]
print(df.loc[0, 'City']) # Output: New York
# Access entry using iloc [row index, column index]
print(df.iloc[2, 3]) # Output: 78
Accessing Multiple Rows/Columns
# Access multiple rows and columns
print(df.loc[[1, 3], ['Name', 'Score']])
print(df.iloc[[0, 2], [1, 2]])
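Both accessors also accept slices. As with Series, .loc slices are end-inclusive while .iloc slices are end-exclusive; a quick sketch on the same DataFrame:
# .loc includes the end label: rows 0 through 2, columns 'Name' through 'City'
print(df.loc[0:2, 'Name':'City'])
# .iloc excludes the end position: rows 0-1, columns 0-2
print(df.iloc[0:2, 0:3])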
Modifying Entries in DataFrame
You can update entries using .loc[] and .iloc[] combined with assignment.
# Modify a single value
df.loc[1, 'Score'] = 92
# Modify multiple values
df.loc[[0, 2], 'Score'] = [87, 80]
# Using iloc
df.iloc[3, 1] = 33 # Updates David's age
print(df)
Updated DataFrame:
| | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 24 | New York | 87 |
| 1 | Bob | 27 | Paris | 92 |
| 2 | Charlie | 22 | London | 80 |
| 3 | David | 33 | Berlin | 88 |
Visualizing Changes
Let’s see the score distribution as a simple bar chart using matplotlib.
import matplotlib.pyplot as plt
plt.bar(df['Name'], df['Score'], color='skyblue')
plt.xlabel('Name')
plt.ylabel('Score')
plt.title('Scores of Individuals')
plt.show()
This visualization helps confirm the updates made to our data.
Summary Table
| Operation | Code Example | Description |
|---|---|---|
| Access a column | df['Name'] | Get column as Series |
| Access a row by label | df.loc[1] | Returns Series |
| Access a row by index | df.iloc[1] | Returns Series |
| Access specific entry | df.loc[2, 'City'] | Get Charlie’s city |
| Access multiple rows/columns | df.loc[[0, 2], ['Name', 'Score']] | Subset rows/cols |
| Modify an entry with loc/iloc | df.loc[1, 'Score'] = 92 | Updates Bob’s score |
Some Takeaways
- Use square brackets to quickly access rows or columns.
- Use .loc[] for labels and .iloc[] for indices.
- Both methods can select or modify entries efficiently.
- Mastering .loc and .iloc is essential for interactive data exploration and data manipulation in Pandas.
Accessing and Modifying DataFrames (condition-based indexing)
Creating a Sample DataFrame
Let’s use a small DataFrame representing store purchases for easy experimentation.
import pandas as pd
data = {
'customer_id': [1, 2, 3, 4, 5],
'product': ['apple', 'banana', 'cherry', 'date', 'elderberry'],
'quantity': [2, 1, 5, 3, 1],
'price': [0.99, 0.59, 2.99, 1.49, 4.99]
}
df = pd.DataFrame(data)
df
Output:
| customer_id | product | quantity | price |
|---|---|---|---|
| 1 | apple | 2 | 0.99 |
| 2 | banana | 1 | 0.59 |
| 3 | cherry | 5 | 2.99 |
| 4 | date | 3 | 1.49 |
| 5 | elderberry | 1 | 4.99 |
Accessing & Modifying DataFrame Values
Access by Index and Labels
- .loc: Access by label (row and column names).
- .iloc: Access by position (row and column numbers).
# Using .loc (label)
print(df.loc[2, 'product']) # Output: cherry
# Using .iloc (position)
print(df.iloc[0, 2]) # Output: 2 (quantity of first row)
Logical Indexing in DataFrames
Logical indexing lets you filter rows by specific conditions.
# Find all rows where quantity is greater than 1
filtered = df[df['quantity'] > 1]
print(filtered)
Output:
| customer_id | product | quantity | price |
|---|---|---|---|
| 1 | apple | 2 | 0.99 |
| 3 | cherry | 5 | 2.99 |
| 4 | date | 3 | 1.49 |
💡 Key Insight: Logical indexing is crucial for filtering and selecting subsets of data efficiently.
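Conditions can also be combined with & (and) and | (or); each condition needs its own parentheses. A small sketch on the same data:
# Rows where quantity is greater than 1 AND price is below $2.00
bulk_and_cheap = df[(df['quantity'] > 1) & (df['price'] < 2.00)]
print(bulk_and_cheap)  # apple and date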
➕ Adding Columns
Creating new columns in a DataFrame is simple and intuitive.
# Add a 'rating' column
df['rating'] = [4.5, 4.0, 4.2, 4.7, 4.8]
print(df)
Output:
| customer_id | product | quantity | price | rating |
|---|---|---|---|---|
| 1 | apple | 2 | 0.99 | 4.5 |
| 2 | banana | 1 | 0.59 | 4.0 |
| 3 | cherry | 5 | 2.99 | 4.2 |
| 4 | date | 3 | 1.49 | 4.7 |
| 5 | elderberry | 1 | 4.99 | 4.8 |
➖ Dropping Columns
Use .drop() to remove columns. By default, it returns a new DataFrame.
# Drop the 'customer_id' column
df_dropped = df.drop('customer_id', axis=1)
print(df_dropped)
# Drop in place
df.drop('customer_id', axis=1, inplace=True)
print(df)
Dropping Rows
You can remove rows by their index number.
# Drop the row at index 1
df_dropped_row = df.drop(1, axis=0)
print(df_dropped_row)
Resetting Indices
After dropping rows, the index might be non-sequential. Reset it using .reset_index().
# Reset the index after dropping a row
df_reset = df.drop(1).reset_index(drop=True)
print(df_reset)
Copying DataFrames
To modify data without affecting the original, use .copy().
# Create a copy of the DataFrame
df_copy = df.copy()
# Changes to df_copy do NOT affect df
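A quick sanity check that the copy really is independent (a sketch; the printed value reflects the edits made earlier in this section):
# Modify the copy and confirm the original is unchanged
df_copy.loc[0, 'price'] = 0.0
print(df.loc[0, 'price'])  # still the original price (0.99)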
Recap
✅ Access DataFrame values using .loc and .iloc.
✅ Filter rows efficiently with logical indexing.
✅ Add new columns easily with direct assignment.
✅ Drop columns/rows with .drop(); use inplace=True to modify the original DataFrame.
✅ Reset indices when needed using .reset_index().
✅ Copy DataFrames safely using .copy() to prevent unintended changes.
Combining DataFrames
In data science, you’ll often work with data spread across multiple sources. Combining these datasets efficiently is crucial for comprehensive analysis. Pandas offers several flexible methods for merging, joining, and concatenating DataFrames.
Sample Data
We’ll use simple artificial datasets for demonstration.
# Sample customer data
customers = pd.DataFrame({
'customerID': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Carol', 'David']
})
# Sample purchase data
purchases = pd.DataFrame({
'customerID': [2, 4, 4, 5],
'item': ['Book', 'Pen', 'Pencil', 'Notebook']
})
# Another DataFrame with different columns
rewards = pd.DataFrame({
'customerID': [1, 2, 3, 5],
'points': [100, 150, 200, 50]
})
print("Customers:\n", customers)
print("\nPurchases:\n", purchases)
print("\nRewards:\n", rewards)
Concatenation of DataFrames
Concatenation stacks DataFrames by rows (axis=0) or by columns (axis=1).
Concatenating by Rows
Useful when DataFrames share similar columns.
result_row = pd.concat([customers, rewards], ignore_index=True)
print(result_row)
Result:
| customerID | name | points |
|---|---|---|
| 1 | Alice | NaN |
| 2 | Bob | NaN |
| 3 | Carol | NaN |
| 4 | David | NaN |
| 1 | NaN | 100.0 |
| 2 | NaN | 150.0 |
| 3 | NaN | 200.0 |
| 5 | NaN | 50.0 |
💡 Note: Columns not shared between DataFrames are filled with NaN.
Concatenating by Columns
result_col = pd.concat([customers, rewards], axis=1)
print(result_col)
⚠️ Be careful: axis=1 concatenation aligns rows by index, so mismatched or differently ordered indexes can introduce NaN values or pair unrelated rows.
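If the two frames describe the same entities, it is usually safer to align them on a shared key before concatenating. A sketch using customerID as the index:
# Align on customerID instead of the default positional index
aligned = pd.concat([customers.set_index('customerID'), rewards.set_index('customerID')], axis=1)
print(aligned)  # name and points aligned by ID; NaN where one side has no entry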
Merging DataFrames
Merging aligns DataFrames on common column values (like SQL joins). Controlled by on and how parameters.
Inner Merge
Keeps only rows with matching values.
merged_inner = pd.merge(customers, purchases, on='customerID', how='inner')
print(merged_inner)
Result:
| customerID | name | item |
|---|---|---|
| 2 | Bob | Book |
| 4 | David | Pen |
| 4 | David | Pencil |
Outer Merge
Includes all rows from both DataFrames; fills missing with NaN.
merged_outer = pd.merge(customers, purchases, on='customerID', how='outer')
print(merged_outer)
Result:
| customerID | name | item |
|---|---|---|
| 1 | Alice | NaN |
| 2 | Bob | Book |
| 3 | Carol | NaN |
| 4 | David | Pen |
| 4 | David | Pencil |
| 5 | NaN | Notebook |
Left and Right Merges
- Left Join: Keep all rows from customers, bring matching ones from purchases.
- Right Join: Keep all rows from purchases, bring matching ones from customers.
merged_left = pd.merge(customers, purchases, on='customerID', how='left')
print("Left Merge:\n", merged_left)
merged_right = pd.merge(customers, purchases, on='customerID', how='right')
print("Right Merge:\n", merged_right)
Handling Column Name Conflicts
If both DataFrames share non-key column names, Pandas adds suffixes.
df1 = pd.DataFrame({'ID': [1, 2], 'score': [90, 85]})
df2 = pd.DataFrame({'ID': [1, 2], 'score': [80, 88]})
pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))
Result:
| ID | score_left | score_right |
|---|---|---|
| 1 | 90 | 80 |
| 2 | 85 | 88 |
Joining DataFrames (Index-Based)
.join() merges DataFrames using their index — handy when data is already indexed.
# Set index before joining
customers_indexed = customers.set_index('customerID')
rewards_indexed = rewards.set_index('customerID')
joined_df = customers_indexed.join(rewards_indexed, how='outer')
print(joined_df)
Result:
| customerID | name | points |
|---|---|---|
| 1 | Alice | 100.0 |
| 2 | Bob | 150.0 |
| 3 | Carol | 200.0 |
| 4 | David | NaN |
| 5 | NaN | 50.0 |
Key Takeaways
✅ Concatenation → Stack DataFrames (vertically or horizontally).
✅ Merging → Combine on column keys (like SQL joins).
✅ Joining → Merge on index, ideal for already indexed data.
✅ Suffixes help when merging overlapping column names.
✅ Always inspect alignment and NaN values after merging!
Saving and Loading Datasets
Sample Dataset
Let’s create a small CSV and Excel dataset to use for demonstrations.
Sample Data: products.csv
ProductID,ProductName,Category,Price,Stock
1001,Widget A,Widgets,25.5,120
1002,Widget B,Widgets,35.0,60
1003,Gadget X,Gadgets,99.99,25
1004,Doohickey Y,Doohickeys,7.95,200
1005,Gizmo Z,Gizmos,49.5,0
💡 You can easily create this file in Colab or locally using any text editor.
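If you prefer to create the file programmatically (for example, inside a Colab cell), a minimal sketch:
# Write the sample CSV into the working directory
csv_text = """ProductID,ProductName,Category,Price,Stock
1001,Widget A,Widgets,25.5,120
1002,Widget B,Widgets,35.0,60
1003,Gadget X,Gadgets,99.99,25
1004,Doohickey Y,Doohickeys,7.95,200
1005,Gizmo Z,Gizmos,49.5,0
"""
with open('products.csv', 'w') as f:
    f.write(csv_text)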
Creating and Viewing DataFrames
Let’s start by manually creating a DataFrame to understand the basics.
import pandas as pd
# Create a small DataFrame
data = {
'ProductID': [1001, 1002, 1003],
'ProductName': ['Widget A', 'Widget B', 'Gadget X'],
'Category': ['Widgets', 'Widgets', 'Gadgets'],
'Price': [25.5, 35.0, 99.99],
'Stock': [120, 60, 25]
}
df = pd.DataFrame(data)
print(df)
Output:
| ProductID | ProductName | Category | Price | Stock |
|---|---|---|---|---|
| 1001 | Widget A | Widgets | 25.5 | 120 |
| 1002 | Widget B | Widgets | 35.0 | 60 |
| 1003 | Gadget X | Gadgets | 99.99 | 25 |
Loading Data from CSV and Excel
For larger datasets, it’s more efficient to load data from files.
Load from CSV
# Assuming 'products.csv' is uploaded in your working directory
df = pd.read_csv('products.csv')
print(df.head())
Load from Excel
# Create and save an Excel file (writing .xlsx requires the openpyxl package)
df.to_excel('products.xlsx', index=False)
# Read it back
df_excel = pd.read_excel('products.xlsx')
print(df_excel.head())
Google Colab: Working with Drive
If you’re using Google Colab, mount your Drive to access files easily.
from google.colab import drive
drive.mount('/content/drive')
# Update the path to your file in Drive
file_path = '/content/drive/My Drive/data/products.csv'
df = pd.read_csv(file_path)
Key Points:
- Mount Google Drive for file access.
- Adjust file paths according to your Drive’s folder structure.
Exploring DataFrames
Once loaded, you can explore and inspect your data using Pandas’ built-in methods.
# View first few rows
print(df.head())
# View info about columns and data types
print(df.info())
# Display summary statistics
print(df.describe())
Saving DataFrames
After data manipulation or analysis, save your DataFrame back to a file.
Save as CSV
df.to_csv('products_updated.csv', index=False)
Save as Excel
df.to_excel('products_updated.xlsx', index=False)
💡 Tip: Always set index=False to prevent Pandas from saving index values as a separate column.
Key Takeaways
✅ Create DataFrames manually or by loading data from CSV/Excel.
✅ Explore your data with head(), info(), and describe().
✅ Save results back to CSV or Excel for further use.
✅ When using Google Colab, remember to mount Google Drive for file access.
Pandas Functions
Setup and Sample Data
Let’s create a small DataFrame for our analysis.
import pandas as pd
# Sample data
data = {
'stock': ['Apple', 'SNI', 'TJX', 'ZTS'] * 4,
'date': ['2023-01-01'] * 4 + ['2023-01-02'] * 4 + ['2023-01-03'] * 4 + ['2023-01-04'] * 4,  # each date covers all four stocks
'price': [175.00, 45.60, 74.59, 28.40, 178.25, 46.10, 74.59, 29.00,
179.26, 46.10, 69.08, 30.00, 150.00, 46.10, 69.08, 32.00]
}
df = pd.DataFrame(data)
df
Sample Output:
| stock | date | price |
|---|---|---|
| Apple | 2023-01-01 | 175.00 |
| SNI | 2023-01-01 | 45.60 |
| TJX | 2023-01-01 | 74.59 |
| ZTS | 2023-01-01 | 28.40 |
| … | … | … |
Viewing DataFrame Rows
Quickly inspect your dataset using head() and tail():
# First five rows
df.head()
# Last five rows
df.tail()
Understanding Data Structure
Use Pandas utilities to get a summary of the dataset.
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
df.info()
Explanation:
- shape: Returns the number of rows and columns
- columns: Lists column names
- info(): Displays data types and non-null counts
Data Types in Pandas
Understanding column data types helps guide your analysis.
print(df.dtypes)
Expect text columns (object) for stock names and dates, and numerical (float64) for prices.
Extracting Columns & Performing Calculations
You can select a column using bracket notation (returns a Series).
prices = df['price']
print(prices.head())
Statistical Measures
Pandas makes it easy to compute descriptive statistics:
print("Min price:", prices.min())
print("Max price:", prices.max())
print("Mean price:", prices.mean())
print("Median price:", prices.median())
print("Mode price:", prices.mode().tolist())
Unique Values and Value Counts
Explore distinct values and their frequencies.
# Unique stock names
print(df['stock'].unique())
# Frequency of each stock
print(df['stock'].value_counts())
# Proportional frequency (percentage)
print(df['stock'].value_counts(normalize=True))
Handling Multiple Modes
Sometimes a column has more than one most frequent value.
modes = df['price'].mode()
print("Mode(s):", modes.tolist())
Saving & Loading DataFrames
Easily export and reload your data with Pandas.
# Save to CSV
df.to_csv('sample_stocks.csv', index=False)
# Load the saved file
df_loaded = pd.read_csv('sample_stocks.csv')
print(df_loaded.head())
Key Takeaways
Using Pandas, you can efficiently analyze and summarize tabular data.
✅ Create and preview DataFrames
✅ Inspect structure and data types
✅ Extract and compute statistics
✅ Explore unique values and frequencies
✅ Save and reload datasets
Pandas Functions (groupby)
Sample Dataset
Let’s create a small fictional stock price dataset.
import pandas as pd
data = {
'Stock': ['AAPL', 'AAPL', 'AAPL', 'SNI', 'SNI', 'TJX', 'TJX', 'TJX', 'ZTS', 'ZTS'],
'Price': [110, 108, 109, 72, 70, 65, 66, 67, 44, 46]
}
df = pd.DataFrame(data)
df
Grouping Data and Calculating Mean & Median
Grouping is key to analyzing subsets of your data.
Let’s calculate the mean and median stock prices for each stock.
# Group by 'Stock' and calculate the mean price
mean_prices = df.groupby('Stock')['Price'].mean()
print("Mean Prices:\n", mean_prices)
# Group by 'Stock' and calculate the median price
median_prices = df.groupby('Stock')['Price'].median()
print("Median Prices:\n", median_prices)
Sample Output:
Mean Prices:
Stock
AAPL 109.0
SNI 71.0
TJX 66.0
ZTS 45.0
Name: Price, dtype: float64
Median Prices:
Stock
AAPL 109.0
SNI 71.0
TJX 66.0
ZTS 45.0
Name: Price, dtype: float64
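Both statistics can be computed in one pass with .agg, which returns a small summary table (a quick sketch):
# Mean and median per stock in a single groupby
summary = df.groupby('Stock')['Price'].agg(['mean', 'median'])
print(summary)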
Writing and Applying Custom Functions
You can define your own transformation functions and use the .apply() method to apply them to DataFrame columns.
# Custom function to increase prices by 10%
def increase_by_10_percent(x):
    return x * 1.10
# Apply the function to 'Price' and create a new column
df['Increased_Price'] = df['Price'].apply(increase_by_10_percent)
df
Output Preview:
| Stock | Price | Increased_Price |
|---|---|---|
| AAPL | 110 | 121.0 |
| AAPL | 108 | 118.8 |
| AAPL | 109 | 119.9 |
| SNI | 72 | 79.2 |
| … | … | … |
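For simple arithmetic like this, a vectorized expression is typically faster than .apply and reads just as well; a sketch producing the same column:
# Same result without .apply: operate on the whole column at once
df['Increased_Price'] = df['Price'] * 1.10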
Sorting DataFrames
Sorting helps you rank or order your data — for example, finding the highest prices.
# Sort by 'Increased_Price' in descending order
df_sorted = df.sort_values(by='Increased_Price', ascending=False)
df_sorted
Output Preview:
| Stock | Price | Increased_Price |
|---|---|---|
| AAPL | 110 | 121.0 |
| AAPL | 109 | 119.9 |
| … | … | … |
| ZTS | 44 | 48.4 |
Key Takeaways
✅ Use groupby() to aggregate data
✅ Calculate mean and median by group
✅ Apply custom functions using .apply()
✅ Create new columns with transformed data
✅ Sort DataFrames based on computed columns
Pandas Functions (datetime)
Working with date and time information is essential in many data science projects. The pandas library provides powerful tools for datetime manipulation. In this section, we’ll cover:
- Converting string columns to datetime
- Handling different date formats
- Extracting date components (year, month, day)
- Performing date comparisons and calculations
Sample Data
Let’s start by creating a small CSV file mimicking a real-world dataset:
# save this as sample_dates.csv
id,date_purchased
1,01/12/2023
2,15/02/2023
3,23/07/2022
4,28/12/2020
5,03/03/2021
This CSV has two columns: id and date_purchased, where date_purchased is a string in dd/mm/yyyy format.
Load and Inspect Data
import pandas as pd
# Create a DataFrame directly
data = {
'id': [1, 2, 3, 4, 5],
'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)
print(df)
Converting Strings to Datetime
By default, pandas treats date columns as strings, preventing date-specific operations.
Use pd.to_datetime() to fix this:
# Convert the 'date_purchased' column to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)
print(df.dtypes)
💡 Note: dayfirst=True tells pandas that the date format is DD/MM/YYYY, which is common outside the US.
Handling Different Date Formats
Data from multiple regions may have different formats. Specify format settings to avoid confusion.
# US format: MM/DD/YYYY
us_dates = pd.Series(['12/01/2023', '02/15/2023'])
pd.to_datetime(us_dates, dayfirst=False)
# European format: DD/MM/YYYY
eu_dates = pd.Series(['01/12/2023', '15/02/2023'])
pd.to_datetime(eu_dates, dayfirst=True)
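For full control you can pass an explicit format string instead of dayfirst, which removes any ambiguity (a sketch using the Series defined above):
# Explicit formats leave no room for misinterpretation
pd.to_datetime(eu_dates, format='%d/%m/%Y')  # day/month/year
pd.to_datetime(us_dates, format='%m/%d/%Y')  # month/day/year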
Extracting Year, Month, and Day
Once converted, datetime columns let you extract useful components:
df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day
print(df)
Creating New Columns from Datetime
Now, year, month, and day columns can help with time-based analysis.
import matplotlib.pyplot as plt
# Visualizing purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()
This produces a bar chart summarizing purchases per year.
Comparing Dates & Datetime Arithmetic
You can filter and compute date-based metrics easily.
# Compare: Which purchases are after Jan 1, 2022?
after_2022 = df[df['date_purchased'] > '2022-01-01']
print(after_2022)
# Calculate days since purchase from today
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df[['id', 'date_purchased', 'days_since_purchase']])
Key Takeaways
- ✅ Always convert date columns to pandas datetime type.
- ✅ Specify the date format using parameters like dayfirst to avoid misinterpretation.
- ✅ Easily extract year, month, and day for deeper analysis.
- ✅ Perform date comparisons and calculations effortlessly.
💻 Full Example Notebook
You can copy this entire block into Google Colab and run it end-to-end.
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'id': [1, 2, 3, 4, 5],
'date_purchased': ['01/12/2023', '15/02/2023', '23/07/2022', '28/12/2020', '03/03/2021']
}
df = pd.DataFrame(data)
# Convert to datetime
df['date_purchased'] = pd.to_datetime(df['date_purchased'], dayfirst=True)
# Extract components
df['year'] = df['date_purchased'].dt.year
df['month'] = df['date_purchased'].dt.month
df['day'] = df['date_purchased'].dt.day
# Visualize purchases by year
df['year'].value_counts().sort_index().plot(kind='bar')
plt.title('Number of Purchases by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()
# Datetime arithmetic
df['days_since_purchase'] = (pd.Timestamp('today') - df['date_purchased']).dt.days
print(df)