Pandas is a powerful data manipulation library in Python that simplifies working with structured data. In this article, we’ll walk through three crucial topics that every data enthusiast or professional must master:
Reading, Writing, and Selecting Data with Pandas
Data Cleaning and Handling Missing Values in Pandas
Aggregation, Grouping, and Combining Data in Pandas
Let’s dive right in! 🚀
One of the most common formats for structured data is CSV (comma-separated values). Pandas provides the read_csv() function to load a CSV file into a DataFrame.
import pandas as pd
# Reading a CSV file
df = pd.read_csv("data.csv")
print(df.head()) # First 5 rows
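read_csv() also accepts options that control which columns are loaded and how they are typed. The sketch below is illustrative only: it uses a small in-memory CSV (a hypothetical stand-in for "data.csv") so it runs without any file on disk, and the column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for "data.csv"
csv_text = io.StringIO(
    "name,age,joined\n"
    "Alice,25,2021-01-15\n"
    "Bob,31,2020-07-01\n"
)

df_csv = pd.read_csv(
    csv_text,
    usecols=["name", "age", "joined"],  # load only the columns you need
    dtype={"age": "int64"},             # set a column type explicitly
    parse_dates=["joined"],             # parse this column as datetimes
)
print(df_csv.dtypes)
```

Setting types at read time tends to surface bad values early instead of deep inside a later calculation.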
You can also read from Excel, JSON, and SQL databases:
# Excel (.xlsx files need an engine such as openpyxl installed)
df = pd.read_excel("data.xlsx")
# JSON
df = pd.read_json("data.json")
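For SQL sources, one option is pd.read_sql(), which takes a query and a database connection. The following is a minimal sketch using an in-memory SQLite database; the employees table and its columns are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical "employees" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 25), ("Bob", 31)],
)
conn.commit()

# Run a query and load the result set into a DataFrame
df_sql = pd.read_sql("SELECT name, age FROM employees ORDER BY name", conn)
conn.close()
print(df_sql)
```

For databases other than SQLite, pandas generally expects a SQLAlchemy connection rather than a raw driver connection.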
Save your DataFrame to various formats using:
# Save to CSV
df.to_csv("output.csv", index=False)
# Save to Excel
df.to_excel("output.xlsx", index=False)
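To see why index=False is usually what you want, here is a small round-trip sketch. It writes to an in-memory buffer instead of a real file, and the DataFrame contents are invented for the example.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 31]})

# Write without the index, then read the same text back
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# Write with the default index=True: the index becomes an extra unnamed column
buffer_with_index = io.StringIO()
df_out.to_csv(buffer_with_index)
buffer_with_index.seek(0)
df_with_index = pd.read_csv(buffer_with_index)

print(df_back.columns.tolist())
print(df_with_index.columns.tolist())
```

The second read picks up a column named "Unnamed: 0", which is the saved index coming back as data.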
Accessing Columns
df['column_name']
df[['col1', 'col2']]
Accessing Rows
df.loc[0] # By index label (the row whose label is 0)
df.iloc[0] # By integer position (the first row)
Filtering Rows
# All rows where age > 25
df[df['age'] > 25]
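Filters can be combined, and .loc lets you filter rows and pick columns in one step. The sketch below uses a made-up DataFrame; the column names and values are assumptions for illustration.

```python
import pandas as pd

df_people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [25, 31, 40, 22],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
older_in_pune = df_people[(df_people["age"] > 24) & (df_people["city"] == "Pune")]

# .loc takes a row condition and a list of column labels together
names_over_25 = df_people.loc[df_people["age"] > 25, ["name", "age"]]

# isin() matches a column against a list of allowed values
metro = df_people[df_people["city"].isin(["Delhi", "Mumbai"])]

print(older_in_pune)
print(names_over_25)
print(metro)
```

Python's plain and/or keywords do not work on whole columns, which is why the element-wise & and | operators are used here.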
df.isnull().sum()
df.dropna(inplace=True) # Drop rows with any missing values
You can also drop rows/columns selectively:
df.dropna(subset=['column1'], inplace=True)
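dropna() has a few more knobs for selective dropping: axis, how, and thresh. Here is a small sketch on an invented DataFrame; it returns new objects rather than using inplace=True so each result can be compared side by side.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing
no_empty_cols = df_gaps.dropna(axis=1, how="all")

# Drop rows in which every value is missing
no_empty_rows = df_gaps.dropna(how="all")

# Keep only rows with at least 2 non-missing values
at_least_two = df_gaps.dropna(thresh=2)

print(no_empty_cols.columns.tolist())
print(no_empty_rows.index.tolist())
print(at_least_two.index.tolist())
```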
# Fill with a constant
df.fillna(0, inplace=True)
# Fill with mean of a column
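# (assign the result back; calling inplace=True on a selected column may not update df in newer pandas versions)
df['salary'] = df['salary'].fillna(df['salary'].mean())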
# Replace specific values
df.replace("N/A", pd.NA, inplace=True)
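Placeholder strings often leave a numeric column stored as text, so replacing them is usually followed by a type conversion. The sketch below shows one way to do that; the placeholder values ("N/A" and "-") and the salary figures are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column where missing values were typed in as text
df_raw = pd.DataFrame({"salary": ["50000", "N/A", "62000", "-"]})

# Replace several placeholder strings with NaN in one call
cleaned = df_raw.replace(["N/A", "-"], np.nan)

# Convert the text column to numbers; anything unparseable becomes NaN
cleaned["salary"] = pd.to_numeric(cleaned["salary"], errors="coerce")

# Now the mean can be computed and used to fill the gaps
cleaned["salary"] = cleaned["salary"].fillna(cleaned["salary"].mean())
print(cleaned)
```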
# Putting it together: a small DataFrame with gaps in every column
import numpy as np
import pandas as pd
data = {
'name': ['Alice', 'Bob', 'Charlie', np.nan],
'age': [25, np.nan, 30, 22],
'salary': [50000, 60000, np.nan, 40000]
}
df = pd.DataFrame(data)
# Fill each column with a value that suits it
df.fillna({
'name': 'Unknown',
'age': df['age'].mean(),
'salary': df['salary'].median()
}, inplace=True)
print(df)
# Get summary statistics
df.describe()
# Mean of a column
df['salary'].mean()
# Group by department and calculate average salary
df.groupby('department')['salary'].mean()
You can also apply multiple aggregations:
df.groupby('department')['salary'].agg(['mean', 'max', 'min'])
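When you want readable column names in the result, named aggregation is one option. This is a minimal sketch with an invented employee table; the department names and salaries are assumptions.

```python
import pandas as pd

df_emp = pd.DataFrame({
    "department": ["Eng", "Eng", "Sales", "Sales", "Sales"],
    "salary": [70000, 90000, 40000, 50000, 60000],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = (
    df_emp.groupby("department")
    .agg(
        avg_salary=("salary", "mean"),
        max_salary=("salary", "max"),
        headcount=("salary", "count"),
    )
    .reset_index()  # turn the group key back into a regular column
)
print(summary)
```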
Concatenation
pd.concat([df1, df2], axis=0)
Merging
pd.merge(df1, df2, on='employee_id', how='inner')
Joining (on index)
df1.join(df2, how='outer')
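The how argument decides which rows survive when keys do not match on both sides. Here is a small sketch comparing the options; the two tables and their employee_id values are made up so that some ids appear on only one side.

```python
import pandas as pd

left = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
right = pd.DataFrame({
    "employee_id": [2, 3, 4],
    "department": ["Eng", "Sales", "HR"],
})

# inner: only ids present in both tables
inner = pd.merge(left, right, on="employee_id", how="inner")

# left: every row from the left table, with NaN where there is no match
left_join = pd.merge(left, right, on="employee_id", how="left")

# outer: every id from either table; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both
outer = pd.merge(left, right, on="employee_id", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```

The _merge column is handy for spotting keys that failed to match before you rely on the combined table.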
df_sales = pd.DataFrame({
'store': ['A', 'B', 'C'],
'sales': [1000, 1500, 2000]
})
df_region = pd.DataFrame({
'store': ['A', 'B', 'C'],
'region': ['North', 'East', 'West']
})
# Merge both DataFrames
df_merged = pd.merge(df_sales, df_region, on='store')
print(df_merged)
# Group by region and get total sales
print(df_merged.groupby('region')['sales'].sum())
Pandas is a must-know for any data analyst or backend developer dealing with structured data. Mastering these three areas (reading and writing data, cleaning it, and aggregating and combining it) will make you faster at everyday data work and give you a clearer picture of how data pipelines fit together.