Hands-On Guide to Pandas for Data Manipulation

Pandas is a powerful Python library for data manipulation and analysis. It lets you load, clean, and transform data into formats that are convenient for analysis or modeling. This blog is a hands-on guide to using Pandas for essential data manipulation tasks such as importing data, selecting columns, filtering rows, grouping data, and joining datasets. With these skills, you will be able to work with real-world datasets more efficiently. They are useful for anyone pursuing a Python Data Science course in Delhi or looking to build a career in data analysis.

Introduction to Pandas

Pandas is an open-source library built on top of NumPy. Its key feature is that it lets you store and manipulate data in tabular form, similar to a spreadsheet. Data is held in two core structures: Series (one-dimensional) and DataFrame (two-dimensional). This makes Pandas well suited to structured data such as database tables and CSV files. In this blog, we will go through the basics of Pandas and see how it can be used to load, clean, transform and visualize data.
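To make the two structures concrete, here is a minimal sketch. The fruit names and numbers are made up purely for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# A DataFrame is a two-dimensional table whose columns are Series.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "stock": [10, 0, 4]},
    index=["apple", "banana", "cherry"],
)

print(prices["banana"])   # look up a value by its label
print(fruit["stock"])     # selecting one column returns a Series
print(fruit.shape)        # (rows, columns)
```

Selecting a single column from a DataFrame gives back a Series, which is why most Series methods also work column by column on a DataFrame.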

Installing Pandas

Pandas is included in the Anaconda distribution of Python, which is the easiest way to get started. To install Pandas using pip instead, open a terminal or command prompt and run:

pip install pandas

This installs the latest stable version of Pandas. To install a specific version, pin it, for example pip install pandas==1.2.0. To check that Pandas is installed correctly, open a Python interpreter and run import pandas. If no error appears, the installation worked.
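The check can also report which version was installed. A minimal sketch:

```python
import pandas as pd

# If the import succeeds, Pandas is installed; this prints its version string.
print(pd.__version__)
```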

Loading Data into Pandas

There are various ways to load data into Pandas. Some common ways are:

Load data from CSV/text files:

python

import pandas as pd

df = pd.read_csv('data.csv')

Load data from Excel files:

python

df = pd.read_excel('data.xlsx')  # .xlsx files need an Excel engine such as openpyxl

Load data from SQL databases:

python

import pandas as pd
import sqlite3

conn = sqlite3.connect('data.db')
# "table" is a reserved word in SQL, so a placeholder name (my_table) is used here
df = pd.read_sql_query("SELECT * FROM my_table", conn)
conn.close()

Load data from JSON files:

python

df = pd.read_json('data.json')

Load data from online API or URL:

python

df = pd.read_json('https://api.example.com/data')  # the response must be JSON that maps onto a table
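The loaders above accept many optional arguments. As a minimal sketch, the example below reads CSV text from memory so that it runs without any file on disk. The column names and values are hypothetical, and in practice the first argument would be a path such as 'data.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = """order_id,order_date,amount
1,2023-01-05,19.99
2,2023-01-06,
3,2023-01-09,42.50
"""

orders = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["order_date"],   # convert this column to datetime values
    dtype={"order_id": "int64"},  # pin a column's type up front
)

print(orders.dtypes)
print(orders)  # the empty amount on the second row is read as NaN
```

Declaring dates and types at load time tends to avoid a separate conversion step later.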

Viewing and Inspecting Data

Once data is loaded into a DataFrame, we can view and inspect it. Some common ways are:

View top/bottom rows:

python

df.head()  # first 5 rows
df.tail()  # last 5 rows

Get shape and data types:

python

df.shape   # (number of rows, number of columns)
df.dtypes  # data type of each column

Get column names/indexes:

python

df.columns  # column labels
df.index    # row labels

View summary statistics:

python

df.describe()  # count, mean, std, min, quartiles and max of the numeric columns

View a sample of the data:

python

df.sample(5)  # 5 randomly chosen rows

This helps get an overview of the data structure and content before proceeding with any manipulations.
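The inspection calls above can be tried on a small DataFrame built in memory. The data below is hypothetical and only meant to show what each call reports.

```python
import pandas as pd

# Hypothetical sample data used only for illustration.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi"],
    "sales": [250, 310, 120, 180],
    "units": [5, 7, 2, None],
})

print(df.head(2))     # first two rows
print(df.shape)       # (4, 3)
print(df.dtypes)      # units is float64 because it contains a missing value
df.info()             # column names, non-null counts, dtypes and memory usage
print(df.describe())  # summary statistics for the numeric columns
```

Note how the single missing entry in units shows up twice: as a float dtype and as a non-null count of 3 in info() and describe().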

Data Manipulation with Pandas

Pandas provides various methods to manipulate the data in DataFrames. Here are some common manipulations:

Selecting Columns:

python

df[['col1', 'col2']]

Filtering Rows:

python

df[df['col'] > 0]

Sorting values:

python

df.sort_values('col')

Grouping and Aggregation:

python

df.groupby('col').sum()  # one row per distinct value of 'col'; pass numeric_only=True to skip text columns

Joining/Merging DataFrames:

python

df1.merge(df2, on='key')  # inner join by default

Reshaping from wide to long format:

python

df = df.melt(id_vars=['id'], var_name='variable')

Adding/Removing Columns:

python

df['new_col'] = df['col1'] + df['col2']
df = df.drop(columns='col3')  # drop returns a new DataFrame, so assign the result

These are some basic yet powerful manipulations that Pandas enables on DataFrames.
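These operations are usually chained together. The sketch below combines several of them on two small, hypothetical tables (orders and customers); the column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical tables for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "price": [20.0, 5.0, 12.5, 8.0],
    "quantity": [1, 4, 2, 0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "city": ["Delhi", "Pune", "Delhi"],
})

# Add a derived column, then keep rows with a positive total.
orders["total"] = orders["price"] * orders["quantity"]
paid = orders[orders["total"] > 0]

# Join the city onto each order and sort by total, largest first.
merged = paid.merge(customers, on="customer_id", how="left")
merged = merged.sort_values("total", ascending=False)

# Sum the totals per city.
per_city = merged.groupby("city")["total"].sum()
print(per_city)
```

Each step returns a new object, so intermediate results such as paid and merged can be inspected with head() before moving on.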

Data Cleaning with Pandas

Real-world data often contains errors, missing values and inconsistencies which need cleaning before analysis. Pandas provides tools to handle these issues:

Handle missing data:

python

df.isnull().sum()  # number of missing values per column
df = df.fillna(0)  # replace every missing value with 0

Data type conversions:

python

df['col'] = df['col'].astype(int)  # raises an error if the column has missing values; fill or drop them first

Remove duplicates:

python

df = df.drop_duplicates()  # returns a new DataFrame, so assign the result

Remove outliers:

python

q1 = df['col'].quantile(0.25)
q3 = df['col'].quantile(0.75)
iqr = q3 - q1
# keep rows within 1.5 * IQR of the quartiles
df = df[(df['col'] > (q1 - 1.5 * iqr)) & (df['col'] < (q3 + 1.5 * iqr))]

Handle inconsistent/invalid data:

python

df = df[df['col'].isin(['a', 'b', 'c'])]  # keep only rows whose value is in the allowed set

Cleaning data thoroughly is an important step before analysis and modeling to get meaningful insights.
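As a minimal sketch of how these cleaning steps fit together, the example below runs them in sequence on a small, hypothetical table that contains a duplicate row, a missing value and one extreme value. It uses between(), which includes the boundary values, as a compact alternative to the two comparisons shown earlier.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and an extreme value.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5, 6],
    "value": ["10", "12", "12", None, "11", "13", "500"],
})

clean = raw.drop_duplicates()                            # drop the repeated id=2 row
clean = clean.dropna(subset=["value"])                   # drop the row with no value
clean = clean.assign(value=clean["value"].astype(int))   # strings -> integers

# IQR rule: keep values within 1.5 * IQR of the quartiles (bounds inclusive).
q1 = clean["value"].quantile(0.25)
q3 = clean["value"].quantile(0.75)
iqr = q3 - q1
in_range = clean["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]
print(clean)
```

The order matters here: the type conversion would fail if the missing value were still present, so the row is dropped first.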

Data Aggregation and Grouping

Pandas provides powerful aggregation functionality using the groupby() method. We can group data by one or more columns and aggregate using common functions like sum, mean, count etc. For example:

python

# Sum of sales for each category
df.groupby('category')['sales'].sum()

# Average revenue for each year-quarter combination
df.groupby(['year', 'quarter'])['revenue'].mean()

We can also aggregate on multiple columns at once:

python

df.groupby(['category', 'city'])[['sales', 'units']].sum()

This is useful for tasks like finding top products by city, analyzing trends over time etc. Aggregation is an essential part of exploratory data analysis.
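When several statistics are needed at once, agg() with named aggregations keeps the output columns readable. This is a minimal sketch on hypothetical data; the output column names (total_sales and so on) are chosen for the example.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "sales": [100, 150, 80, 120, 40],
    "units": [1, 3, 2, 2, 1],
})

# Each keyword is an output column: (input column, aggregation function).
summary = sales.groupby("category").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    orders=("sales", "count"),
    total_units=("units", "sum"),
)
print(summary)
```

The result has one row per category and one column per named statistic, which is convenient for sorting or plotting afterwards.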

Handling Missing Data

Real-world data often contains missing values which need to be handled properly before analysis. Pandas provides various methods to handle missing data:

Check for missing data:

python

df.isnull().sum()

Fill missing with a value:

python

df = df.fillna(0)

Fill missing forward or backward:

python

df['col'] = df['col'].ffill()  # forward fill; use .bfill() to fill backward (fillna(method=...) is deprecated in recent pandas)

Drop rows with any missing value:

python

df = df.dropna()  # returns a new DataFrame, so assign the result

Impute missing with statistical values:

python

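from sklearn.impute import SimpleImputer  # provided by scikit-learn, not Pandas

imputer = SimpleImputer(strategy='most_frequent')
# fit on the column and write the imputed values back
df[['col']] = imputer.fit_transform(df[['col']])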

Choosing the right technique depends on why the values are missing and on the goals of the analysis. Pandas is flexible enough to support each of these approaches.
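To compare the options side by side, here is a minimal sketch on a hypothetical temperature column with two gaps. It also shows interpolate(), a Pandas method not covered above, as one more option for ordered data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two missing temperatures.
readings = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "temp": [20.0, np.nan, 22.0, np.nan, 24.0],
})

missing_count = readings["temp"].isna().sum()    # number of missing values

filled_forward = readings["temp"].ffill()        # carry the last known value forward
filled_mean = readings["temp"].fillna(readings["temp"].mean())  # replace with the column mean
interpolated = readings["temp"].interpolate()    # linear interpolation between neighbors
dropped = readings.dropna(subset=["temp"])       # keep only rows that have a temperature

print(missing_count)
print(pd.DataFrame({
    "ffill": filled_forward,
    "mean": filled_mean,
    "interpolate": interpolated,
}))
```

Forward fill and interpolation assume the rows are in a meaningful order, while mean filling and dropping do not.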

Data Visualization with Pandas

Pandas has built-in plotting capabilities and also integrates well with third-party libraries like Matplotlib, Seaborn etc. Some common plots include:

Line plot:

python

df.plot(kind='line', x='year', y='sales')

Bar plot:

python

df.groupby('category')['sales'].sum().plot(kind='bar')

Histogram:

python

df['col'].plot(kind='hist')

Scatter plot:

python

df.plot(kind='scatter', x='x_col', y='y_col')

Box and whisker plot:

python

df.boxplot(column=['col1', 'col2'])

Visualizations help identify patterns, outliers and get insights from the data in a more intuitive way. Pandas integrates well with Python’s rich ecosystem of visualization libraries.

Advanced Pandas Techniques

Here are some advanced techniques in Pandas:

Merge/Join DataFrames:

python

df1.merge(df2, on='key', how='outer')  # keep keys found in either DataFrame

Reshape between long and wide format:

python

df = df.set_index(['id', 'time']).unstack()  # moves the 'time' level into the columns (long to wide)

Time series functionality:

python

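df['date'] = pd.to_datetime(df['date'])
# resample needs a datetime index, so set it first; 'M' means month-end (newer pandas versions prefer 'ME')
df.set_index('date').resample('M').mean()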

Rolling/Expanding windows:

python

df['rolling_mean'] = df['col'].rolling(7).mean()  # mean over a sliding window of 7 rows

Apply custom functions row-wise or column-wise:

python

# with axis=1 the function receives one row at a time; access values by column label
df[['c1', 'c2']].apply(lambda row: row['c1'] * row['c2'], axis=1)

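Group-wise transformations:

python

# returns a result aligned to the original rows; func is a function or a name such as 'mean'
# (this is computed per group, not run in parallel)
df.groupby('col').transform(func)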

Mastering these advanced techniques helps unlock the full power of Pandas for complex data analysis tasks.
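Several of these techniques can be seen working together in a minimal sketch. The daily sales data for two stores below is hypothetical, and pivot() is used here as an alternative to set_index().unstack() for going from long to wide format.

```python
import pandas as pd

# Hypothetical daily sales for two stores, in long format.
dates = pd.date_range("2023-01-01", periods=6, freq="D")
long_df = pd.DataFrame({
    "date": list(dates) * 2,
    "store": ["north"] * 6 + ["south"] * 6,
    "sales": [10, 12, 14, 16, 18, 20, 5, 5, 5, 5, 5, 5],
})

# Long -> wide: one row per date, one column per store.
wide = long_df.pivot(index="date", columns="store", values="sales")

# 3-day rolling mean per store (the first two rows are NaN until the window fills).
rolling = wide.rolling(3).mean()

# Downsample to 2-day bins and sum within each bin.
two_day = wide.resample("2D").sum()

# Group-wise transform: each row's share of its store's total sales.
store_totals = long_df.groupby("store")["sales"].transform("sum")
long_df["share"] = long_df["sales"] / store_totals

print(wide)
print(rolling)
print(two_day)
```

Because transform() returns one value per original row, its result can be assigned straight back as a new column, which a plain aggregation cannot do.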

Conclusion: Mastering Data Manipulation with Pandas

In this blog, we covered the basics of Pandas including data loading, inspection, cleaning, transformation, aggregation, visualization and some advanced techniques. Pandas provides a rich set of data structures and methods to efficiently manipulate structured data. With its NumPy underpinnings and integration with the Python ecosystem, Pandas has become one of the most widely used tools for data analysis and exploration.