Pandas Cheat Sheet - Studyopedia

05 Aug Pandas Cheat Sheet

Posted at 14:33h in Pandas by Studyopedia Editorial Staff 0 Comments

This Pandas Cheat Sheet will guide you through working with Pandas, from the basics to advanced topics. It is intended for students, engineers, and professionals.

Introduction

Pandas is a powerful and easy-to-use open-source tool built on top of the Python programming language. It is useful for data analysis and manipulation. Python with pandas is widely used in Statistics, Finance, Neuroscience, Economics, Web Analytics, Advertising, etc.

Features

The following are the features of the Pandas Library:

Analyze Data
Manipulate Data
Group the rows/columns of a DataFrame/Series
Plot the data
Fix inaccurate data
Clean the data

Installation

To install Pandas, use the pip package manager. Install Python and pip, and then use pip to install the Pandas library:

pip install pandas

DataFrames in Pandas

The Pandas DataFrame is a two-dimensional, tabular data structure with labeled rows and columns. The DataFrame() constructor is used to create one and has the following parameters:

data: The data to be stored in the Pandas DataFrame
index: The index values to be provided for the resultant frame.
columns: The column labels for the resultant frame, used when the data does not already provide them.
dtype: The data type to force on the frame. Only a single type is allowed.
copy: Whether to copy the input data.

Let us see how to create a Pandas DataFrame:

import pandas as pd

# Dataset
data = {
    'student': ["Amit", "John", "Jacob", "David", "Steve"],
    'rank': [1, 4, 3, 5, 2],
    'marks': [95, 70, 80, 60, 90]
}

res = pd.DataFrame(data)

print("Student Records\n\n",res)

Output

Student Records

   student  rank  marks
0    Amit     1     95
1    John     4     70
2   Jacob     3     80
3   David     5     60
4   Steve     2     90

DataFrame – Attributes and Methods

Let us see some commonly used attributes and methods of a DataFrame in Python Pandas:

dtypes: Return the dtypes in the DataFrame
ndim: Return the number of dimensions of the DataFrame
size: Return the number of elements in the DataFrame.
shape: Return the dimensionality of the DataFrame in the form of a tuple.
index: Return the index of the DataFrame
T: Transpose the rows and columns
head(): Return the first n rows (5 by default).
tail(): Return the last n rows (5 by default).
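
As a minimal sketch, the attributes and methods above can be tried on the student records from the earlier example. The DataFrame is rebuilt here so the snippet runs on its own, and the variable name df is only illustrative:

```python
import pandas as pd

# Same student records as in the DataFrame example above
df = pd.DataFrame({
    'student': ["Amit", "John", "Jacob", "David", "Steve"],
    'rank': [1, 4, 3, 5, 2],
    'marks': [95, 70, 80, 60, 90]
})

print(df.dtypes)    # data type of each column
print(df.ndim)      # number of dimensions: 2
print(df.size)      # number of elements: 15 (5 rows x 3 columns)
print(df.shape)     # (rows, columns): (5, 3)
print(df.index)     # row labels
print(df.T)         # rows and columns swapped
print(df.head(2))   # first 2 rows
print(df.tail(2))   # last 2 rows
```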

Series in Pandas

A Series in Pandas is a one-dimensional labeled array, like a column in a table. It can hold data of any type. The Series() constructor is used to create one and has the following parameters:

data: The data to be stored in the Pandas Series
index: The index labels, which should have the same length as the data.
dtype: The data type for the output Series.
name: The name to give the Series.
copy: Whether to copy the input data.

Let us now see an example to create a Pandas Series:

import pandas as pd

# Data to be stored in the Pandas Series
data = [10, 20, 40, 80, 100]

# Create a Series using the Series() method
s = pd.Series(data)

# Display the Series
print("Series: \n",s)

Output: The 0, 1, 2, 3, etc. are the index numbers, i.e. labels.

Series:
 0     10
1     20
2     40
3     80
4    100
dtype: int64

Series – Attributes and Methods

Let us see some commonly used attributes and methods of a Series in Python Pandas:

dtype: Return the dtype.
ndim: Return the Number of dimensions
size: Return the number of elements.
name: Return the name of the Series.
hasnans: Returns True if NaNs are in the series.
index: The index of the series
head(): Return the first n rows.
tail(): Return the last n rows.
info(): Display the Summary of the series
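
The following is a small sketch of these Series attributes. The values and the name "scores" are made up for illustration, and one value is deliberately missing so that hasnans has something to report:

```python
import pandas as pd

# A named Series with one missing value (None is stored as NaN)
s = pd.Series([10, 20, None, 80, 100], name="scores")

print(s.dtype)     # float64, because NaN forces a float type
print(s.ndim)      # 1
print(s.size)      # 5 (missing values are counted)
print(s.name)      # scores
print(s.hasnans)   # True
print(s.index)     # RangeIndex(start=0, stop=5, step=1)
print(s.head(2))   # first 2 elements
print(s.tail(2))   # last 2 elements
```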

Categorical Data

It is a Pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited number of possible values. Examples are gender, blood type, country affiliation, rating, etc.

Create Categorical Series: Use dtype="category" while creating a series to create a Categorical Series. Let us see an example:

import pandas as pd

# Creating a Categorical Series
s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

# Display the Series
print("Series = \n", s)

Output

Series =
 0    p
1    q
2    r
3    s
4    q
dtype: category
Categories (4, object): [p, q, r, s]

Create Categorical DataFrame: Use dtype="category" while creating a DataFrame to create a Categorical DataFrame. Let us see an example. We have created 3 categorical columns here:

import pandas as pd

# Creating a Categorical DataFrame
df = pd.DataFrame({"Cat1": list("pqrs"), "Cat2": list("pqrp"), "Cat3": list("qrrr")}, dtype="category")

# Display the DataFrame
print("DataFrame = \n", df)

# Display the datatypes
print("\nDataType of each column = \n", df.dtypes)

Output

DataFrame =
   Cat1 Cat2 Cat3
0    p    p    q
1    q    q    r
2    r    r    r
3    s    p    r

DataType of each column =
 Cat1    category
Cat2    category
Cat3    category
dtype: object

Working with Categories

Learn how to work with Categories in Pandas:

Append new categories: To append new categories, use the add_categories() method in Python Pandas. Let us see an example:

import pandas as pd

# Creating a Categorical Series
s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

# Display the Series
print("Series = \n", s)

# Append Category
s = s.cat.add_categories([5])

# Display the updated category
print("\nUpdated Categories = ",s.cat.categories)

Output

Series =
 0    p
1    q
2    r
3    s
4    q
dtype: category
Categories (4, object): [p, q, r, s]

Updated Categories =  Index(['p', 'q', 'r', 's', 5], dtype='object')

Remove a category:
To remove a category, use the remove_categories() method in Python Pandas. Let us see an example:

import pandas as pd

# Creating a Categorical Series
s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

# Display the Series
print("Series\n", s)

# Remove a Category
# Display the updated category after removing a specific category
print("\nUpdated Categories\n",s.cat.remove_categories("r"))

Output

Series
 0    p
1    q
2    r
3    s
4    q
dtype: category
Categories (4, object): [p, q, r, s]

Updated Categories
 0      p
1      q
2    NaN
3      s
4      q
dtype: category
Categories (3, object): [p, q, s]

Read CSV

The read_csv() function is used to read a CSV file in Pandas. Let’s say we have a CSV file Students.csv. We will read it now:

import pandas as pd

# Input CSV file: Students.csv
# Loading CSV in the DataFrame
df = pd.read_csv('Students.csv')

# Display the CSV file records
print("Our DataFrame =\n",df)

Output

Our DataFrame =
   Student  Rank  Marks
0    Amit     1     95
1   Virat     2     90
2   David     3     80
3    Will     4     75
4   Steve     5     65

String Operations on Text Data

The following string operations can be performed on text data in Pandas. They are accessed through the str accessor of a Series, for example s.str.lower():

lower(): Convert text data to lowercase
upper(): Convert text data to uppercase
title(): Convert text data to title case, i.e. the first letter of each word is capitalized
len(): Get the length of each element in the Series
count(): Count the occurrences of a pattern in each element
contains(): Check whether each element contains a given value
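
Here is a short sketch of these operations, assuming they are called through the str accessor of a Series. The names in the Series are made up for illustration:

```python
import pandas as pd

# Illustrative names (made up for this sketch)
s = pd.Series(["amit kumar", "JOHN smith", "Jacob"])

print(s.str.lower())             # all lowercase
print(s.str.upper())             # all uppercase
print(s.str.title())             # first letter of each word capitalized
print(s.str.len())               # length of each string
print(s.str.count("a"))          # occurrences of "a" in each string
print(s.str.contains("smith"))   # True where the text contains "smith"
```

Each call returns a new Series, so the original text data is left unchanged.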

Remove Whitespace

To remove whitespace on text data in a Series or DataFrame, use the following methods in Python Pandas:

strip(): Strip whitespace from the left and right
lstrip(): Strip whitespace from only the left side
rstrip(): Strip whitespace from only the right side
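
A minimal sketch of the three methods on a Series, again assuming the str accessor. The padded strings are illustrative:

```python
import pandas as pd

# Strings padded with spaces on different sides
s = pd.Series(["  Amit  ", "  John", "Jacob  "])

both = s.str.strip()    # whitespace removed from both sides
left = s.str.lstrip()   # whitespace removed from the left side only
right = s.str.rstrip()  # whitespace removed from the right side only

print(both.tolist())    # ['Amit', 'John', 'Jacob']
print(left.tolist())    # ['Amit  ', 'John', 'Jacob  ']
print(right.tolist())   # ['  Amit', '  John', 'Jacob']
```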

Sorting

Sort the DataFrame in Pandas using the sort_values() method:

Sort the Pandas DataFrame: To sort the DataFrame by one or more columns, use the sort_values() method and pass the column name(s) through the by parameter. The default order is ascending.
Sort the Pandas DataFrame in Descending Order: To sort the DataFrame in descending order, set the ascending parameter of the sort_values() method to False.
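
Both sort orders can be sketched as follows, reusing the student marks from the DataFrame example. The variable names asc and desc are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ["Amit", "John", "Jacob", "David", "Steve"],
    'marks': [95, 70, 80, 60, 90]
})

asc = df.sort_values(by='marks')                    # ascending (default)
desc = df.sort_values(by='marks', ascending=False)  # descending

print(asc)
print(desc)
```

sort_values() returns a new DataFrame, so df keeps its original row order.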

Indexing

Indexing means selecting specific rows and columns of data from a DataFrame. A DataFrame includes columns, an index, and data. Let us see the main ways to index:

Indexing in Pandas using the indexing operator: We can directly use [], i.e. the indexing operator, in Pandas to retrieve a column by name, or rows with a slice.
Indexing in Pandas using loc[]: To retrieve rows and columns by label, for example a single row, use loc[] in Pandas.
Indexing in Pandas using iloc[]: To retrieve rows and columns by integer position, use iloc[] in Pandas.
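
The three approaches can be sketched on a small DataFrame. The student names are used as row labels here purely as an assumption for this example, so that loc[] has labels to look up:

```python
import pandas as pd

df = pd.DataFrame(
    {'rank': [1, 4, 3], 'marks': [95, 70, 80]},
    index=["Amit", "John", "Jacob"]  # row labels chosen for this sketch
)

print(df['marks'])       # [] selects a column by name
print(df.loc["John"])    # loc[] selects a row by its label
print(df.iloc[0])        # iloc[] selects a row by integer position
print(df.iloc[0:2, 1])   # rows at positions 0-1 of the column at position 1 (marks)
```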

Group the Data

In Pandas, group data in a DataFrame and perform operations on it:

Split the object and combine the result: The groupby() method is used in Pandas to split the object. It groups the rows that share the same value in one or more columns.
Iterate the Group: Loop through the groups returned by groupby() using the for-in loop. Each iteration gives the group name and the rows of that group.
View the Group: Use the groups property in Python Pandas to view the groups.
Perform Aggregation Operations on Groups: After grouping, we can perform operations on the grouped data using the agg() method, for example to get the mean or the size of each group:
- Get the mean of grouped data: To get the mean of the grouped data, first group, and then use the agg() method with numpy.mean or the string "mean".
- Get the size of each group: To get the size of each group, first group, for example by a Player column using groupby(), and then use the agg() method with numpy.size or call size() on the grouped object.
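
The grouping steps above can be sketched as follows. The Player column name comes from the text, while the Runs column and all values are hypothetical. The string "mean" is passed to agg() in place of numpy.mean, since recent Pandas versions warn when NumPy functions are passed there:

```python
import pandas as pd

# Hypothetical player scores; only the Player column name comes from the text
df = pd.DataFrame({
    'Player': ["Virat", "Steve", "Virat", "Steve", "David"],
    'Runs': [50, 70, 90, 30, 40]
})

grouped = df.groupby('Player')

# View the groups: maps each group key to its row labels
print(grouped.groups)

# Iterate over the groups
for name, group in grouped:
    print(name)
    print(group)

# Aggregate: mean of Runs per player, and the size of each group
mean_runs = grouped['Runs'].agg('mean')
sizes = grouped.size()

print(mean_runs)
print(sizes)
```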

Statistical Functions

We can easily perform statistical operations using the statistical functions in Python Pandas. They can be applied to a Series or DataFrame:

sum(): Return the sum of the values.
count(): Return the count of non-empty values.
max(): Return the maximum of the values.
min(): Return the minimum of the values.
mean(): Return the mean of the values.
median(): Return the median of the values.
std(): Return the standard deviation of the values.
describe(): Return the summary statistics for each column.
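
As a quick sketch, here are the functions applied to a Series holding the marks from the DataFrame example:

```python
import pandas as pd

marks = pd.Series([95, 70, 80, 60, 90])

print(marks.sum())       # 395
print(marks.count())     # 5
print(marks.max())       # 95
print(marks.min())       # 60
print(marks.mean())      # 79.0
print(marks.median())    # 80.0
print(marks.std())       # sample standard deviation
print(marks.describe())  # count, mean, std, min, quartiles, and max
```

On a DataFrame, the same functions are computed column by column.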

Plotting

To plot in Pandas, use the plot() method, which relies on the Matplotlib library. The pyplot module from Matplotlib is also used, and its pyplot.show() function displays the figure. The following plots can be created:

Histogram: To create a Histogram, set the kind argument of the plot() method to hist. For this, we only need a single column.
Pie Chart: Use the plot.pie() method to draw a Pie Chart
Scatter Plot: Set the kind argument of the plot() method to scatter. For this, we will also set the x-axis and y-axis
Area Plot: Use the plot.area() method to draw an Area Plot.
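
The four plot types can be sketched as below, assuming Matplotlib is installed. The data reuses the student records, and the Agg backend is selected only so that the snippet also runs without a display. Remove that line to open the figures in windows:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to show windows
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {'rank': [1, 4, 3, 5, 2], 'marks': [95, 70, 80, 60, 90]},
    index=["Amit", "John", "Jacob", "David", "Steve"]
)

# Histogram of a single column
hist_ax = df['marks'].plot(kind='hist')

# Pie chart of a single column, drawn on a fresh figure
plt.figure()
pie_ax = df['marks'].plot.pie()

# Scatter plot with explicit x-axis and y-axis columns
scatter_ax = df.plot(kind='scatter', x='rank', y='marks')

# Area plot of all numeric columns
area_ax = df.plot.area()

plt.show()  # displays the figures when an interactive backend is used
```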

Find and Remove Duplicates from rows in Pandas

Find Duplicates: To find duplicates from rows in a Pandas DataFrame or Series, use the duplicated() method.
Remove Duplicates: To remove duplicates from rows in a Pandas DataFrame or Series, use the drop_duplicates() method.
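
A minimal sketch of both methods on a DataFrame with one repeated row. The records are illustrative:

```python
import pandas as pd

# The third row repeats the first one
df = pd.DataFrame({
    'student': ["Amit", "John", "Amit", "David"],
    'marks': [95, 70, 95, 60]
})

mask = df.duplicated()          # True for rows that repeat an earlier row
deduped = df.drop_duplicates()  # keeps the first occurrence of each row

print(mask.tolist())            # [False, False, True, False]
print(deduped)
```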

Clean the Data

Cleaning the data in Pandas means finding and fixing incorrect data. This incorrect data can be empty cells, null values, duplicate data, etc. The following are the functions to clean the data:

isnull(): Return a Boolean object of the same shape, with True wherever a value is NULL (missing).
notnull(): Return a Boolean object of the same shape, with True wherever a value is not NULL.
df.dropna(): Drop rows with NULL values.
df.fillna(x): Replace NULL values with a specific value x.
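
These functions can be sketched on a small DataFrame with one missing mark. The data and the fill value 0 are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# One missing value in the marks column
df = pd.DataFrame({
    'student': ["Amit", "John", "Jacob"],
    'marks': [95, np.nan, 80]
})

print(df.isnull())     # True where a value is missing
print(df.notnull())    # True where a value is present

dropped = df.dropna()  # rows containing any missing value are removed
filled = df.fillna(0)  # missing values replaced with 0

print(dropped)
print(filled)
```

Both dropna() and fillna() return new DataFrames here, so df itself still contains the missing value.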


Studyopedia Editorial Staff

contact@studyopedia.com

We work to create programming tutorials for all.
