05 Aug Pandas Cheat Sheet
This Pandas Cheat Sheet is a guide to working with Pandas, covering both basic and advanced topics. It is intended for students, engineers, and professionals.
Introduction
Pandas is a powerful and easy-to-use open-source tool built on top of the Python programming language. It is useful for data analysis and manipulation. Python with pandas is widely used in Statistics, Finance, Neuroscience, Economics, Web Analytics, Advertising, etc.
Features
The following are the features of the Pandas Library:
- Analyze Data
- Manipulate Data
- Group the rows/columns of a DataFrame/Series
- Plot the data
- Fix inaccurate data
- Clean the data
Installation
To install Pandas, use the PIP package manager. Install Python and PIP, and then use PIP to install the Pandas Python library:
    pip install pandas
DataFrames in Pandas
A Pandas DataFrame is a two-dimensional, tabular data structure with labeled rows and columns. The DataFrame() constructor is used to create one and has the following parameters:
- data: The data to be stored in the Pandas DataFrame
- index: The index values to be provided for the resultant frame.
- columns: The column labels for the resultant frame, used when the data does not already provide them
- dtype: The datatype to force on the data; only a single type is allowed.
- copy: Whether to copy the input data
Let us see how to create a Pandas DataFrame:
    import pandas as pd

    # Dataset
    data = {
        'student': ["Amit", "John", "Jacob", "David", "Steve"],
        'rank': [1, 4, 3, 5, 2],
        'marks': [95, 70, 80, 60, 90]
    }

    res = pd.DataFrame(data)

    print("Student Records\n\n",res)
Output
    Student Records

       student  rank  marks
    0     Amit     1     95
    1     John     4     70
    2    Jacob     3     80
    3    David     5     60
    4    Steve     2     90
DataFrame – Attributes and Methods
The following are commonly used attributes and methods of a DataFrame in Python Pandas:
- dtypes: Return the dtypes in the DataFrame
- ndim: Return the number of dimensions of the DataFrame
- size: Return the number of elements in the DataFrame.
- shape: Return the dimensionality of the DataFrame in the form of a tuple.
- index: Return the index of the DataFrame
- T: Transpose the rows and columns
- head(): Return the first n rows.
- tail(): Return the last n rows.
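Let us see a short sketch of these attributes and methods in use. It reuses the student records from the DataFrame example above; the variable name df is only illustrative:

```python
import pandas as pd

# Same student records as in the DataFrame example above
df = pd.DataFrame({
    "student": ["Amit", "John", "Jacob", "David", "Steve"],
    "rank": [1, 4, 3, 5, 2],
    "marks": [95, 70, 80, 60, 90],
})

print("Dimensions =", df.ndim)          # number of axes
print("Size =", df.size)                # rows * columns
print("Shape =", df.shape)              # (rows, columns)
print("Index =", df.index)              # row labels
print("\nDatatypes:\n", df.dtypes)      # dtype of each column
print("\nTransposed:\n", df.T)          # rows and columns swapped
print("\nFirst 2 rows:\n", df.head(2))  # head() defaults to 5 rows
print("\nLast 2 rows:\n", df.tail(2))   # tail() defaults to 5 rows
```

Note that dtypes, ndim, size, shape, index, and T are attributes and are used without parentheses, while head() and tail() are methods.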
Series in Pandas
A Series in Pandas is a one-dimensional labeled array, like a column in a table. It can hold data of any type. The Series() constructor is used to create one and has the following parameters:
- data: The data to be stored in the Pandas Series
- index: The index values should have the same length as the data.
- dtype: It is the datatype for the output Series.
- name: Set the series name with the name parameter
- copy: To copy the input data
Let us now see an example to create a Pandas Series:
    import pandas as pd

    # Data to be stored in the Pandas Series
    data = [10, 20, 40, 80, 100]

    # Create a Series using the Series() method
    s = pd.Series(data)

    # Display the Series
    print("Series: \n",s)
Output: The 0,1,2,3, etc. are the index numbers i.e. labels.
    Series:
    0     10
    1     20
    2     40
    3     80
    4    100
    dtype: int64
Series – Attributes and Methods
The following are commonly used attributes and methods of a Series in Python Pandas:
- dtype: Return the dtype.
- ndim: Return the Number of dimensions
- size: Return the number of elements.
- name: Return the name of the Series.
- hasnans: Returns True if NaNs are in the series.
- index: The index of the series
- head(): Return the first n rows.
- tail(): Return the last n rows.
- info(): Display the Summary of the series
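Let us see a minimal sketch of the Series attributes and methods. The Series name "scores" is an assumption made for this illustration only:

```python
import pandas as pd

# "scores" is an illustrative name, not part of the earlier example
s = pd.Series([10, 20, 40, 80, 100], name="scores")

print("Datatype =", s.dtype)
print("Dimensions =", s.ndim)
print("Size =", s.size)
print("Name =", s.name)
print("Has NaNs =", s.hasnans)
print("Index =", s.index)
print("\nFirst 3 values:\n", s.head(3))
print("\nLast 2 values:\n", s.tail(2))
```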
Categorical Data
It is a Pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited number of possible values. Examples are gender, blood type, country affiliation, rating, etc.
- Create Categorical Series: Use dtype="category" while creating a series to create a Categorical Series. Let us see an example:

    import pandas as pd

    # Creating a Categorical Series
    s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

    # Display the Series
    print("Series = \n", s)
Output
    Series =
    0    p
    1    q
    2    r
    3    s
    4    q
    dtype: category
    Categories (4, object): [p, q, r, s]

- Create Categorical DataFrame: Use dtype="category" while creating a DataFrame to create a Categorical DataFrame. Let us see an example. We have created three categorical columns here:
    import pandas as pd

    # Creating a Categorical DataFrame
    df = pd.DataFrame({"Cat1": list("pqrs"), "Cat2": list("pqrp"), "Cat3": list("qrrr")}, dtype="category")

    # Display the DataFrame
    print("DataFrame = \n", df)

    # Display the datatypes
    print("\nDataType of each column = \n", df.dtypes)
Output

    DataFrame =
       Cat1  Cat2  Cat3
    0     p     p     q
    1     q     q     r
    2     r     r     r
    3     s     p     r

    DataType of each column =
    Cat1    category
    Cat2    category
    Cat3    category
    dtype: object
Working with Categories
Learn how to work with Categories in Pandas:
- Append new categories: To append new categories, use the add_categories() method in Python Pandas. Let us see an example:
    import pandas as pd

    # Creating a Categorical Series
    s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

    # Display the Series
    print("Series = \n", s)

    # Append Category
    s = s.cat.add_categories([5])

    # Display the updated category
    print("\nUpdated Categories = ",s.cat.categories)
Output

    Series =
    0    p
    1    q
    2    r
    3    s
    4    q
    dtype: category
    Categories (4, object): [p, q, r, s]

    Updated Categories = Index(['p', 'q', 'r', 's', 5], dtype='object')

- Remove a category: To remove a category, use the remove_categories() method in Python Pandas. Let us see an example:
    import pandas as pd

    # Creating a Categorical Series
    s = pd.Series(["p", "q", "r", "s", "q"], dtype="category")

    # Display the Series
    print("Series\n", s)

    # Remove a Category
    # Display the updated category after removing a specific category
    print("\nUpdated Categories\n",s.cat.remove_categories("r"))
Output
    Series
    0    p
    1    q
    2    r
    3    s
    4    q
    dtype: category
    Categories (4, object): [p, q, r, s]

    Updated Categories
    0      p
    1      q
    2    NaN
    3      s
    4      q
    dtype: category
    Categories (3, object): [p, q, s]
Read CSV
The read_csv() method is used to read a CSV file in Pandas. Let’s say we have a CSV file Students.csv. We will read it now:
    import pandas as pd

    # Input CSV file
    # Loading CSV in the DataFrame
    df = pd.read_csv('Students.csv')

    # Display the CSV file records
    print("Our DataFrame =\n",df)
Output
    Our DataFrame =
       Student  Rank  Marks
    0     Amit     1     95
    1    Virat     2     90
    2    David     3     80
    3     Will     4     75
    4    Steve     5     65
String Operations on Text Data
The following string operations can be performed on text data in Pandas through the str accessor of a Series:
- lower(): Convert text data to lowercase
- upper(): Convert text data to uppercase
- title(): Convert text data to title case
- len(): Get the length of each element in the Series.
- count(): Count the occurrences of a pattern in each element
- contains(): Check whether each element contains a given value or pattern.
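Let us see a minimal sketch of these string operations. They are called through the str accessor of a Series, and the Pandas method for searching is spelled str.contains(). The sample names are made up for this illustration:

```python
import pandas as pd

# Illustrative text data with mixed casing
s = pd.Series(["amit kumar", "John Smith", "JACOB lee"])

lower = s.str.lower()               # all lowercase
upper = s.str.upper()               # all uppercase
title = s.str.title()               # first letter of each word capitalized
lengths = s.str.len()               # number of characters in each element
a_counts = s.str.count("a")         # occurrences of "a" in each element
has_john = s.str.contains("John")   # True where the element contains "John"

print("Title case:\n", title)
print("\nLengths:\n", lengths)
print("\nContains 'John':\n", has_john)
```

Both str.count() and str.contains() treat the argument as a regular expression by default, and the match is case-sensitive.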
Remove Whitespace
To remove whitespace on text data in a Series or DataFrame, use the following methods in Python Pandas:
- strip(): Strip whitespace from the left and right
- lstrip(): Strip whitespace from only the left side
- rstrip(): Strip whitespace from only the right side
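Let us see a short sketch of the three methods on a Series, again through the str accessor. The padded sample values are assumptions for illustration:

```python
import pandas as pd

# Illustrative values with leading and/or trailing spaces
s = pd.Series(["  Amit  ", " John", "Jacob "])

both = s.str.strip()    # remove whitespace on both sides
left = s.str.lstrip()   # remove whitespace on the left only
right = s.str.rstrip()  # remove whitespace on the right only

print(both.tolist())
print(left.tolist())
print(right.tolist())
```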
Sorting
Sort the DataFrame in Pandas using the sort_values() method:
- Sort the Pandas DataFrame: To sort the dataframe, use the sort_values() method. The default is ascending.
- Sort the Pandas DataFrame in Descending Order: To sort the dataframe in descending order, use the sort_values() method. Set the ascending parameter of the method to False for descending order sort.
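Let us see a minimal sketch of both sorts. It reuses the student records from the DataFrame example and sorts by the marks column, which is a choice made for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Amit", "John", "Jacob", "David", "Steve"],
    "rank": [1, 4, 3, 5, 2],
    "marks": [95, 70, 80, 60, 90],
})

# Ascending order is the default
asc = df.sort_values("marks")

# Descending order: set ascending to False
desc = df.sort_values("marks", ascending=False)

print("Ascending:\n", asc)
print("\nDescending:\n", desc)
```

The sort_values() method returns a new DataFrame and leaves the original unchanged, and the original index labels travel with their rows.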
Indexing
Indexing means selecting specific rows and columns of data from DataFrame. A DataFrame includes columns, index, and data. Let us see some examples:
- Indexing in Pandas using the indexing operator: We can directly use the [] i.e. the indexing operator in Pandas to retrieve records
- Indexing in Pandas using loc[]: To retrieve rows and columns by label, for example a single row, use the loc[] in Pandas.
- Indexing in Pandas using iloc[]: To retrieve the rows and columns by position, use the iloc[] in Pandas.
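Let us see a minimal sketch of the three approaches. It assumes a small DataFrame that uses the student names as index labels, which is set up only for this illustration:

```python
import pandas as pd

# Student names are used as row labels so that loc[] has labels to select by
df = pd.DataFrame(
    {"rank": [1, 4, 3], "marks": [95, 70, 80]},
    index=["Amit", "John", "Jacob"],
)

# Indexing operator: select a column by name
marks = df["marks"]

# loc[]: select by label
john = df.loc["John"]                 # a single row as a Series
john_marks = df.loc["John", "marks"]  # a single cell

# iloc[]: select by integer position
first_row = df.iloc[0]                # first row
cell = df.iloc[2, 1]                  # third row, second column
first_two = df.iloc[0:2]              # first two rows

print("Marks column:\n", marks)
print("\nRow for John:\n", john)
print("\nFirst two rows:\n", first_two)
```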
Group the Data
In Pandas, group data in a DataFrame and perform operations on it:
- Split the object and combine the result: The groupby() method splits a DataFrame into groups based on the values of one or more columns, so that an operation can be applied to each group and the results combined.
- Iterate the Group: Loop through the groups returned by groupby() with a for-in loop. Each iteration yields the group name and the rows in that group.
- View the Group: Use the groups property of the grouped object to view the groups.
- Perform Aggregation Operations on Groups: After grouping, use the agg() method to aggregate the grouped data, for example to get the mean or the size of each group:
- Get the mean of grouped data: First group the data, then call agg() with "mean" (numpy.mean is also accepted, though recent Pandas versions may emit a warning for it).
- Get the size of each group: Call size() on the grouped object, for example after grouping by a Player column with groupby().
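Let us see a minimal sketch of these grouping steps. The Player column follows the text above, while the Points column and the sample values are assumptions for illustration. The string "mean" is passed to agg() here in place of numpy.mean:

```python
import pandas as pd

# Hypothetical records: Player comes from the text, Points is made up
df = pd.DataFrame({
    "Player": ["Amit", "John", "Amit", "John", "Steve"],
    "Points": [10, 20, 30, 40, 50],
})

# Split the rows into groups by the Player column
grouped = df.groupby("Player")

# Iterate the groups: each step yields the group name and its rows
for name, group in grouped:
    print(name)
    print(group, "\n")

# View the groups: maps each group name to its row labels
print(grouped.groups)

# Mean of the Points column for each group
means = grouped["Points"].agg("mean")

# Number of rows in each group
sizes = grouped.size()

print("\nMean:\n", means)
print("\nSize:\n", sizes)
```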
Statistical Functions
The statistical functions in Python Pandas make common statistics operations straightforward. They can be applied to a Series or DataFrame:
- sum(): Return the sum of the values.
- count(): Return the count of non-empty values.
- max(): Return the maximum of the values.
- min(): Return the minimum of the values.
- mean(): Return the mean of the values.
- median(): Return the median of the values.
- std(): Return the standard deviation of the values.
- describe(): Return the summary statistics for each column.
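Let us see a short sketch of these functions on a Series. It uses the marks values from the DataFrame example above:

```python
import pandas as pd

marks = pd.Series([95, 70, 80, 60, 90])

total = marks.sum()
count = marks.count()
highest = marks.max()
lowest = marks.min()
mean = marks.mean()
median = marks.median()
std = marks.std()           # sample standard deviation by default
summary = marks.describe()  # count, mean, std, min, quartiles, max

print("Sum =", total)
print("Count =", count)
print("Max =", highest)
print("Min =", lowest)
print("Mean =", mean)
print("Median =", median)
print("Standard deviation =", std)
print("\nSummary:\n", summary)
```

On a DataFrame, the same functions work column by column.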
Plotting
To plot in Pandas, use the plot() method together with the Matplotlib library. The pyplot module from Matplotlib is also used, and pyplot.show() displays the figure. The following plots can be created:
- Histogram: To create a Histogram, set the kind argument of the plot() method to hist. For this, we only need a single column.
- Pie Chart: Use the plot.pie() method to draw a Pie Chart
- Scatter Plot: Set the kind argument of the plot() method to scatter. For this, we will also set the x-axis and y-axis
- Area Plot: Use the plot.area() method to draw an Area Plot.
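Let us see a minimal sketch that draws all four plots on one figure. It reuses the student records from the DataFrame example. The choice of columns for each plot, the 2x2 layout, and the non-interactive Agg backend are assumptions made so that the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, remove to open a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "student": ["Amit", "John", "Jacob", "David", "Steve"],
    "rank": [1, 4, 3, 5, 2],
    "marks": [95, 70, 80, 60, 90],
})

# One figure with four axes, one per plot
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: a single column is enough
df["marks"].plot(kind="hist", ax=axes[0, 0], title="Histogram")

# Pie chart: one wedge per student, labeled by the index
df.set_index("student")["marks"].plot.pie(ax=axes[0, 1], title="Pie Chart")

# Scatter plot: both the x-axis and y-axis columns are required
df.plot(kind="scatter", x="rank", y="marks", ax=axes[1, 0], title="Scatter Plot")

# Area plot: numeric columns stacked on top of each other
df[["rank", "marks"]].plot.area(ax=axes[1, 1], title="Area Plot")

fig.tight_layout()
# In an interactive session, plt.show() would display the figure here
```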
Find and Remove Duplicates from rows in Pandas
- Find Duplicates: To find duplicates from rows in a Pandas DataFrame or Series, use the duplicated() method.
- Remove Duplicates: To remove duplicates from rows in a Pandas DataFrame or Series, use the drop_duplicates() method.
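Let us see a minimal sketch of both methods. The sample records, with one repeated row, are made up for this illustration:

```python
import pandas as pd

# The third row repeats the first one
df = pd.DataFrame({
    "student": ["Amit", "John", "Amit", "Steve"],
    "marks": [95, 70, 95, 90],
})

# True for every row that repeats an earlier row
dup_mask = df.duplicated()

# Keep only the first occurrence of each row
deduped = df.drop_duplicates()

print("Duplicated rows:\n", dup_mask)
print("\nAfter removing duplicates:\n", deduped)
```

By default, the first occurrence is kept and later repeats are marked as duplicates.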
Clean the Data
Cleaning the data in Pandas means finding and fixing incorrect data. This incorrect data can be empty cells, null values, duplicate data, etc. The following are the functions to clean the data:
- isnull(): Return a Boolean mask that is True wherever a value is NULL.
- notnull(): Return a Boolean mask that is True wherever a value is not NULL.
- df.dropna(): Drop rows with NULL values.
- df.fillna(x): Replace NULL values with a specific value
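Let us see a minimal sketch of these functions. The sample records, with one missing mark, and the replacement value 0 are assumptions for illustration:

```python
import pandas as pd

# None becomes NaN (a NULL value) in the numeric marks column
df = pd.DataFrame({
    "student": ["Amit", "John", "Jacob"],
    "marks": [95, None, 80],
})

null_mask = df.isnull()       # True where a value is NULL
not_null_mask = df.notnull()  # True where a value is not NULL
dropped = df.dropna()         # rows containing a NULL value are removed
filled = df.fillna(0)         # NULL values are replaced with 0

print("NULL mask:\n", null_mask)
print("\nAfter dropna():\n", dropped)
print("\nAfter fillna(0):\n", filled)
```

Both dropna() and fillna() return a new DataFrame, so the original df still contains the NULL value.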