Pandas – Cleaning the Data

To clean the data in Python, we have some built-in functions. In this lesson, we will understand them one by one with examples. Cleaning the data in Pandas means working on the incorrect data to fix it. This incorrect data can empty data, null, duplicate data, etc.


Read: Handle Duplicate Data in Pandas


Let’s say we have the following CSV file demo.csv. The data consists of some null values:

Demo CSV file in Pandas to clean the data

Clean the data

Let us now work around the functions to clean the data:

  1. isnull(): Find the NULL values and replace them with True.
  2. notnull(): Find the NOT NULL values and replace them with True.
  3. df.dropna(): Drop rows with NULL values.
  4. df.fillna(x): Replace NULL values with a specific value

1. Pandas isnull() method

The isnull() method in Pandas is used to find the NULL values and replace them with True. For non-NULL values, False is returned. Let us see an example:

import pandas as pd

# Input CSV file
df = pd.read_csv(r"C:\Users\hp\Desktop\demo.csv")

# Display the CSV file records
print("Our DataFrame\n",df)

# Find and Replace Null with True
resdf = df.isnull()

# Return the new DataFrame
print("\nNew DataFrame \n",resdf.to_string())

Output

Our DataFrame
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
2        6.1     NaN
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
6        4.5     NaN
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

New DataFrame
    Frequency  Points
0      False   False
1      False   False
2      False    True
3      False   False
4      False   False
5      False   False
6      False    True
7      False   False
8      False   False
9      False   False

2. Pandas notnull() method

The notnull() method in Pandas is used to find the NOT NULL values and replace them with True. For NULL values, False is returned. Let us see an example:

import pandas as pd

# Input CSV file
df = pd.read_csv(r"C:\Users\hp\Desktop\demo.csv")

# Display the CSV file records
print("Our DataFrame\n",df)

# Find and Replace NOT NULL values with True
resdf = df.notnull()

# Return the new DataFrame
print("\nNew DataFrame\n",resdf.to_string())

Output

Our DataFrame
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
2        6.1     NaN
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
6        4.5     NaN
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

New DataFrame
    Frequency  Points
0       True    True
1       True    True
2       True   False
3       True    True
4       True    True
5       True    True
6       True   False
7       True    True
8       True    True
9       True    True

3. Pandas dropna() method

The dropna() method in Pandas removes rows with null values. Let us see an example:

import pandas as pd

# Input CSV file
df = pd.read_csv(r"C:\Users\hp\Desktop\demo.csv")

# Display the CSV file records
print("Our DataFrame\n",df)

# Find and remove rows with NULL value
resdf = df.dropna()

# Return the new DataFrame
print("\nNew DataFrame (after removing rows with NULL)\n",resdf.to_string())

Output

Our DataFrame
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
2        6.1     NaN
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
6        4.5     NaN
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

New DataFrame (after removing rows with NULL)
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

4. Pandas fillna() method

The fillna() method in Pandas is used to replace NULL values with a specific value. Let us see an example:

import pandas as pd

# Input CSV file
df = pd.read_csv(r"C:\Users\hp\Desktop\demo.csv")

# Display the CSV file records
print("Our DataFrame\n",df)

# Find and replace NULL values with a specific value 111
resdf = df.fillna(111)

# Return the new DataFrame
print("\nNew DataFrame (after replacing NULL with a specific value)\n",resdf.to_string())

Output

Our DataFrame
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
2        6.1     NaN
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
6        4.5     NaN
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

New DataFrame (after replacing NULL with a specific value)
    Frequency  Points
0        2.4    83.5
1        3.2    21.6
2        6.1   111.0
3        1.2    45.9
4        2.9    19.3
5        3.8    23.9
6        4.5   111.0
7        8.3    66.3
8        7.9    74.7
9        5.8    67.5

Techniques for imputing or masking data

Here are specific techniques for imputing or masking data, which allow you to handle missing values without simply deleting them.

Imputation is the process of replacing missing or NULL values with “educated guesses” to keep your dataset complete. It is a critical step in data cleaning because many machine learning models cannot process data with gaps.

    • Goal: To improve data quality and prevent the loss of valuable information that would occur if you simply deleted every row with a missing value.
    • Common Techniques:
        • Mean/Median/Mode: Filling gaps with the average or most frequent value.
        • Forward/Backward Fill: Using the previous or next available value, often used in time-series data.
        • K-Nearest Neighbors (KNN): Using similar data points to predict the missing value.

Data Masking (Protecting Sensitive Data) is a security technique that hides or disguises real data (like credit card numbers or names) with realistic-looking but fake data. This allows developers and analysts to work with the data without seeing private information.
  • Goal: To protect Personally Identifiable Information (PII) and comply with privacy laws like GDPR or HIPAA.
  • Common Techniques:
    • Substitution: Replacing real names with random fake names from a list.
    • Shuffling: Swapping values between rows (e.g., mixing up all phone numbers in a column so they no longer match the correct person).
    • Redaction (Masking Out): Replacing characters with symbols, such as showing only the last four digits of a credit card: XXXX-XXXX-XXXX-1234

Let us now see some examples:

  1. Find the NULL values and replace only NULL with True
  2. Replace NULL values with a specific string value
  3. Fill every numeric column’s NaN values with its respective column mean
  4. Fill missing values in only one column with its mean
  5. Fix the date formats

Our data is now an Excel file with unclean data. The file name path is the following:

C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx

Here is the screenshot of the Excel file. We have marked some NULL values from the unclean excel file:

Unclean data to clean

Let us now see the examples one by one:

5. Find the NULL values and replace only NULL with True

We will find only the NULL values and replace only NULL with TRUE using where(). Let us see an example:

# Find the NULL values and replace only NULL with True

import pandas as pd

# input excel file
# Load the excel in the DataFrame
# df = pd.read_excel('C:\\Users\\hp\\Desktop\\Unclean_Superstore_Data.xlsx')
df = pd.read_excel(r'C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx')

print("Our DataFrame =\n",df)

# Find the null values and replace it with TRUE
res = df.isnull().where(df.isnull(), df)

print(res.to_string())

# To export the result in a new excel file, use the toexcel() method
# The to_excel() will write the updated dataframe into the excel
res.to_excel(r'C:\Users\hp\Desktop\Demo.xlsx')

print("\n-- Resultant DataFrame exported to Demo.xlsx excel file")

Output

Replacing true for null values

Above, we have marked the cells replaced with True because those were NULL values. The dataframe output was exported to Demo.xlsx using the to_excel() method in Pandas. Therefore, a new excel file Demo.xlsx generated as output.

6. Replace NULL values with a specific string value

This is useful for categorical columns where you want to label missing data as “Missing” or “Unknown” instead of leaving it blank. Follow the syntax:

df.fillna("Your String Here

Let us now see an example:

# Replace NULL values with a specific string value

import pandas as pd

# input excel file
# Load the excel in the DataFrame
# df = pd.read_excel('C:\\Users\\hp\\Desktop\\Unclean_Superstore_Data.xlsx')
df = pd.read_excel(r'C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx')

print("Our DataFrame =\n",df)

# Find and replace NULL values with a specific value "Missing"
res = df.fillna("Missing")

# To export the result in a new excel file, use the toexcel() method
# The to_excel() will write the updated dataframe into the excel
res.to_excel(r'C:\Users\hp\Desktop\Demo2.xlsx')

print("\n-- Resultant DataFrame exported to Demo2.xlsx excel file")

Output

Replace NULL values with a specific string

Above, we have marked the cells replaced with “b” because those were NULL values. The dataframe output was exported to Demo2.xlsx using the to_excel() method in Pandas. Therefore, a new excel file Demo2.xlsx generated as output.

7. Fill every numeric column’s NaN values with its respective column mean

This is a common statistical fix. It calculates the average for each numeric column and fills its specific gaps with that average. We will use the numeric_only value of the mean():

Let us see an example:

# Fill every numeric column's NaN values with its respective column mean

import pandas as pd

# input excel file
# Load the excel in the DataFrame
# df = pd.read_excel('C:\\Users\\hp\\Desktop\\Unclean_Superstore_Data.xlsx')
df = pd.read_excel(r'C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx')

print("Our DataFrame =\n",df)

# Fill every numeric column with its respective column mean
res = df.fillna(df.mean(numeric_only=True))

# To export the result in a new excel file, use the toexcel() method
# The to_excel() will write the updated dataframe into the excel
res.to_excel(r'C:\Users\hp\Desktop\Demo3.xlsx')

print("\n-- Resultant DataFrame exported to Demo3.xlsx excel file")

Output

Fill every numeric column's NaN values with its respective column mean

Above, we have marked the cells replaced with the mean of their respective columns because those were NULL values (in a numeric column). The dataframe output was exported to Demo3.xlsx using the to_excel() method in Pandas. Therefore, a new Excel file, Demo3.xlsx, was generated as output.

Only one numeric column, Sales, had missing values. The above would have implemented the same for any other numeric columns with missing values.

8. Fill missing values in only one column with its mean

Calculate the average for a specific numeric column and fill its specific gaps with that average. Above, we saw how to calculate the average for each numeric column and fill its specific gaps with that average.

# Fill missing values in only one column with its mean

import pandas as pd

# input excel file
# Load the excel in the DataFrame
# df = pd.read_excel('C:\\Users\\hp\\Desktop\\Unclean_Superstore_Data.xlsx')
df = pd.read_excel(r'C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx')

print("Our DataFrame =\n",df)

# Fill missing values in only one column with its mean
res = df['Sales'].fillna(df['Sales'].mean())
print("Our DataFrame =\n",res)

# To export the result in a new excel file, use the toexcel() method
# The to_excel() will write the updated dataframe into the excel
res.to_excel(r'C:\Users\hp\Desktop\Demo4.xlsx')

print("\n-- Resultant DataFrame exported to Demo4.xlsx excel file")

Output

Fill missing values in only one column with its mean

Above, we have marked the cells replaced with the mean of the Sales column because those were NULL values (in only a single column). The dataframe output was exported to Demo4.xlsx using the to_excel() method in Pandas. Therefore, a new Excel file, Demo3.xlsx, was generated as output.

Use this when you only want to fix one specific column (like “Sales“) while leaving others untouched.

9. Fix the date formats

In Excel, dates are often stored as numbers (e.g., 44564). Excel stores dates as the number of days elapsed since January 1, 1990. Therefore, the number 44564, represents the 44564th day after January 1, 1900.

We need to convert these into a readable YYYY-MM-DD format. In our Unclean_Superstore_Data.xlsx, the date formats are stored as numbers, as shown below:

Incorrect data format in an excel

Let’s see how to fix:

# Fix date formats

import pandas as pd

# input excel file
# Load the excel in the DataFrame
# df = pd.read_excel('C:\\Users\\hp\\Desktop\\Unclean_Superstore_Data.xlsx')
df = pd.read_excel(r'C:\Users\hp\Desktop\Unclean_Superstore_Data.xlsx')

print("Our DataFrame =\n",df)

# Convert Excel serial numbers to datetime objects
# The origin '1899-12-30' is the standard for Excel-to-Pandas conversion
df['Order Date'] = pd.to_datetime(df['Order Date'], unit='D', origin='1899-12-30')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], unit='D', origin='1899-12-30')

print("Our Updated DataFrame =\n",df)

# To export the result in a new excel file, use the to_excel() method
# The to_excel() will write the updated dataframe into the excel
df.to_excel(r'C:\Users\hp\Desktop\Demo5.xlsx')

print("\n-- Resultant DataFrame exported to Demo5.xlsx excel file")

Output

The updated dataframe gets exported in Demo5.xlsx because we have used the dataframe.to_excel():
Fixed the dates in an excel with Pandas


If you liked the tutorial, spread the word and share the link and our website, Studyopedia, with others:


For Videos, Join Our YouTube Channel: Join Now


Read More:

Indexing in Pandas
Pandas - Group the Data
Studyopedia Editorial Staff
contact@studyopedia.com

We work to create programming tutorials for all.

No Comments

Post A Comment