The drop() function in pandas is an essential tool for data manipulation, allowing you to efficiently remove rows or columns from a DataFrame. This capability is crucial when preparing data for analysis, where you might need to exclude irrelevant, missing, or duplicate data to improve the quality of your data set.
In this article, you will learn how to use the drop() function in pandas. You'll explore several scenarios, including dropping multiple columns, dropping rows by index, and deleting rows based on data values. This guide will help you use this function to streamline your data preprocessing workflows.
Start with a DataFrame containing several columns.
Specify the column names you want to drop.
import pandas as pd
data = {
    'Name': ['John', 'Ana', 'Peter', 'Linda'],
    'Age': [28, 22, 34, 42],
    'City': ['New York', 'Los Angeles', 'Berlin', 'London'],
    'Occupation': ['Engineer', 'Artist', 'Doctor', 'Lawyer']
}
df = pd.DataFrame(data)
df = df.drop(columns=['Age', 'City'])
print(df)
This snippet creates a DataFrame and then uses drop() to eliminate the 'Age' and 'City' columns. The result contains only the 'Name' and 'Occupation' columns.
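When the list of columns is built dynamically, some names may not exist in the DataFrame, and drop() raises a KeyError by default. The following is a minimal sketch, assuming the same sample data as above, that uses errors='ignore' to skip missing labels. The 'Salary' column is a hypothetical name that is deliberately absent.

```python
import pandas as pd

data = {
    'Name': ['John', 'Ana', 'Peter', 'Linda'],
    'Age': [28, 22, 34, 42],
    'City': ['New York', 'Los Angeles', 'Berlin', 'London'],
    'Occupation': ['Engineer', 'Artist', 'Doctor', 'Lawyer'],
}
df = pd.DataFrame(data)

# 'Salary' is a hypothetical column that does not exist in df.
# errors='ignore' skips missing labels instead of raising KeyError.
trimmed = df.drop(columns=['Age', 'Salary'], errors='ignore')
print(list(trimmed.columns))  # ['Name', 'City', 'Occupation']
```

Silently ignoring missing labels can hide typos in column names, so the default behavior of raising an error is usually the safer choice when the column list is fixed.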
Understand that the axis parameter specifies whether you're dropping labels from the index (0 or 'index') or columns (1 or 'columns').
Apply the parameter to drop a single column.
df = pd.DataFrame(data)
df = df.drop('Occupation', axis=1)
print(df)
By setting axis=1, you tell drop() to look for 'Occupation' among the column labels and remove that column from the DataFrame.
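To make the role of axis concrete, the sketch below compares axis=1 with the equivalent columns= keyword and shows what happens when the axis is left at its default of 0. The small DataFrame is a shortened version of the sample data, used here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Ana'],
    'Age': [28, 22],
    'Occupation': ['Engineer', 'Artist'],
})

# These two calls are equivalent ways to drop a column.
by_axis = df.drop('Occupation', axis=1)
by_keyword = df.drop(columns='Occupation')
print(by_axis.equals(by_keyword))  # True

# With the default axis=0, drop() searches the row index for the label,
# so a column name is not found and a KeyError is raised.
try:
    df.drop('Occupation')
    wrong_axis_failed = False
except KeyError:
    wrong_axis_failed = True
print(wrong_axis_failed)  # True
```

The columns= and index= keywords make the intent explicit, which tends to be easier to read than a bare axis number.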
Identify the indices of the rows you wish to remove from your DataFrame.
Use the drop() function to remove these rows.
df = pd.DataFrame(data)
df = df.drop([0, 1])
print(df)
Here, the rows with index labels 0 and 1 (John and Ana) are removed, leaving only the rows for Peter and Linda.
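Because drop() matches index labels rather than positions, the same call works with a non-integer index. The sketch below assumes a DataFrame whose index holds the names (an illustrative variation on the sample data) and also shows how reset_index(drop=True) renumbers the rows that remain after a drop.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [28, 22, 34, 42]},
    index=['John', 'Ana', 'Peter', 'Linda'],  # names used as row labels
)

# drop() matches index labels, so string labels work like integer ones.
remaining = df.drop(index=['John', 'Ana'])
print(list(remaining.index))  # ['Peter', 'Linda']

# With a default integer index, surviving rows keep their old labels;
# reset_index(drop=True) renumbers them starting from 0.
numbered = pd.DataFrame({'Age': [28, 22, 34, 42]}).drop([0, 1])
renumbered = numbered.reset_index(drop=True)
print(list(numbered.index), list(renumbered.index))  # [2, 3] [0, 1]
```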
Drop rows based on a condition applied to the DataFrame.
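Use boolean indexing to select the rows that match the condition, then pass their index labels to drop().
df = pd.DataFrame(data)
# Select rows where Age is 30 or less and drop them by index label
df = df.drop(df[df['Age'] <= 30].index)
print(df)
This removes the rows where 'Age' is 30 or less. The boolean mask selects the matching rows, and their .index attribute supplies the labels that drop() needs. Note that drop() works on index labels, not column values: passing names such as 'Peter' or 'Linda' raises a KeyError here because the index holds the integers 0 to 3, not the names.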
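Conditional deletion can be sketched in two equivalent ways: passing the index of the matching rows to drop(), or keeping the complement of the condition with a boolean mask. The example assumes a two-column subset of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Ana', 'Peter', 'Linda'],
    'Age': [28, 22, 34, 42],
})

# Option 1: collect the index labels of rows to remove, then drop them.
to_remove = df[df['Age'] <= 30].index
dropped = df.drop(to_remove)

# Option 2: keep only the rows that do not meet the removal condition.
filtered = df[df['Age'] > 30]

print(dropped['Name'].tolist())  # ['Peter', 'Linda']
print(dropped.equals(filtered))  # True
```

The mask-only form is shorter, while the drop() form reads more naturally when the labels to remove come from somewhere else, such as a separate list of flagged rows.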
Realize that the inplace parameter determines whether drop() returns a new DataFrame or modifies the existing one.
Use inplace=True to alter the DataFrame directly.
df = pd.DataFrame(data)
df.drop('City', axis=1, inplace=True)
print(df)
Setting inplace=True alters the original df by removing the 'City' column without needing to reassign the DataFrame.
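The difference between the two modes can be sketched as follows, assuming a small two-column DataFrame for illustration. The key detail is that a call with inplace=True returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Ana'],
    'City': ['New York', 'Los Angeles'],
})

# Without inplace, drop() returns a new DataFrame and leaves df untouched.
copy_without_city = df.drop(columns='City')
print(list(df.columns), list(copy_without_city.columns))
# ['Name', 'City'] ['Name']

# With inplace=True, df itself changes and the call returns None,
# so the result must not be assigned back to df.
result = df.drop(columns='City', inplace=True)
print(result, list(df.columns))  # None ['Name']
```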
Failing to set the axis parameter correctly can result in targeting the wrong data dimension.
Mastering the drop() function in pandas sharpens your data cleaning skills significantly. Whether you are removing unneeded columns, filtering out rows by specific criteria, or managing DataFrame dimensions dynamically, knowing how to use drop() effectively helps you maintain clean and relevant data sets. Implement these strategies in your data processing tasks to boost efficiency and clarity in your data analysis projects.