Python Pandas DataFrame drop_duplicates() - Remove Duplicates

Updated on December 31, 2024

Introduction

The drop_duplicates() method in Pandas is a vital tool when working with DataFrame objects, especially during data pre-processing. It simplifies identifying and removing duplicate records from a DataFrame, ensuring that the data you work with is unique and representative of real-world scenarios. Duplicate entries can skew results and lead to inaccurate analyses, which is why this method is essential for data scientists and analysts.

In this article, you will learn how to effectively employ the drop_duplicates() method in a variety of contexts. Explore how to remove duplicates from DataFrame columns, how to handle duplicates based on certain conditions, and how to utilize this function to maintain data integrity across different datasets.

Understanding drop_duplicates()

Basic Usage of drop_duplicates()

  1. Begin with a sample DataFrame that includes duplicate records.

  2. Apply the drop_duplicates() method without any additional parameters.

    python
    import pandas as pd
    
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'Age': [25, 30, 35, 25]
    }
    df = pd.DataFrame(data)
    unique_df = df.drop_duplicates()
    
    print(unique_df)
    

    The DataFrame df initially contains duplicate rows for 'Alice'. When drop_duplicates() is called, Pandas removes the duplicate row automatically, resulting in unique_df.
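    Before dropping anything, you can preview which rows Pandas will treat as duplicates with the companion duplicated() method, which returns a boolean mask. This is a quick sanity check, not a required step:

    ```python
    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'Age': [25, 30, 35, 25]
    }
    df = pd.DataFrame(data)

    # True marks rows that repeat an earlier, identical row
    mask = df.duplicated()
    print(mask.tolist())  # → [False, False, False, True]
    ```

    Here only the fourth row is flagged, because it repeats the first 'Alice' row in every column; drop_duplicates() removes exactly the rows this mask marks True.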

Drop Duplicates with Specific Columns

  1. Specify columns where duplicates need to be detected.

  2. Use the subset parameter of drop_duplicates() to focus on these columns.

    python
    unique_df = df.drop_duplicates(subset=['Name'])
    
    print(unique_df)
    

    In this snippet, duplicates are determined based on the 'Name' column alone, so only the first occurrence of 'Alice' is kept and her second entry is removed. With this particular data the result matches full-row deduplication, since both 'Alice' rows also share the same 'Age'; subset makes a difference when the other columns differ.
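    The effect of subset is easier to see with data where the repeated names carry different values in another column. This sketch uses hypothetical data, not the article's df:

    ```python
    import pandas as pd

    # Hypothetical data: the two 'Alice' rows have different ages
    df2 = pd.DataFrame({
        'Name': ['Alice', 'Bob', 'Alice'],
        'Age': [25, 30, 26]
    })

    # Full-row deduplication keeps both Alice rows (the ages differ)
    print(len(df2.drop_duplicates()))                  # → 3
    # Name-only deduplication keeps just the first Alice
    print(len(df2.drop_duplicates(subset=['Name'])))   # → 2
    ```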

Controlling Retention of Duplicates

  1. Decide which duplicate to keep using the keep parameter.

  2. Pass 'first', 'last', or False to control which occurrences are retained.

    python
    unique_df_first = df.drop_duplicates(subset=['Name'], keep='first')
    unique_df_last = df.drop_duplicates(subset=['Name'], keep='last')
    unique_df_none = df.drop_duplicates(subset=['Name'], keep=False)
    
    print(unique_df_first)
    print(unique_df_last)
    print(unique_df_none)
    

    The first option keeps the first occurrence, while last retains the last. Setting keep to False removes all duplicates, including the first/last occurrence, returning a DataFrame with completely unique rows based on the specified subset.
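    One way to see the three strategies side by side is to compare which row labels survive each call, using the same sample data as above:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'Age': [25, 30, 35, 25]
    })

    # Index labels of the rows that survive each keep strategy
    print(df.drop_duplicates(subset=['Name'], keep='first').index.tolist())  # → [0, 1, 2]
    print(df.drop_duplicates(subset=['Name'], keep='last').index.tolist())   # → [1, 2, 3]
    print(df.drop_duplicates(subset=['Name'], keep=False).index.tolist())    # → [1, 2]
    ```

    Note that drop_duplicates() preserves the original index labels of the kept rows; pass ignore_index=True if you want the result renumbered from 0.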

Advanced Usage of drop_duplicates()

Handling Duplicates in Multi-Index DataFrames

  1. Create a DataFrame with a hierarchical index (MultiIndex) that includes potential duplicates.

  2. Apply the drop_duplicates() method on a Multi-Index DataFrame.

    python
    arrays = [['a', 'b', 'b', 'a'], ['one', 'one', 'two', 'two']]
    index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
    data = {'Score': [20, 21, 20, 19]}
    df_multi = pd.DataFrame(data, index=index)
    unique_df_multi = df_multi.drop_duplicates()
    
    print(unique_df_multi)
    

    This example features a MultiIndex where duplicates are less obvious due to the layered structure. Note that drop_duplicates() compares only column values and ignores the index entirely: the ('b', 'two') row is dropped because its 'Score' of 20 repeats the value in the ('a', 'one') row, even though the two rows have different index labels.
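    Because drop_duplicates() only compares column values, one way to make the index levels participate in the comparison (a sketch, not the only approach) is to move them into columns with reset_index() before deduplicating:

    ```python
    import pandas as pd

    arrays = [['a', 'b', 'b', 'a'], ['one', 'one', 'two', 'two']]
    index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
    df_multi = pd.DataFrame({'Score': [20, 21, 20, 19]}, index=index)

    # Columns only: the repeated Score of 20 is dropped, leaving 3 rows
    print(len(df_multi.drop_duplicates()))  # → 3

    # Index levels included: every (Letter, Number, Score) combination
    # is unique, so all 4 rows survive
    deduped = (df_multi.reset_index()
               .drop_duplicates()
               .set_index(['Letter', 'Number']))
    print(len(deduped))  # → 4
    ```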

Conclusion

The drop_duplicates() function in Python's Pandas library is a powerful tool for removing duplicate entries from a DataFrame. Mastering this function allows you to maintain the accuracy of your data, which is crucial for reliable analysis. From basic to complex data structures, understanding how to harness this method enhances your proficiency in data manipulation and preprocessing tasks. Use the techniques discussed here to ensure that your DataFrames remain unique and representative, thus solidifying the foundation of your analytical projects.