Python Pandas DataFrame duplicated() - Identify Duplicates

Updated on December 31, 2024

Introduction

The duplicated() method in Python's Pandas library is a handy tool for identifying duplicate rows in a DataFrame. It simplifies data cleaning, especially when working with large datasets. Detecting duplicates is typically one of the first data preprocessing steps, helping ensure the quality and reliability of your data before any analysis.

In this article, you will learn how to use the duplicated() method effectively through various examples. You will explore how to identify duplicate records in a Pandas DataFrame based on different criteria: all columns, specific columns, and whether to flag every occurrence of a duplicate or only the repeats after the first.

Identifying Duplicate Rows in a DataFrame

Basic Usage of duplicated()

  1. Import the Pandas library and create a sample DataFrame.

  2. Use duplicated() to find duplicate rows based on all columns.

    python
    import pandas as pd
    data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Charlie'],
            'Age': [25, 30, 35, 30, 35],
            'City': ['New York', 'Boston', 'San Francisco', 'Boston', 'San Francisco']}
    df = pd.DataFrame(data)
    duplicate_rows = df.duplicated()
    print(duplicate_rows)
    

    This code snippet creates a DataFrame from a dictionary of lists and then applies the duplicated() function. The result is a Boolean Series indicating whether each row is a duplicate of a row that occurred earlier in the DataFrame.
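
    For this sample DataFrame, duplicated() returns False for the first three rows and True for rows 3 and 4, which repeat the 'Bob' and 'Charlie' rows. As a small follow-up sketch, continuing with the df and duplicate_rows defined above, the Boolean Series can be used directly as a mask:

    python
    # Show only the rows flagged as duplicates
    print(df[duplicate_rows])

    # Keep only the first occurrence of each row
    print(df[~duplicate_rows])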

Specifying Columns for Duplicate Check

  1. Focus on specific columns when determining whether rows are duplicates.

  2. Utilize the subset parameter in duplicated().

    python
    duplicate_specific = df.duplicated(subset=['Name', 'City'])
    print(duplicate_specific)
    

    By specifying the subset parameter, the method checks for duplicates considering only the specified columns. In the above example, duplicates are identified based on 'Name' and 'City' columns.
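
    In this sample data the duplicate rows match on every column, so the subset check flags the same rows as before. To see where subset makes a difference, consider a hypothetical variation (not part of the original example) in which the same person appears with a different age; a full-row check does not flag the repeat, but a subset check does:

    python
    import pandas as pd

    # Hypothetical variation: same Name and City, but different Age
    df2 = pd.DataFrame({'Name': ['Alice', 'Alice'],
                        'Age': [25, 26],
                        'City': ['New York', 'New York']})

    print(df2.duplicated())                         # both False: the rows differ in Age
    print(df2.duplicated(subset=['Name', 'City']))  # second row is True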

Handling First Occurrence and All Duplicates

  1. Control which occurrences of duplicates are flagged using the keep parameter.

  2. Set keep to different values to modify the behavior.

    python
    # Mark every occurrence of a duplicated row as True, including the first
    all_dupes = df.duplicated(keep=False)
    print(all_dupes)
    
    # Mark duplicates as True except for the last occurrence
    last_not_dupe = df.duplicated(keep='last')
    print(last_not_dupe)
    

    When keep=False is passed (the Boolean, not the string 'False'), every occurrence of a duplicated row is marked as True, including the first. Setting keep='last' marks duplicates as True except for the last occurrence, while the default, keep='first', marks every occurrence except the first.
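
    Either mask can be used to filter the DataFrame directly. Below is a minimal sketch, continuing with the df defined earlier; drop_duplicates() is the built-in shortcut for the same keep-based filtering:

    python
    # Keep only the last occurrence of each duplicated row
    deduped = df[~df.duplicated(keep='last')]
    print(deduped)

    # drop_duplicates() accepts the same keep parameter and is equivalent
    print(df.drop_duplicates(keep='last'))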

Using duplicated() with Real-World Data

Example: Data Processing for Sales Records

  1. Consider a dataset of sales records with potential duplicate entries.

  2. Apply the duplicated() function to clean the data.

    python
    sales_data = {
        'OrderID': [101, 102, 103, 104, 101],
        'Product': ['Widget', 'Gadget', 'Doohickey', 'Thingamabob', 'Widget'],
        'Quantity': [1, 2, 1, 3, 1]
    }
    
    sales_df = pd.DataFrame(sales_data)
    print("Original DataFrame:\n", sales_df)
    print("\nDuplicates (considering OrderID and Product):\n", sales_df.duplicated(subset=['OrderID', 'Product']))
    

    This script flags the repeated order (OrderID 101 for the Widget) based on the 'OrderID' and 'Product' columns, so repeated entries can be caught before further data analysis.
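
    Once the duplicates are identified, they can be removed before analysis. A minimal sketch, continuing with sales_df from above and using the built-in drop_duplicates() with the same subset:

    python
    # Drop repeated orders, keeping the first occurrence of each
    clean_sales = sales_df.drop_duplicates(subset=['OrderID', 'Product'], keep='first')
    print(clean_sales)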

Conclusion

The duplicated() function from Pandas provides a straightforward approach for identifying and managing duplicate entries in DataFrames. Understanding and utilizing this function is crucial for data cleansing, preparation, and ensuring the integrity of your dataset. By following the examples and techniques presented, you enable better data handling and pave the way for accurate data-driven decisions. Always consider how duplicate data may impact your analyses and use duplicated() judiciously to maintain a clean dataset.