Python Pandas merge() - Merge DataFrames

Updated on December 27, 2024

Introduction

The merge() function in the pandas library combines two DataFrames by matching rows on one or more key columns, much like a SQL join. It is the standard tool for relational data, where related information is spread across several tables that share a key.

In this article, you will learn how to use the merge() function to join two DataFrames. You will work through the inner, outer, left, and right join types, and learn how to specify which columns to join on, including joins on multiple keys and on columns whose names differ between the two DataFrames.

Understanding Basic DataFrame Merging

Performing an Inner Join

  1. Create two DataFrames with common and unique columns.

  2. Merge them using the merge() function by specifying the type of join.

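    ```python
    import pandas as pd

    # Both frames share a 'key' column; only 'B' and 'D' appear in both.
    df1 = pd.DataFrame({
        'key': ['A', 'B', 'C', 'D'],
        'value_df1': [1, 2, 3, 5]
    })

    df2 = pd.DataFrame({
        'key': ['B', 'D', 'E', 'F'],
        'value_df2': [2, 4, 6, 8]
    })

    # Inner join: keep only the rows whose 'key' exists in both frames.
    merged_inner = pd.merge(df1, df2, on='key', how='inner')
    print(merged_inner)
    ```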

    This code merges df1 and df2 on the shared 'key' column with an inner join. The result is a new DataFrame that keeps only the rows whose 'key' value appears in both DataFrames, here 'B' and 'D', with value_df1 and value_df2 side by side.
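
As a quick check of the rule described above, the following sketch compares the keys of the inner join with the set intersection of the two 'key' columns. The name shared_keys is introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value_df1': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value_df2': [2, 4, 6, 8]})

merged_inner = pd.merge(df1, df2, on='key', how='inner')

# Keys present in both frames (illustrative helper variable).
shared_keys = sorted(set(df1['key']) & set(df2['key']))

print(shared_keys)                    # keys found in both frames
print(merged_inner['key'].tolist())   # keys kept by the inner join
```

Both lists should contain the same keys, since an inner join on a single column with unique key values keeps exactly the keys found in both DataFrames.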

Performing Outer, Left, and Right Joins

  1. Apply different types of joins to understand their behavior.

  2. Adjust the how parameter in the merge() function to change the join type.

    ```python
    # Reuses df1 and df2 from the inner join example.
    merged_outer = pd.merge(df1, df2, on='key', how='outer')  # all keys from both frames
    merged_left = pd.merge(df1, df2, on='key', how='left')    # all keys from df1
    merged_right = pd.merge(df1, df2, on='key', how='right')  # all keys from df2

    print("Outer Join:\n", merged_outer)
    print("Left Join:\n", merged_left)
    print("Right Join:\n", merged_right)
    ```

    • Outer join: Keeps every row from both DataFrames and fills the columns from the other side with NaN where a key has no match.
    • Left join: Keeps every row from df1 and adds the matching rows from df2; keys found only in df1 get NaN in the df2 columns.
    • Right join: Keeps every row from df2 and adds the matching rows from df1; keys found only in df2 get NaN in the df1 columns.
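
To see which side each row of an outer join came from, one option is the indicator parameter of merge(), which adds a '_merge' column. The sketch below rebuilds df1 and df2 so that it runs on its own; it is one way to inspect the result, not the only one.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value_df1': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value_df2': [2, 4, 6, 8]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
merged_outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged_outer)

# Count how many rows came from each side.
print(merged_outer['_merge'].value_counts())

# NaN cannot be stored in an integer column, so the value columns become float64.
print(merged_outer.dtypes)
```

The dtype change is worth keeping in mind: integer columns that pick up NaN during a join are converted to floating point.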

Advanced Merging Techniques

Specifying Multiple Columns to Join On

  1. Create DataFrames with multiple common columns.

  2. Use merge() by specifying a list of columns.

    ```python
    import pandas as pd

    df3 = pd.DataFrame({
        'key1': ['A', 'B', 'C'],
        'key2': ['X', 'Y', 'Z'],
        'value_df3': [1, 2, 3]
    })

    df4 = pd.DataFrame({
        'key1': ['B', 'C', 'D'],
        'key2': ['Y', 'Z', 'X'],
        'value_df4': [4, 5, 6]
    })

    # A row matches only if both key1 and key2 are equal in df3 and df4.
    merged_multiple_keys = pd.merge(df3, df4, on=['key1', 'key2'], how='inner')
    print(merged_multiple_keys)
    ```

    Here, df3 and df4 are merged on two columns, key1 and key2. A row is kept only when both values match in both DataFrames, so the inner join returns the rows for ('B', 'Y') and ('C', 'Z').
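
The effect of the second key is easier to see when a row matches on key1 but not on key2. The sketch below uses hypothetical DataFrames (left and right, not part of the example above) in which the 'C' rows disagree on key2, and compares a merge on both keys with a merge on key1 alone.

```python
import pandas as pd

# Hypothetical data: the 'C' rows share key1 but differ in key2.
left = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'key2': ['X', 'Y', 'Z'],
    'value_left': [1, 2, 3]
})
right = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'key2': ['X', 'Y', 'W'],
    'value_right': [4, 5, 6]
})

# Both keys must match, so the 'C' rows are dropped.
on_both = pd.merge(left, right, on=['key1', 'key2'], how='inner')

# Only key1 must match; key2 is now an ordinary column on each side
# and is returned as key2_x and key2_y.
on_key1 = pd.merge(left, right, on='key1', how='inner')

print(on_both)
print(on_key1)
```

Merging on key1 alone returns three rows and keeps both versions of key2, while merging on both keys returns only the two rows that agree on every key.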

Using left_on and right_on for Mismatched Column Names

  1. Handle cases where the join columns have different names in the DataFrames.

  2. Specify left_on and right_on in the merge() function.

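    ```python
    import pandas as pd

    df5 = pd.DataFrame({
        'lkey': ['A', 'B', 'C'],
        'value': [1, 2, 3]
    })

    df6 = pd.DataFrame({
        'rkey': ['A', 'B', 'D'],
        'value': [4, 5, 6]
    })

    # Match df5['lkey'] against df6['rkey']; both key columns stay in the result,
    # and the shared 'value' column is returned as value_x and value_y.
    merged_different_keys = pd.merge(df5, df6, left_on='lkey', right_on='rkey', how='inner')
    print(merged_different_keys)
    ```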

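    This example merges df5 and df6 where the key column has a different name in each DataFrame, using left_on and right_on to name the corresponding columns. The result keeps both lkey and rkey, and because both DataFrames also contain a non-key column named value, pandas appends the default suffixes and returns value_x (from df5) and value_y (from df6).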
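
When both DataFrames contain a non-key column with the same name, the suffixes parameter of merge() lets you choose more descriptive names than the defaults. The sketch below is one possible cleanup: the suffix strings '_df5' and '_df6' and the variable name merged_clean are illustrative choices, and dropping rkey is optional.

```python
import pandas as pd

df5 = pd.DataFrame({'lkey': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df6 = pd.DataFrame({'rkey': ['A', 'B', 'D'], 'value': [4, 5, 6]})

# suffixes replaces the default '_x' / '_y' on overlapping column names.
merged_clean = pd.merge(
    df5, df6,
    left_on='lkey', right_on='rkey',
    how='inner',
    suffixes=('_df5', '_df6')
)

# After an inner join, rkey repeats the values in lkey, so it can be dropped.
merged_clean = merged_clean.drop(columns='rkey')
print(merged_clean)
```

Dropping rkey is only safe here because an inner join guarantees that lkey and rkey are equal in every row; after an outer join the two columns can differ.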

Conclusion

The merge() function in the pandas library combines DataFrames in the same ways SQL joins combine tables. With the how parameter you choose between inner, outer, left, and right joins; with on you join on one or several shared columns; and with left_on and right_on you join on columns whose names differ between the two DataFrames. Together, these options cover most of the cases you will meet when working with relational data in pandas.