The merge() function in Python's Pandas library is a fundamental tool for combining data from multiple DataFrames, much like SQL joins. This capability is crucial when you're dealing with relational data, enabling you to bring together information from different sources efficiently and intuitively.
In this article, you will learn how to use the merge() function to join two DataFrames. Discover various join types, such as inner, outer, left, and right joins, and learn how to specify which columns to join on. This knowledge will help you handle data merging tasks with confidence and precision.
Create two DataFrames with common and unique columns.
Merge them using the merge() function by specifying the type of join.
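import pandas as pd

# Two DataFrames that share the 'key' column; only 'B' and 'D' appear in both.
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_df1': [1, 2, 3, 5]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value_df2': [2, 4, 6, 8]
})

# Inner join: keep only the keys found in both DataFrames.
merged_inner = pd.merge(df1, df2, on='key', how='inner')
print(merged_inner)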
This code merges df1 and df2 on the common column 'key' with an inner join. The result is a new DataFrame that includes only rows that have matching values in the 'key' column from both DataFrames.
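As a quick check of the behavior described above, the sketch below compares the keys in the inner-join result with the set of keys shared by both inputs. The name shared_keys is introduced here for illustration only and is not part of the original example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_df1': [1, 2, 3, 5]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value_df2': [2, 4, 6, 8]
})

merged_inner = pd.merge(df1, df2, on='key', how='inner')

# Keys present in both inputs (illustrative helper, not a pandas feature).
shared_keys = set(df1['key']) & set(df2['key'])

print(merged_inner)
print(sorted(shared_keys))                       # ['B', 'D']
print(set(merged_inner['key']) == shared_keys)   # True
```

Rows for 'A', 'C', 'E', and 'F' are dropped because each of those keys appears in only one of the two DataFrames.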
Apply different types of joins to understand their behavior.
Adjust the how parameter in the merge() function to change the join type.
merged_outer = pd.merge(df1, df2, on='key', how='outer')
merged_left = pd.merge(df1, df2, on='key', how='left')
merged_right = pd.merge(df1, df2, on='key', how='right')
print("Outer Join:\n", merged_outer)
print("Left Join:\n", merged_left)
print("Right Join:\n", merged_right)
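One way to inspect where each row of a join came from is the indicator parameter of merge(), which the original example does not use. The sketch below applies it to an outer join of the same two DataFrames; the variable names merged_flagged and origin_counts are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_df1': [1, 2, 3, 5]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value_df2': [2, 4, 6, 8]
})

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only', or 'both'.
merged_flagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged_flagged)

# Count how many rows came from each side of the join.
origin_counts = merged_flagged['_merge'].value_counts()
print(origin_counts)
```

Rows flagged 'left_only' are the ones a left join would keep with NaN in value_df2, and rows flagged 'right_only' are the ones a right join would keep with NaN in value_df1.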
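Outer join: includes all rows from both DataFrames, filling in NaN where there is no match.
Left join: includes all rows from df1 and the matched rows from df2, filling in NaN where there is no match.
Right join: includes all rows from df2 and the matched rows from df1, filling in NaN where there is no match.
Create DataFrames with multiple common columns.
Use merge() by specifying a list of columns.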
df3 = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'key2': ['X', 'Y', 'Z'],
    'value_df3': [1, 2, 3]
})
df4 = pd.DataFrame({
    'key1': ['B', 'C', 'D'],
    'key2': ['Y', 'Z', 'X'],
    'value_df4': [4, 5, 6]
})
merged_multiple_keys = pd.merge(df3, df4, on=['key1', 'key2'], how='inner')
print(merged_multiple_keys)
Here, df3 and df4 are merged on two columns, key1 and key2. The inner join type results in a DataFrame that includes only the rows with matching values in both specified columns from both DataFrames.
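To illustrate why both columns matter, the following sketch contrasts the two-key merge with a merge on key2 alone. The names merged_both and merged_key2_only are illustrative and do not appear in the original example.

```python
import pandas as pd

df3 = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'key2': ['X', 'Y', 'Z'],
    'value_df3': [1, 2, 3]
})
df4 = pd.DataFrame({
    'key1': ['B', 'C', 'D'],
    'key2': ['Y', 'Z', 'X'],
    'value_df4': [4, 5, 6]
})

# Both key1 and key2 must match, so only ('B', 'Y') and ('C', 'Z') survive.
merged_both = pd.merge(df3, df4, on=['key1', 'key2'], how='inner')

# Joining on key2 alone also pairs ('A', 'X') with ('D', 'X'). Because key1
# is no longer a join key, pandas keeps both copies as key1_x and key1_y.
merged_key2_only = pd.merge(df3, df4, on='key2', how='inner')

print(merged_both)
print(merged_key2_only)
```

The extra row in the second result pairs 'A' with 'D', which is usually not what you want when the two columns together identify a record.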
left_on and right_on for Mismatched Column Names
Handle cases where the join columns have different names in the DataFrames.
Specify left_on and right_on in the merge() function.
df5 = pd.DataFrame({
    'lkey': ['A', 'B', 'C'],
    'value': [1, 2, 3]
})
df6 = pd.DataFrame({
    'rkey': ['A', 'B', 'D'],
    'value': [4, 5, 6]
})
merged_different_keys = pd.merge(df5, df6, left_on='lkey', right_on='rkey', how='inner')
print(merged_different_keys)
This example demonstrates merging df5 and df6 where the keys have different names in each DataFrame, utilizing left_on and right_on to specify the corresponding columns.
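Because df5 and df6 both contain a column named value, the merged result carries two value columns, which pandas distinguishes with the default suffixes _x and _y. The sketch below shows one possible cleanup, using the suffixes parameter and dropping the redundant rkey column; the suffix strings and the names merged_default and merged_named are arbitrary choices made here for illustration.

```python
import pandas as pd

df5 = pd.DataFrame({
    'lkey': ['A', 'B', 'C'],
    'value': [1, 2, 3]
})
df6 = pd.DataFrame({
    'rkey': ['A', 'B', 'D'],
    'value': [4, 5, 6]
})

# Default behavior: overlapping non-key columns get '_x' and '_y' suffixes.
merged_default = pd.merge(df5, df6, left_on='lkey', right_on='rkey', how='inner')
print(merged_default.columns.tolist())  # ['lkey', 'value_x', 'rkey', 'value_y']

# Custom suffixes make the origin of each column explicit; 'rkey' duplicates
# 'lkey' after an inner join, so it can be dropped.
merged_named = pd.merge(
    df5, df6,
    left_on='lkey', right_on='rkey',
    how='inner',
    suffixes=('_df5', '_df6'),
).drop(columns='rkey')
print(merged_named)
```

Dropping rkey is only safe here because an inner join guarantees lkey and rkey are equal in every row; with an outer join the two columns can differ.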
The merge() function in Python’s Pandas library is an essential tool for combining DataFrames in various ways, mimicking the behavior of SQL joins. Using this function, you can effectively manage and analyze relational data by performing inner, outer, left, and right joins, as well as handling more complex scenarios involving multiple keys or mismatched column names. By mastering these techniques, you elevate your data manipulation skills, making data analysis tasks more streamlined and insightful.