Python Pandas DataFrame fillna() - Fill Missing Values

Updated on January 2, 2025

Introduction

When working with data in Python, it's common to encounter missing values. These gaps can disrupt data analysis processes and lead to incorrect results if not handled properly. Pandas, a powerful data manipulation library in Python, offers several methods for managing missing values, one of which is the fillna() method. This method allows you to fill in these missing values with a specified value or method.

In this article, you will learn how to effectively use the fillna() method to manage missing values in your data sets. Discover the various strategies for filling missing values, including filling with a constant value, using a computed value like the mean or median, and forward or backward filling.

Understanding the fillna() Method

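The fillna() method in Pandas is versatile: it can fill every gap with a single constant, fill each column with a value computed from the data, or use a different fill value per column. It does not estimate values between neighboring points; for that kind of interpolation, Pandas provides a separate interpolate() method.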

Filling with a Constant Value

  1. Import the pandas library and create a DataFrame with missing values.

  2. Use the fillna() with a constant value to fill missing entries.

    python
    import pandas as pd
    df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})
    df_filled = df.fillna(0)
    print(df_filled)
    

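    In this example, every missing entry in df is replaced with 0 (Pandas stores the None values as NaN in these numeric columns). This approach is straightforward, but the filler is then treated as a real observation, so it can skew statistics such as the mean and is not suitable for every dataset.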
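To make the skew concrete, the following minimal sketch compares each column's mean before and after a constant fill. It reuses the sample DataFrame from above; the names mean_before and mean_after are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})

# mean() skips NaN by default, so this averages only the observed values.
mean_before = df.mean()

# After filling with 0, the zeros count as observations and pull the mean down.
mean_after = df.fillna(0).mean()

print(pd.DataFrame({'before': mean_before, 'after': mean_after}))
```

If the gap between the two columns of output is large for your data, a constant fill is probably distorting later calculations.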

Filling with a Statistical Value (Mean, Median)

  1. Calculate the mean or median of the columns.

  2. Use fillna() to fill the missing values with the calculated mean or median.

    python
    mean_value = df.mean()
    df_filled_mean = df.fillna(mean_value)
    print(df_filled_mean)
    

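    This code calculates the mean of each column and replaces each column's missing entries with that column's mean. Mean filling leaves each column's mean unchanged, but it reduces the column's variance because every filled entry sits exactly at the center. When a column is skewed or contains outliers, the median is usually the more robust choice.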
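As a rough illustration of that trade-off, the sketch below fills the same sample DataFrame with the median and with the mean, then compares the mean and standard deviation of column 'A' before and after. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})

# Median fill: each column's gaps receive that column's median.
df_filled_median = df.fillna(df.median())

# Mean fill, for comparison.
df_filled_mean = df.fillna(df.mean())

# The column mean is unchanged by mean filling...
print(df['A'].mean(), df_filled_mean['A'].mean())

# ...but the standard deviation shrinks, because the filled value adds no spread.
print(df['A'].std(), df_filled_mean['A'].std())
```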

Forward and Backward Filling

  1. Use ffill() to propagate the previous valid value forward, or bfill() to propagate the next valid value backward. Older code passes method='ffill' or method='bfill' to fillna(); that argument is deprecated since pandas 2.1.

  2. Apply the method to the DataFrame.

    python
    # DataFrame.ffill() and DataFrame.bfill() replace the deprecated
    # df.fillna(method='ffill') and df.fillna(method='bfill') spellings.
    df_filled_forward = df.ffill()
    df_filled_backward = df.bfill()
    print("Forward Fill:\n", df_filled_forward)
    print("Backward Fill:\n", df_filled_backward)
    

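    Forward filling (ffill) propagates the last known non-null value forward until another non-null value is encountered. Backward filling (bfill) does the opposite: it fills each gap with the next non-null value. A gap at the very start of a column (for ffill) or at the very end (for bfill) has no value to copy and stays missing; in this example, the first entry of column 'B' remains NaN after forward filling. These strategies are useful for time series data where it is reasonable to assume that the value does not change suddenly.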
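When gaps can be long, carrying a stale value forward indefinitely may be misleading. A minimal sketch, assuming a hypothetical series of daily readings, uses the limit parameter of ffill() to cap how many consecutive gaps are filled:

```python
import pandas as pd

# Hypothetical daily readings with a three-day gap and a trailing gap.
idx = pd.date_range('2025-01-01', periods=6, freq='D')
readings = pd.Series([10.0, None, None, None, 14.0, None], index=idx)

# Unlimited forward fill: every gap takes the last observed value.
filled_all = readings.ffill()

# limit=1 fills at most one consecutive missing value after each observation;
# the rest of a longer gap stays NaN.
filled_limited = readings.ffill(limit=1)

print(pd.DataFrame({'raw': readings, 'ffill': filled_all, 'ffill_limit_1': filled_limited}))
```

Leaving the remainder of a long gap as NaN keeps it visible, so it can be handled deliberately later.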

Advanced Usage of fillna()

Using Dictionaries for Column-specific Fill Values

  1. Prepare a dictionary specifying fill values for each column.

  2. Pass the dictionary to fillna() to apply different fill values for each column.

    python
    fill_values = {'A': 0, 'B': 1}
    df_filled_custom = df.fillna(fill_values)
    print(df_filled_custom)
    

    Here, column 'A' is filled with 0 and column 'B' with 1. This method offers fine control over how each column's missing values are filled.
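The dictionary values do not have to be constants, and columns left out of the dictionary are not touched. The sketch below assumes a hypothetical text column 'C' added to the sample data, fills 'A' with its mean and 'C' with a placeholder string, and leaves 'B' as it is.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': ['x', None, 'z', None],  # hypothetical text column for illustration
})

# Per-column fill values: a computed statistic for 'A', a label for 'C'.
# 'B' is not listed, so its missing value is left in place.
fill_values = {'A': df['A'].mean(), 'C': 'unknown'}
df_filled = df.fillna(fill_values)

print(df_filled)
```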

Using a Function to Determine Fill Value

  1. Define a function that computes the fill value based on the data.

  2. Use apply() to call the function on each column, and pass the result to fillna() for that column.

    python
    def compute_fill_value(series):
        return series.median()
    df_filled_func = df.apply(lambda x: x.fillna(compute_fill_value(x)))
    print(df_filled_func)
    

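    This approach is particularly useful when the operation that computes the fill value is complex. Note that apply() passes one column at a time, so compute_fill_value() sees only the column it is filling. A fill value that depends on other columns needs a different approach, such as a grouped computation.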
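For a fill value that does depend on another column, one common pattern is to compute a statistic per group and fill from that. The following is a sketch under assumed data: the df_sales DataFrame and its 'region' and 'sales' columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: the sales gaps should be filled with the median of the same region.
df_sales = pd.DataFrame({
    'region': ['north', 'north', 'north', 'south', 'south', 'south'],
    'sales': [10.0, None, 30.0, 100.0, 200.0, None],
})

# transform('median') returns a Series aligned with the original rows,
# holding each row's group median.
group_median = df_sales.groupby('region')['sales'].transform('median')

# fillna() accepts a Series and aligns it on the index, so each gap
# receives the median of its own region.
df_sales['sales_filled'] = df_sales['sales'].fillna(group_median)

print(df_sales)
```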

Conclusion

The fillna() method in Pandas is a flexible tool for dealing with missing data in Python. It accepts constant values, per-column statistics such as the mean or median, dictionaries of column-specific values, and values computed by custom functions, while forward and backward filling cover ordered data such as time series. No fill strategy removes the uncertainty that missing data introduces, so choose the one whose assumptions match your data and check how it changes your summary statistics. Used with care, these techniques help keep your datasets well prepared and your analysis accurate.