Python Pandas DataFrame describe() - Generate Descriptive Statistics

Updated on December 24, 2024

Introduction

When analyzing data, especially large datasets, a quick statistical summary is a useful first step. The describe() method in the Python Pandas library serves this purpose: it generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding missing (NaN) values.

In this article, you will learn how to use the describe() method to obtain descriptive statistics for datasets that contain mostly numerical data, and how the method applies to categorical and datetime data. The examples show how to tailor the output to specific analysis needs.

Understanding the describe() Method

Basics of describe()

  1. Import the Pandas library and create a sample DataFrame.

  2. Apply the describe() method to get an overview of statistical summaries.

    python
    import pandas as pd
    
    data = {
        'Age': [25, 22, 23, 25, 24],
        'Income': [50000, 54000, 50000, 48500, 60000]
    }
    df = pd.DataFrame(data)
    description = df.describe()
    print(description)
    

    This code snippet generates a DataFrame from a dictionary of age and income, and then applies describe() to output statistics like count, mean, std (standard deviation), min, 25% (first quartile), 50% (median), 75% (third quartile), and max.
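Because describe() returns an ordinary DataFrame, individual statistics can be read back with label-based indexing. The following is a minimal sketch that reuses the sample data from this example; the variable names mean_age and median_income are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 22, 23, 25, 24],
    'Income': [50000, 54000, 50000, 48500, 60000],
})
description = df.describe()

# Rows are statistics and columns are the original column names
mean_age = description.loc['mean', 'Age']
median_income = description.loc['50%', 'Income']
print(mean_age, median_income)

# Transpose to get one row per column, which can be easier to read for wide data
print(description.T)
```

Transposing is a matter of taste; it tends to help when the DataFrame has many columns and the default layout becomes too wide to scan.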

Insights from Output

  1. Interpret key stats from the describe() output:

    • count: The number of non-null entries.
    • mean: The arithmetic mean of the values.
    • std: The sample standard deviation (normalized by N-1).
    • min: The minimum value.
    • 25%: First quartile (25th percentile).
    • 50%: Median (second quartile).
    • 75%: Third quartile (75th percentile).
    • max: The maximum value.

    These metrics help identify central values, spread, and range, crucial for initial data assessments.
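To make these definitions concrete, the sketch below compares a few values from describe() with the same statistics computed directly. It assumes the Age values from the earlier example, plus a hypothetical Series with one missing value (ages_with_gap) to show how count behaves.

```python
import math
import pandas as pd

ages = pd.Series([25, 22, 23, 25, 24], name='Age')
summary = ages.describe()

# std is the sample standard deviation, the same as Series.std() with ddof=1
assert math.isclose(summary['std'], ages.std(ddof=1))

# 50% is the median
assert summary['50%'] == ages.median()

# count only includes non-missing values
ages_with_gap = pd.Series([25, 22, None, 25, 24])
count_with_gap = ages_with_gap.describe()['count']
print(count_with_gap)
```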

Advanced Usage of describe()

Customizing Descriptive Statistics

  1. Customize the statistics you want to see using the percentiles parameter.

  2. Execute the method with specified percentiles.

    python
    custom_description = df.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
    print(custom_description)
    

    Setting custom percentiles adds the 5th and 95th percentiles to the output, giving insight into the distribution’s tails. Each value passed to percentiles must lie between 0 and 1.
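As a rough check on what the percentile rows contain, the sketch below describes the Income values from the earlier example as a Series and compares the result with Series.quantile(). The names tails, p05, and p95 are illustrative.

```python
import math
import pandas as pd

income = pd.Series([50000, 54000, 50000, 48500, 60000], name='Income')
tails = income.describe(percentiles=[0.05, 0.95])

# Each requested percentile becomes a row labeled with its percentage
p05 = tails['5%']
p95 = tails['95%']

# The values agree with Series.quantile(), which interpolates linearly by default
assert math.isclose(p05, income.quantile(0.05))
assert math.isclose(p95, income.quantile(0.95))
print(p05, p95)
```

With only five observations, the tail percentiles are interpolated between the two nearest values, so they are best read as approximate on small samples.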

Handling Non-Numeric Data

  1. Understand the default behavior: on a DataFrame with mixed column types, describe() summarizes only the numeric columns and skips non-numeric (categorical or object) ones.

  2. Use the include parameter to force the method to consider categorical data.

    python
    df['Gender'] = ['Female', 'Male', 'Female', 'Female', 'Male']
    description_all = df.describe(include='all')
    print(description_all)
    

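    Including categorical data adds statistics such as count, unique, top (the most frequent value), and freq (how often that value occurs). With include='all', statistics that do not apply to a column appear as NaN, for example unique for Age or mean for Gender.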
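When only the non-numeric columns are of interest, two alternatives avoid the NaN-filled rows of a combined summary. This is a minimal sketch with a reduced version of the sample data; gender_stats and non_numeric are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 22, 23, 25, 24],
    'Gender': ['Female', 'Male', 'Female', 'Female', 'Male'],
})

# Describe a single non-numeric column
gender_stats = df['Gender'].describe()
print(gender_stats)

# Or drop the numeric columns and describe whatever remains
non_numeric = df.describe(exclude=[np.number])
print(non_numeric)
```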

Descriptive Statistics for Time Series Data

  1. When analyzing time series, it’s important to handle date and time data appropriately.

  2. Assume a DataFrame has a time-related column, and apply describe().

    python
    import numpy as np
    
    df['Date'] = pd.date_range('20230101', periods=5)
    # Summarize the numeric and datetime columns together
    time_stats = df.describe(include=[np.number, 'datetime'])
    print(time_stats)
    

    Here, the method summarizes both numeric and datetime data, essential for time-based analysis.
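For a closer look at the datetime summary on its own, the sketch below describes a standalone date Series. It assumes pandas 2.0 or later, where datetime values are summarized like numbers (count, mean, min, percentiles, max); older versions may label the output differently. The names earliest and latest are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.date_range('2023-01-01', periods=5), name='Date')
date_stats = dates.describe()
print(date_stats)

# Assumes pandas 2.0+, where datetime values are summarized like numbers
earliest = date_stats['min']
latest = date_stats['max']
print(latest - earliest)  # length of the covered time span
```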

Conclusion

The describe() method in Pandas condenses the statistical summary of DataFrame columns into a single call, providing a solid starting point for data analysis. Adjustments such as custom percentiles and the include parameter extend that summary to the tails of a distribution and to categorical and datetime columns. Apply these techniques to assess basic statistics quickly before moving on to more complex analysis and decision-making.