Python Pandas DataFrame sample() - Random Row or Column Selection

Updated on December 24, 2024

Introduction

When working with data in Python, the Pandas library is a vital tool for data manipulation and analysis. One of the many useful functions it offers is sample(), which allows you to randomly select rows or columns from a DataFrame. This function is particularly handy for tasks such as creating subsets of data for model training and testing, or for spot-checking a few records from a large dataset.

In this article, you will learn how to use the sample() function from the Pandas library to select random rows or columns from a DataFrame. You will also explore the parameters that control how the sampling is performed, so you can adapt it to different scenarios.

Understanding the sample() Function

Basic Usage for Row Sampling

  1. Import the Pandas library and create a DataFrame.

  2. Use the sample() function to randomly select a specific number of rows.

    python
    import pandas as pd
    
    # Creating a DataFrame
    data = {'Name': ['John', 'Anna', 'James', 'Linda'],
            'Age': [28, 22, 35, 32]}
    df = pd.DataFrame(data)
    
    # Sampling rows
    sample_rows = df.sample(n=2)
    print(sample_rows)
    

    This code snippet creates a DataFrame with names and ages, and sample(n=2) randomly picks 2 rows from this DataFrame. The output varies each time you run it due to the randomness of the sampling process.
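One detail worth seeing in practice is that sampled rows keep their original index labels and, by default, are drawn without replacement. The sketch below reuses the DataFrame from the example above; the variable name renumbered is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Sampled rows keep the index labels they had in df
sample_rows = df.sample(n=2)
print(sample_rows.index.tolist())  # two distinct labels drawn from 0..3

# Renumber from 0 when the original labels are not needed
renumbered = sample_rows.reset_index(drop=True)
print(renumbered.index.tolist())
```

Keeping the original labels is useful when you later need to trace sampled rows back to the source DataFrame.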

Using the frac Parameter for Proportional Sampling

  1. Understand that the function can also accept a fraction instead of a fixed number.

  2. Use the frac parameter to select a percentage of the total DataFrame rows.

    python
    # Sampling a fraction of the DataFrame
    sample_frac = df.sample(frac=0.5)
    print(sample_frac)
    

    Here, frac=0.5 instructs Pandas to return 50% of the rows. Since the original DataFrame has 4 rows, it returns 2 randomly chosen rows. When the fraction does not yield a whole number of rows, Pandas rounds the result.
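As a quick check of the arithmetic, the following sketch confirms the row count for frac=0.5 and shows two related behaviors: frac=1 returns every row in shuffled order, and n and frac are alternatives that cannot be passed together. The variable names half and shuffled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Half of a 4-row DataFrame
half = df.sample(frac=0.5)
print(len(half))  # 2

# frac=1 returns all rows in random order, a common way to shuffle
shuffled = df.sample(frac=1)
print(shuffled.index.tolist())

# n and frac are mutually exclusive; passing both raises a ValueError
try:
    df.sample(n=2, frac=0.5)
except ValueError as err:
    print(f"ValueError: {err}")
```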

Setting a Seed for Reproducibility

  1. For reproducible results, use the random_state parameter.

  2. Pass an integer to random_state which acts as the seed for the random number generator.

    python
    # Sampling with a seed for reproducibility
    sample_seed = df.sample(n=2, random_state=1)
    print(sample_seed)
    

    Setting random_state=1 ensures that the same rows are returned every time you run the code on the same data, which is crucial for reproducible research or consistent test setups.
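To make the reproducibility claim concrete, the sketch below draws the same sample twice and compares the results. It then shows one common way to build a training and testing split with sample(); the 75/25 ratio and the seed value 42 are arbitrary choices for illustration, not recommendations from this article.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Two draws with the same seed select identical rows
first = df.sample(n=2, random_state=1)
second = df.sample(n=2, random_state=1)
print(first.equals(second))  # True

# Illustrative split: sample the training rows, then drop them to get the rest
train = df.sample(frac=0.75, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 3 1
```

Dropping by train.index works here because the index labels are unique; with duplicate labels this approach would remove more rows than intended.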

Sampling by Columns

Basic Column Sampling

  1. To sample columns instead of rows, use the axis parameter.

  2. Set axis=1 to switch the function's focus from rows to columns.

    python
    # Sampling two columns
    sample_columns = df.sample(n=2, axis=1)
    print(sample_columns)
    

    With axis=1, sample() draws columns instead of rows. Because this DataFrame has only two columns, n=2 always returns both Name and Age; only their order can change between runs unless a seed is set. On a DataFrame with more columns, the selected columns themselves would vary.
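Column sampling is easier to observe when there are more columns than the number requested. The sketch below adds a hypothetical City column (not part of the original example) so that choosing 2 of 3 columns can actually leave one out.

```python
import pandas as pd

# 'City' is a made-up column added only to give sample() a real choice
wide = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                     'Age': [28, 22, 35, 32],
                     'City': ['Oslo', 'Lima', 'Cairo', 'Perth']})

# Pick 2 of the 3 columns; all rows are kept
cols = wide.sample(n=2, axis=1, random_state=0)
print(cols.columns.tolist())
print(cols.shape)  # (4, 2)
```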

Combined Approach: Rows and Columns

  1. Chain two sample() calls to sample rows and columns together, since each call works along a single axis.

  2. Adjust n in each call to target a specific number of rows and columns.

    python
    # Sampling rows and columns simultaneously
    sample_both = df.sample(n=2).sample(n=1, axis=1)
    print(sample_both)
    

    The first sample(n=2) selects two rows randomly, and the following .sample(n=1, axis=1) picks one column out of the resulting two-row DataFrame. Each execution can result in different subsets due to the random selection process.
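If the chained pattern is needed in several places, it could be wrapped in a small helper. The function below, sample_rows_and_columns, is a hypothetical name introduced here for illustration and is not part of Pandas; it simply chains the two calls and passes an optional seed to both.

```python
import pandas as pd

def sample_rows_and_columns(frame, n_rows, n_cols, seed=None):
    """Hypothetical helper: sample rows first, then columns."""
    rows = frame.sample(n=n_rows, random_state=seed)
    return rows.sample(n=n_cols, axis=1, random_state=seed)

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

subset = sample_rows_and_columns(df, n_rows=2, n_cols=1, seed=7)
print(subset.shape)  # (2, 1)
```

Passing a seed makes the whole subset repeatable; leaving it as None keeps the default random behavior.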

Conclusion

The sample() function in the Pandas library offers a simple way to randomly select rows or columns from a DataFrame, which is useful for creating training and testing datasets or for spot-checking data. Its main parameters are n for a fixed count, frac for a proportion, random_state for reproducible results, and axis for choosing between rows and columns. Combine them as needed to fit random sampling into your data processing workflows.