
Introduction
When working with data in Python, the Pandas library is a vital tool for data manipulation and analysis. One of the many useful functions it offers is sample(), which allows you to randomly select rows or columns from a DataFrame. This function is particularly handy for tasks such as creating subsets of data for model training and testing, or simply for spot-checking random samples of the data.
In this article, you will learn how to use the sample() function from the Pandas library to select random rows or columns within a DataFrame. You will also explore the parameters that control how the sampling is performed, so that you can adapt it to different data sets and scenarios.
Understanding the sample() Function
Basic Usage for Row Sampling
Import the Pandas library and create a DataFrame.
Use the sample() function to randomly select a specific number of rows.

```python
import pandas as pd

# Creating a DataFrame
data = {'Name': ['John', 'Anna', 'James', 'Linda'], 'Age': [28, 22, 35, 32]}
df = pd.DataFrame(data)

# Sampling rows
sample_rows = df.sample(n=2)
print(sample_rows)
```
This code snippet creates a DataFrame with names and ages, and sample(n=2) randomly picks 2 rows from this DataFrame. The output varies each time you run it due to the randomness of the sampling process.
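One detail worth seeing in practice is that sampled rows keep the index labels they had in the original DataFrame. The sketch below rebuilds the same DataFrame so that it runs on its own; the variable names picked and picked_reset are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Sampled rows keep their original index labels (two distinct values from 0..3)
picked = df.sample(n=2)
print(picked.index.tolist())

# Optional: renumber the sampled rows starting from 0
picked_reset = picked.reset_index(drop=True)
print(picked_reset)
```

Keeping the original labels is useful when you later need to trace sampled rows back to the source data. Resetting the index is only a convenience for display or positional access.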
Using the frac Parameter for Proportional Sampling
The function can also take a fraction of the rows instead of a fixed number.
Use the frac parameter to select a percentage of the total DataFrame rows.

```python
# Sampling a fraction of the DataFrame
sample_frac = df.sample(frac=0.5)
print(sample_frac)
```
Here, frac=0.5 instructs Pandas to return 50% of the rows. Since the original DataFrame has 4 rows, it returns 2 randomly chosen rows.
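Proportional sampling is a common way to build the training and testing subsets mentioned in the introduction. The following is a minimal sketch, not a full splitting utility. It assumes the DataFrame index is unique, and the 75/25 split and the seed value 0 are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Draw 75% of the rows for training (3 of the 4 rows); the seed is arbitrary
train = df.sample(frac=0.75, random_state=0)

# The rows that were not drawn form the test set (relies on unique index labels)
test = df.drop(train.index)

print(len(train), len(test))  # 3 1
```

Dropping by index label guarantees that the two subsets do not overlap, as long as no label is repeated in the original DataFrame.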
Setting a Seed for Reproducibility
For reproducible results, use the random_state parameter.
Pass an integer to random_state, which acts as the seed for the random number generator.

```python
# Sampling with a seed for reproducibility
sample_seed = df.sample(n=2, random_state=1)
print(sample_seed)
```
Setting random_state=1 ensures that the same rows are returned every time you run the code, which is crucial for reproducible research or consistent test setups.
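A quick way to confirm this behaviour is to sample twice with the same seed and compare the results. The sketch below rebuilds the example DataFrame so that it is self-contained; first and second are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Two calls with the same seed return identical rows in identical order
first = df.sample(n=2, random_state=1)
second = df.sample(n=2, random_state=1)
print(first.equals(second))  # True

# Without a seed, the selection may differ from one call to the next
unseeded = df.sample(n=2)
print(unseeded)
```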
Sampling by Columns
Basic Column Sampling
To sample columns instead of rows, use the axis parameter.
Set axis=1 to switch the function's focus from rows to columns.

```python
# Sampling two columns
sample_columns = df.sample(n=2, axis=1)
print(sample_columns)
```
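This command draws 2 columns at random from the DataFrame. Because the example DataFrame has only two columns, Name and Age, both always appear in the output, and only their order can change between runs unless a seed is set. With a wider DataFrame, the selected columns themselves would vary.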
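Column sampling is easier to observe on a DataFrame that has more columns than are drawn. The sketch below extends the example data with two hypothetical columns, City and Score, which are not part of the original example, and uses an arbitrary seed so that the result is repeatable.

```python
import pandas as pd

# 'City' and 'Score' are hypothetical columns added only for this illustration
wide = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                     'Age': [28, 22, 35, 32],
                     'City': ['Oslo', 'Lima', 'Cairo', 'Perth'],
                     'Score': [88, 92, 79, 85]})

# Draw 2 of the 4 columns; every row is kept
cols = wide.sample(n=2, axis=1, random_state=42)
print(cols.columns.tolist())
print(cols.shape)  # (4, 2)
```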
Combined Approach: Rows and Columns
Combine row and column sampling by chaining two sample() calls in a single expression.
Adjust the parameters of each call to target a specific number of rows and columns.

```python
# Sampling rows and columns in one chained expression
sample_both = df.sample(n=2).sample(n=1, axis=1)
print(sample_both)
```
The first sample(n=2) selects two rows randomly, and the following .sample(n=1, axis=1) picks one column out of the resulting two-row DataFrame. Each execution can result in different subsets due to the random selection process.
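If the combined result needs to be repeatable, each call in the chain can take its own random_state. This is a small sketch under that assumption; the seed values are arbitrary and the name subset is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Seed both steps so the same rows and the same column come back on every run
subset = df.sample(n=2, random_state=1).sample(n=1, axis=1, random_state=1)
print(subset.shape)  # (2, 1)
print(subset)
```

Seeding only one of the two calls would fix that step alone, and the other would still vary between runs.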
Conclusion
The sample() function in the Pandas library offers a simple way to randomly select rows or columns from a DataFrame, which is useful for creating training and testing datasets or for spot-checking data. By learning its parameters, including n, frac, random_state, and axis, you can control how many rows or columns are drawn, along which axis, and whether the result is reproducible. Apply these techniques wherever your data processing workflows call for random subsets.