When working with data in Python, the Pandas library is a vital tool for data manipulation and analysis. One of the many useful methods it offers is sample(), which randomly selects rows or columns from a DataFrame. It is particularly handy for creating subsets of data for model training and testing, or for spot-checking a random selection of records.
In this article, you will learn how to use the sample() method from the Pandas library to select random rows or columns from a DataFrame, and how its parameters customize the sampling for specific scenarios.
Import the Pandas library and create a DataFrame.
Use the sample() method to randomly select a specific number of rows.
import pandas as pd
# Creating a DataFrame
data = {'Name': ['John', 'Anna', 'James', 'Linda'],
        'Age': [28, 22, 35, 32]}
df = pd.DataFrame(data)
# Sampling rows
sample_rows = df.sample(n=2)
print(sample_rows)
This code snippet creates a DataFrame with names and ages, and sample(n=2) randomly picks 2 distinct rows from it. The output varies each time you run it because the rows are chosen at random.
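One detail worth knowing about n: by default, rows are drawn without replacement, so n cannot exceed the number of rows. The sketch below rebuilds the same four-row DataFrame and uses the replace parameter of sample(), which the text above does not cover, to draw more rows than the DataFrame holds. The variable names oversample_failed and bootstrap are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Without replacement, asking for more rows than exist raises ValueError
try:
    df.sample(n=6)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)

# With replace=True a row can be drawn more than once,
# so n may be larger than the number of rows
bootstrap = df.sample(n=6, replace=True)
print(bootstrap)
```

Sampling with replacement repeats index labels in the result, which is expected for this kind of resampling.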
Understand that the method can also accept a fraction instead of a fixed number.
Use the frac parameter to select a percentage of the total DataFrame rows.
# Sampling a fraction of the DataFrame
sample_frac = df.sample(frac=0.5)
print(sample_frac)
Here, frac=0.5 instructs Pandas to return 50% of the rows. Since the original DataFrame has 4 rows, it returns 2 randomly chosen rows.
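A few other values of frac may help make its behavior concrete. This is a minimal sketch on the same four-row DataFrame. The names three_quarters, shuffled, and both_accepted are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# 75% of 4 rows is 3 rows
three_quarters = df.sample(frac=0.75)
print(len(three_quarters))

# frac=1 returns every row in random order, which amounts to a shuffle
shuffled = df.sample(frac=1)
print(shuffled)

# n and frac are alternatives; passing both raises ValueError
try:
    df.sample(n=2, frac=0.5)
    both_accepted = True
except ValueError:
    both_accepted = False
print(both_accepted)
```

The shuffled result keeps the original index labels, so each row can still be traced back to its position in df.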
For reproducible results, use the random_state parameter.
Pass an integer to random_state, which acts as the seed for the random number generator.
# Sampling with a seed for reproducibility
sample_seed = df.sample(n=2, random_state=1)
print(sample_seed)
Setting random_state=1 ensures that the same rows are returned every time you run the code, which is crucial for reproducible research or consistent test setups.
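One place a fixed seed pays off is the training/testing use case mentioned in the introduction. The following is a rough sketch of such a split, not a prescribed method: it assumes the DataFrame has unique index labels, and the 50/50 ratio and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Draw half of the rows as the training set, with a fixed seed
train = df.sample(frac=0.5, random_state=42)

# The remaining rows form the test set (relies on unique index labels)
test = df.drop(train.index)

# Repeating the call with the same seed selects the same rows
train_again = df.sample(frac=0.5, random_state=42)
print(train.equals(train_again))
```

Because the test set is built from whatever train leaves behind, the two subsets never overlap and together cover the whole DataFrame.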
To sample columns instead of rows, use the axis parameter.
Set axis=1 to switch the method's focus from rows to columns.
# Sampling two columns
sample_columns = df.sample(n=2, axis=1)
print(sample_columns)
This command randomly picks 2 columns from the DataFrame. Because this DataFrame has only two columns, both are always returned and only their order can change between runs unless a seed is set. On a DataFrame with more columns, the selection itself would vary.
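Column sampling is easier to see on a DataFrame with more columns than the number requested. The sketch below adds two hypothetical columns, City and Score, with made-up values purely for illustration. They are not part of the original example.

```python
import pandas as pd

# 'City' and 'Score' are hypothetical columns added for this illustration
df_wide = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                        'Age': [28, 22, 35, 32],
                        'City': ['Oslo', 'Lima', 'Cairo', 'Perth'],
                        'Score': [88, 92, 79, 85]})

# Pick 2 of the 4 columns; every row is kept
two_cols = df_wide.sample(n=2, axis=1, random_state=0)
print(two_cols)
```

Here the seed fixes which two columns are chosen. Dropping random_state lets the selection change from run to run.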
Combine row and column sampling by chaining two sample() calls.
Adjust n in each call to target a specific number of rows and columns.
# Sampling rows and columns simultaneously
sample_both = df.sample(n=2).sample(n=1, axis=1)
print(sample_both)
The first sample(n=2) selects two rows randomly, and the following .sample(n=1, axis=1) picks one column out of the resulting two-row DataFrame. Each execution can result in different subsets due to the random selection process.
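If the chained pattern is needed repeatedly, it could be wrapped in a small helper. The function sample_block below is a hypothetical convenience wrapper, not part of Pandas. It passes the same optional seed to both calls so the result can be reproduced.

```python
import pandas as pd

def sample_block(frame, n_rows, n_cols, seed=None):
    """Hypothetical helper: sample rows first, then columns."""
    rows = frame.sample(n=n_rows, random_state=seed)
    return rows.sample(n=n_cols, axis=1, random_state=seed)

df = pd.DataFrame({'Name': ['John', 'Anna', 'James', 'Linda'],
                   'Age': [28, 22, 35, 32]})

# Two random rows and one random column, reproducible via the seed
block = sample_block(df, n_rows=2, n_cols=1, seed=7)
print(block)
```

Leaving seed as None restores the default behavior, where each call can return a different subset.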
The sample() method in the Pandas library offers a powerful way to randomly select rows or columns from a DataFrame, which is useful for creating training and testing datasets or for performing spot checks on data. By mastering its parameters, such as n, frac, random_state, and axis, you can control how many rows or columns are drawn, whether the draw is reproducible, and which axis it applies to.