How to Use WebDataset and PyTorch with Vultr Object Storage
Introduction
WebDataset is a PyTorch Dataset implementation to work with large-scale datasets efficiently. WebDataset provides sequential/streaming access directly to datasets stored in tar archives in a local disk or a cloud storage object for training without unpacking and can stream data with no local storage. With WebDataset, you can scale up the same code from running local experiments to using hundreds of GPUs. This article explains how to use WebDataset and PyTorch with tar archives stored on Vultr Object Storage.
By the end of this article, you will know:
- How to set up s3cmd to connect with Vultr Object Storage.
- How to create an image classification dataset (CIFAR10), upload it to Object Storage, and load the data for training.
- How to create a speech recognition dataset (LibriSpeech) with different encoding approaches.
- How to split the dataset into multiple shards to achieve parallel loading and shuffle by shards.
Prerequisites
- Create a Vultr Object Storage bucket.
- Prepare the access key and secret key for your Object Storage.
- An Ubuntu Workstation with PyTorch installed. See our selection of GPU-enabled images in the Vultr Marketplace.
Install WebDataset and S3cmd
Install webdataset and s3cmd using the Python package manager:
$ pip install webdataset s3cmd
Configure s3cmd with Vultr Object Storage
Run the configuration and enter your Vultr Object Storage information:
$ s3cmd --configure
Here is an example setup:

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key: WSZ3GHRPA189CVSGRKU6
Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
Default Region [US]:

Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: sgp1.vultrobjects.com

Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
if the target S3 system supports dns based buckets.
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: sgp1.vultrobjects.com

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password:
Path to GPG program [/usr/bin/gpg]:

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [Yes]:

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name:

New settings:
  Access Key: WSZ3GHRPA189CVSGRKU6
  Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
  Default Region: US
  S3 Endpoint: sgp1.vultrobjects.com
  DNS-style bucket+hostname:port template for accessing a bucket: sgp1.vultrobjects.com
  Encryption password:
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: True
  HTTP Proxy server name:
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n]
Please wait, attempting to list all buckets...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Not configured. Never mind.

Save settings? [y/N] y
Configuration saved to '/home/ubuntu/.s3cfg'
Make a bucket to store the dataset. Replace demo-bucket with your bucket name:

$ s3cmd mb s3://demo-bucket
Create a WebDataset for Image Recognition
This section shows how to create a tar archive for the CIFAR10 dataset, upload it to a Vultr Object Storage, and load the data for training.
This section uses the torchvision library to simplify the explanation of processing the dataset. However, you can use any method to load each sample into separate variables.
Create CIFAR10 WebDataset
Create a file named create_cifar10.py as follows:

import torchvision
import webdataset as wds
import sys

dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)

for index, (input, output) in enumerate(dataset):
    if index < 3:
        print("input", type(input), input)
        print("output", type(output), output)
        print("")
    else:
        break
Run the script create_cifar10.py:

$ python create_cifar10.py
Here is an example result. Each sample is a tuple of a Python Imaging Library (PIL) Image and an integer number as the category label.
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./temp/cifar-10-python.tar.gz
100.0%
Extracting ./temp/cifar-10-python.tar.gz to ./temp
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6B0>
output <class 'int'> 6

input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA680>
output <class 'int'> 9

input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6E0>
output <class 'int'> 9
Change the file create_cifar10.py as follows. For each sample, an instance of wds.TarWriter saves a dictionary of input with key ppm and output with key cls. The key ppm makes the writer save the image with an image encoder. The key cls makes the writer save the label as an integer.

import torchvision
import webdataset as wds
import sys

dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)

filename = "cifar10.tar"
sink = wds.TarWriter(filename)
for index, (input, output) in enumerate(dataset):
    if index == 0:
        print("input", type(input), input)
        print("output", type(output), output)
    sink.write({
        "__key__": "sample%06d" % index,
        "ppm": input,
        "cls": output,
    })
sink.close()
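Inside the tar archive, TarWriter names each member by joining the __key__ with the field's extension, so all files belonging to one sample share a basename. A minimal sketch of this naming scheme in plain Python (independent of WebDataset):

```python
# Each dict written by TarWriter becomes tar members named "<__key__>.<extension>".
key = "sample%06d" % 0  # the zero-padded index keeps samples in order
members = [f"{key}.{ext}" for ext in ("ppm", "cls")]
print(members)  # ['sample000000.ppm', 'sample000000.cls']
```

Readers later regroup members by this shared basename, which is how WebDataset reassembles each sample.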
Run the script create_cifar10.py to create a tar archive named cifar10.tar:

$ python create_cifar10.py
Upload CIFAR10 to the Vultr Object Storage
Run the following command to upload cifar10.tar to the Vultr Object Storage:

$ s3cmd put cifar10.tar s3://demo-bucket
Load CIFAR10 with WebDataset
Create a file named load_cifar10.py as follows. The code decode("pil") makes WebDataset decode the input image into a PIL image instead of raw bytes. The code to_tuple("ppm", "cls") makes the dataset return a tuple from the keys "ppm" and "cls".

import torch
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
dataset = wds.WebDataset(url).decode("pil").to_tuple("ppm", "cls")

for sample in islice(dataset, 0, 3):
    input, output = sample
    print("input", type(input), input)
    print("output", type(output), output)
    print()
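The pipe: scheme in the URL tells WebDataset to run the given shell command and read the tar stream from its standard output. A minimal sketch of the same idea using Python's subprocess module, with echo standing in for the s3cmd download:

```python
import subprocess

# WebDataset's "pipe:" URLs run a shell command and stream its stdout;
# here "echo" is a stand-in for "s3cmd -q get s3://demo-bucket/cifar10.tar -".
proc = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE)
data = proc.stdout.read()
proc.wait()
print(data)  # b'hello\n'
```

Because the data is consumed as a stream, training can start before the whole archive is downloaded, and no local copy is kept.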
Run the script load_cifar10.py:

$ python load_cifar10.py
Here is an example result. The input and output match the previous step.
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE18E20>
output <class 'int'> 6

input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50ACDE230>
output <class 'int'> 9

input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE19AB0>
output <class 'int'> 9
(Optional) Change load_cifar10.py as follows to perform data augmentation and normalization.

import torch
import webdataset as wds
from itertools import islice
from torchvision import transforms

def identity(x):
    return x

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
dataset = (
    wds.WebDataset(url)
    .shuffle(64)
    .decode("pil")
    .to_tuple("ppm", "cls")
    .map_tuple(preprocess, identity)
)

for sample in islice(dataset, 0, 3):
    input, output = sample
    print("input", type(input), input)
    print("output", type(output), output)
    print()
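transforms.Normalize applies (x - mean) / std to each channel; the mean and std lists above are the commonly used ImageNet statistics. A scalar sketch of the arithmetic for a single hypothetical pixel value:

```python
# Normalize maps a pixel value x to (x - mean) / std, per channel.
mean, std = 0.485, 0.229  # red-channel statistics from the transform above
x = 0.714                 # example pixel value, exactly one std above the mean
print(round((x - mean) / std, 3))  # 1.0
```

After this step, channel values are roughly zero-centered with unit variance, which typically helps training converge.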
Data Loading and Preprocessing with DataLoader
The dataset from WebDataset is a standard PyTorch IterableDataset instance. WebDataset is fully compatible with the standard PyTorch DataLoader, which replicates the dataset instance across multiple worker processes to perform parallel data loading and preprocessing.
Here is an example of using the standard PyTorch DataLoader:
loader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=8)
batch = next(iter(loader))
batch[0].shape, batch[1].shape
The authors of WebDataset recommend explicitly batching in the dataset instance as follows:
batch_size = 20
dataloader = torch.utils.data.DataLoader(dataset.batched(batch_size), num_workers=4, batch_size=None)
images, targets = next(iter(dataloader))
images.shape
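The batched() method groups consecutive samples inside the dataset pipeline and collates them into batch tensors. A simplified model of just the grouping step in plain Python (the real implementation also performs the tensor collation):

```python
def batched(iterable, batch_size):
    # Group consecutive samples into lists of batch_size; the last batch may be short.
    batch = []
    for sample in iterable:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Batching inside the dataset means each worker process assembles whole batches, so the DataLoader only has to forward them, which is why batch_size=None is passed to the DataLoader above.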
If you want to change the batch size dynamically, WebDataset provides a wrapper that adds a fluid interface to the standard PyTorch DataLoader. Here is an example of using the WebLoader from WebDataset:
dataset = dataset.batched(16)
loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
loader = loader.unbatched().shuffle(1000).batched(12)
batch = next(iter(loader))
batch[0].shape, batch[1].shape
Create a WebDataset for Speech Recognition
This section shows how to create a tar archive for the LibriSpeech dataset. LibriSpeech is a corpus of about 1000 hours of 16kHz read English speech. Each sample contains audio in the Free Lossless Audio Codec (FLAC) format, an English transcript, and several integer IDs (speaker, chapter, and utterance).
This section uses the torchaudio library to simplify the explanation of processing the dataset. However, you can use any method to load each sample into separate variables.
Create LibriSpeech WebDataset
Create a file named create_librispeech.py as follows:

import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp",
    folder_in_archive="LibriSpeech",
    url="train-clean-100",
    download=True)

for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    else:
        break
Run the script create_librispeech.py:

$ python create_librispeech.py
Here is an example result. Each sample is a tuple of a PyTorch Tensor, a string of English text, and some integer numbers.
<class 'torch.Tensor'> tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]])
<class 'int'> 16000
<class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 0

<class 'torch.Tensor'> tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]])
<class 'int'> 16000
<class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 1

<class 'torch.Tensor'> tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]])
<class 'int'> 16000
<class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 2
Change the file create_librispeech.py as follows. For each sample, the key waveform.pth makes the writer save the audio waveform in the PyTorch Tensor format. The key transcript.text makes the writer save the English transcript as text. The other keys end with .id to save the values as integer numbers.

import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp",
    folder_in_archive="LibriSpeech",
    url="train-clean-100",
    download=True)

filename = "LibriSpeech.tar"
sink = wds.TarWriter(filename)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    sink.write({
        "__key__": "sample%06d" % index,
        "waveform.pth": waveform,
        "sample_rate.id": sample_rate,
        "transcript.text": transcript,
        "speaker.id": speaker_id,
        "chapter.id": chapter_id,
        "utterance.id": utterance_id
    })
sink.close()
(Optional) Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a large tar archive with a size of 22GB. This approach produces a larger archive than the original dataset (6GB) because it decodes all the audio files into Tensors and saves them into the tar archive.

$ python create_librispeech.py
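The 22GB figure follows from simple arithmetic: train-clean-100 holds roughly 100 hours of 16kHz audio, and each decoded Tensor element takes 4 bytes as float32. A back-of-the-envelope sketch (not an exact measurement):

```python
hours = 100            # train-clean-100 contains about 100 hours of speech
sample_rate = 16_000   # audio samples per second
bytes_per_sample = 4   # float32 Tensor elements take 4 bytes each
total_bytes = hours * 3600 * sample_rate * bytes_per_sample
print(round(total_bytes / 1e9))  # ~23 GB, close to the observed 22GB archive
```

FLAC stores the same audio losslessly compressed from 16-bit integers, which is how the original dataset fits in about 6GB.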
Change create_librispeech.py as follows to save the audio in FLAC format. This script reads the raw bytes from the .flac files and saves them with the key waveform.flac:

import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp",
    folder_in_archive="LibriSpeech",
    url="train-clean-100",
    download=True)

filename = "LibriSpeech.tar"
root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
sink = wds.TarWriter(filename)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    # Load audio
    fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
    fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    with open(fpath, "rb") as file:
        sink.write({
            "__key__": "sample%06d" % index,
            "waveform.flac": file.read(),
            "sample_rate.id": sample_rate,
            "transcript.text": transcript,
            "speaker.id": speaker_id,
            "chapter.id": chapter_id,
            "utterance.id": utterance_id
        })
sink.close()
Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a tar archive with a size of 6GB.

$ python create_librispeech.py
Upload LibriSpeech to the Vultr Object Storage
Run the following command to upload LibriSpeech.tar to the Vultr Object Storage:

$ s3cmd put LibriSpeech.tar s3://demo-bucket
Load LibriSpeech with WebDataset
Create a file named load_librispeech.py as follows. The code decode(wds.torch_audio) makes WebDataset use torchaudio to decode the audio.

import torch
from torch.utils.data import IterableDataset
from torchvision import transforms
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech.tar -"
dataset = wds.WebDataset(url).decode(wds.torch_audio).to_tuple(
    "waveform.flac", "sample_rate.id", "transcript.text",
    "speaker.id", "chapter.id", "utterance.id")

for sample in islice(dataset, 0, 3):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    for i, item in enumerate(sample):
        print(type(item), item)
    print()
Run the script load_librispeech.py:

$ python load_librispeech.py
Here is an example result. The waveform.flac field decodes to a tuple of the waveform Tensor and the sample rate; the other fields match the previous step.
<class 'tuple'> (tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]]), 16000)
<class 'int'> 16000
<class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 0

<class 'tuple'> (tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]]), 16000)
<class 'int'> 16000
<class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 1

<class 'tuple'> (tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]]), 16000)
<class 'int'> 16000
<class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 2
Use Sharding with WebDataset
WebDataset supports sharding, which splits the dataset into multiple tar archives to achieve parallel I/O and to shuffle data by shards.
Here is an example of creating and loading the LibriSpeech dataset with sharding.
Create a file named create_librispeech_sharding.py as follows:

import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp",
    folder_in_archive="LibriSpeech",
    url="train-clean-100",
    download=True)

root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
filename = "LibriSpeech-%04d.tar"
sink = wds.ShardWriter(filename, maxsize=1e9)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    # Load audio
    fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
    fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    with open(fpath, "rb") as file:
        sink.write({
            "__key__": "sample%06d" % index,
            "waveform.flac": file.read(),
            "sample_rate.id": sample_rate,
            "transcript.text": transcript,
            "speaker.id": speaker_id,
            "chapter.id": chapter_id,
            "utterance.id": utterance_id
        })
sink.close()
Run the script create_librispeech_sharding.py to create multiple shards of 1GB each:

$ python create_librispeech_sharding.py
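ShardWriter's rotation can be modeled simply: it keeps writing samples to the current shard and moves on to the next file in the %04d pattern once the maxsize byte limit would be exceeded. A simplified model in plain Python (not the actual implementation, which also supports a sample-count limit):

```python
def assign_shards(sample_sizes, maxsize):
    # Return the shard index for each sample, rotating when the current shard is full.
    shard, used, result = 0, 0, []
    for size in sample_sizes:
        if used and used + size > maxsize:  # current shard would overflow: rotate
            shard, used = shard + 1, 0
        result.append(shard)
        used += size
    return result

# Four 400 MB samples with a 1 GB limit fill two shards.
print(assign_shards([400e6] * 4, 1e9))  # [0, 0, 1, 1]
```

Because each shard is an independent tar archive, multiple workers can each stream a different shard at the same time.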
Upload all the shards to the Vultr Object Storage:

$ s3cmd put LibriSpeech-* s3://demo-bucket
Create a file named load_librispeech_sharding.py as follows. The code shardshuffle=True makes WebDataset shuffle the dataset by shards, while the shuffle method shuffles the samples inline.

import torch
from torch.utils.data import IterableDataset
from torchvision import transforms
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech-{0000..0006}.tar -"
dataset = wds.WebDataset(url, shardshuffle=True).shuffle(100).decode(wds.torch_audio).to_tuple(
    "waveform.flac", "sample_rate.id", "transcript.text",
    "speaker.id", "chapter.id", "utterance.id")

for sample in islice(dataset, 0, 3):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    for i, item in enumerate(sample):
        print(type(item), item)
    print()
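The shuffle(100) stage is an in-stream buffer shuffle: it holds a bounded buffer of samples and emits a randomly chosen one as each new sample streams in, so shuffling is approximate but memory-bounded. A minimal sketch of the idea (a hypothetical helper, not WebDataset's exact implementation):

```python
import random

def buffered_shuffle(stream, bufsize, seed=0):
    # Approximate shuffle over a stream: keep up to bufsize items in a buffer,
    # emitting a randomly chosen one whenever the buffer is full.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the remaining buffer at the end of the stream
        yield buf.pop(rng.randrange(len(buf)))

out = list(buffered_shuffle(range(10), bufsize=4))
print(sorted(out) == list(range(10)))  # True: same items, different order
```

Combining shardshuffle=True (randomize shard order) with a buffer shuffle (randomize sample order within the stream) gives good randomization without ever loading the whole dataset into memory.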
Run the script load_librispeech_sharding.py:

$ python load_librispeech_sharding.py