How to Use WebDataset and PyTorch with Vultr Object Storage

Updated on September 16, 2022

Introduction

WebDataset is a PyTorch Dataset implementation for working with large-scale datasets efficiently. WebDataset provides sequential/streaming access to datasets stored as tar archives, either on local disk or in cloud object storage, so you can train without unpacking the archives and stream data with no local copy. With WebDataset, you can scale the same code from local experiments to training on hundreds of GPUs. This article explains how to use WebDataset and PyTorch with tar archives stored on Vultr Object Storage.
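Under the hood, a WebDataset tar archive stores each sample as a group of adjacent files that share the same basename (the sample key), with the file extension selecting the decoder. The following standard-library-only sketch (the keys and byte contents are made up for illustration) writes two samples in that convention and groups them back by key, which is essentially what WebDataset does when streaming:

```python
import io
import tarfile

# Each sample is a set of tar members sharing a basename ("key");
# the extension selects the decoder (e.g. ".cls" for an integer label).
samples = [("sample000000", b"6", b"fake-image-bytes-0"),
           ("sample000001", b"9", b"fake-image-bytes-1")]

buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for key, label, image in samples:
        for ext, data in (("cls", label), ("ppm", image)):
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Reading side: stream members sequentially and group them by key.
buffer.seek(0)
grouped = {}
with tarfile.open(fileobj=buffer, mode="r") as tar:
    for member in tar:
        key, ext = member.name.rsplit(".", 1)
        grouped.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(grouped["sample000000"]["cls"])  # b'6'
```

Because the members of a sample sit next to each other in the archive, the reader never needs random access, which is what makes streaming from object storage practical.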

By the end of this article, you will know:

  • How to set up s3cmd to connect to Vultr Object Storage.
  • How to create an image classification dataset (CIFAR10), upload it to Object Storage, and load the data for training.
  • How to create a speech recognition dataset (LibriSpeech) with different encoding approaches.
  • How to split the dataset into multiple shards to enable parallel loading and shard-level shuffling.

Prerequisites

Install WebDataset and S3cmd

Install webdataset and s3cmd using the pip package manager.

$ pip install webdataset s3cmd

Configure s3cmd with Vultr Object Storage

  1. Run the interactive configuration and enter the details of your Vultr Object Storage:

     $ s3cmd --configure
  2. Here is an example setup

     Enter new values or accept defaults in brackets with Enter.
     Refer to user manual for detailed description of all options.
    
     Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
     Access Key: WSZ3GHRPA189CVSGRKU6
     Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
     Default Region [US]:
    
     Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
     S3 Endpoint [s3.amazonaws.com]: sgp1.vultrobjects.com
    
     Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
     if the target S3 system supports dns based buckets.
     DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: sgp1.vultrobjects.com
    
     Encryption password is used to protect your files from reading
     by unauthorized persons while in transfer to S3
     Encryption password:
     Path to GPG program [/usr/bin/gpg]:
    
     When using secure HTTPS protocol all communication with Amazon S3
     servers is protected from 3rd party eavesdropping. This method is
     slower than plain HTTP, and can only be proxied with Python 2.7 or newer
     Use HTTPS protocol [Yes]:
    
     On some networks all internet access must go through a HTTP proxy.
     Try setting it here if you can't connect to S3 directly
     HTTP Proxy server name:
    
     New settings:
       Access Key: WSZ3GHRPA189CVSGRKU6
       Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
       Default Region: US
       S3 Endpoint: sgp1.vultrobjects.com
       DNS-style bucket+hostname:port template for accessing a bucket: sgp1.vultrobjects.com
       Encryption password:
       Path to GPG program: /usr/bin/gpg
       Use HTTPS protocol: True
       HTTP Proxy server name:
       HTTP Proxy server port: 0
    
     Test access with supplied credentials? [Y/n]
     Please wait, attempting to list all buckets...
     Success. Your access key and secret key worked fine :-)
    
     Now verifying that encryption works...
     Not configured. Never mind.
    
     Save settings? [y/N] y
     Configuration saved to '/home/ubuntu/.s3cfg'
  3. Make a bucket to store the dataset. Replace demo-bucket with your bucket name.

     $ s3cmd mb s3://demo-bucket

Create a WebDataset for Image Recognition

This section shows how to create a tar archive for the CIFAR10 dataset, upload it to Vultr Object Storage, and load the data for training.

This section uses the torchvision library to simplify processing the dataset. However, you can use any method that loads each sample into separate variables.

Create CIFAR10 WebDataset

  1. Create a file named create_cifar10.py as follows:

     import torchvision
     import webdataset as wds
     import sys
    
     dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)
    
     for index, (input, output) in enumerate(dataset):
         if index < 3:
             print("input", type(input), input)
             print("output", type(output), output)
             print("")
         else:
             break
  2. Run the script create_cifar10.py

     $ python create_cifar10.py
  3. Here is an example result. Each sample is a tuple of a Python Imaging Library (PIL) image and an integer category label.

     Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./temp/cifar-10-python.tar.gz
     100.0%
     Extracting ./temp/cifar-10-python.tar.gz to ./temp
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6B0>
     output <class 'int'> 6
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA680>
     output <class 'int'> 9
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6E0>
     output <class 'int'> 9
  4. Change the file create_cifar10.py as follows. For each sample, an instance of wds.TarWriter writes a dictionary with the input under the key ppm and the output under the key cls. The ppm extension makes the writer encode the image with an image encoder, and the cls extension makes the writer save the label as an integer.

     import torchvision
     import webdataset as wds
     import sys
    
     dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)
     filename = "cifar10.tar"
    
     sink = wds.TarWriter(filename)
     for index, (input, output) in enumerate(dataset):
         if index == 0:
             print("input", type(input), input)
             print("output", type(output), output)
         sink.write({
             "__key__": "sample%06d" % index,
             "ppm": input,
             "cls": output,
         })
     sink.close()
  5. Run the script create_cifar10.py to create a tar archive named cifar10.tar.

     $ python create_cifar10.py

Upload CIFAR10 to the Vultr Object Storage

Run the following command to upload cifar10.tar to Vultr Object Storage:

$ s3cmd put cifar10.tar s3://demo-bucket

Load CIFAR10 with WebDataset

  1. Create a file named load_cifar10.py as follows. The decode("pil") call makes WebDataset decode the input image into a PIL image instead of raw bytes. The to_tuple("ppm", "cls") call makes the dataset return a tuple built from the keys "ppm" and "cls".

     import torch
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
     dataset = wds.WebDataset(url).decode("pil").to_tuple("ppm", "cls")
    
     for sample in islice(dataset, 0, 3):
         input, output = sample
         print("input", type(input), input)
         print("output", type(output), output)
         print()
  2. Run the script load_cifar10.py

     $ python load_cifar10.py
  3. Here is an example result. The input and output match the previous step.

     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE18E20>
     output <class 'int'> 6
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50ACDE230>
     output <class 'int'> 9
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE19AB0>
     output <class 'int'> 9
  4. (Optional) Change load_cifar10.py as follows to perform data augmentation and normalization.

     import torch
     import webdataset as wds
     from itertools import islice
     from torchvision import transforms
    
     def identity(x):
         return x
    
     normalize = transforms.Normalize(
         mean=[0.485, 0.456, 0.406],
         std=[0.229, 0.224, 0.225])
    
     preprocess = transforms.Compose([
         transforms.RandomResizedCrop(32),
         transforms.RandomHorizontalFlip(),
         transforms.ToTensor(),
         normalize,
     ])
    
     url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
     dataset = (
         wds.WebDataset(url)
         .shuffle(64)
         .decode("pil")
         .to_tuple("ppm", "cls")
         .map_tuple(preprocess, identity)
     )
     for sample in islice(dataset, 0, 3):
         input, output = sample
         print("input", type(input), input)
         print("output", type(output), output)
         print()

Data Loading and Preprocessing with DataLoader

The dataset from WebDataset is a standard PyTorch IterableDataset instance. WebDataset is fully compatible with the standard PyTorch DataLoader, which replicates the dataset instance across multiple worker processes to perform parallel data loading and preprocessing.

Here is an example of using the standard PyTorch DataLoader:

loader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=8)

batch = next(iter(loader))
batch[0].shape, batch[1].shape

The authors of WebDataset recommend batching explicitly in the dataset instance as follows:

batch_size = 20
dataloader = torch.utils.data.DataLoader(dataset.batched(batch_size), num_workers=4, batch_size=None)
images, targets = next(iter(dataloader))
images.shape

If you want to change the batch size dynamically, WebDataset provides WebLoader, a wrapper that adds a fluid interface to the standard PyTorch DataLoader. Here is an example of using WebLoader:

dataset = dataset.batched(16)

loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
loader = loader.unbatched().shuffle(1000).batched(12)

batch = next(iter(loader))
batch[0].shape, batch[1].shape

Create a WebDataset for Speech Recognition

This section shows how to create a tar archive for the LibriSpeech dataset. LibriSpeech is a corpus of about 1000 hours of 16kHz read English speech. Each sample contains audio in the Free Lossless Audio Codec (FLAC) format, an English transcript, and several integer identifiers.

This section uses the torchaudio library to simplify processing the dataset. However, you can use any method that loads each sample into separate variables.

Create LibriSpeech WebDataset

  1. Create a file named create_librispeech.py as follows:

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
    
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         else:
             break
  2. Run the script create_librispeech.py

     $ python create_librispeech.py
  3. Here is an example result. Each sample is a tuple of a PyTorch Tensor, a string containing the English transcript, and several integer identifiers.

     <class 'torch.Tensor'> tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]])
     <class 'int'> 16000
     <class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 0
    
     <class 'torch.Tensor'> tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]])
     <class 'int'> 16000
     <class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 1
    
     <class 'torch.Tensor'> tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]])
     <class 'int'> 16000
     <class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 2
  4. Change the file create_librispeech.py as follows. For each sample, the key waveform.pth makes the writer save the audio waveform in PyTorch Tensor format. The key transcript.text makes the writer save the English transcript as text. The remaining keys end with .id, which makes the writer save each value as an integer.

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     filename = "LibriSpeech.tar"
    
     sink = wds.TarWriter(filename)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         sink.write({
             "__key__": "sample%06d" % index,
             "waveform.pth": waveform,
             "sample_rate.id": sample_rate,
             "transcript.text": transcript,
             "speaker.id": speaker_id,
             "chapter.id": chapter_id,
             "utterance.id": utterance_id
         })
     sink.close()
  5. (Optional) Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a large tar archive of about 22GB. This approach produces an archive much larger than the original dataset (6GB) because it decodes every audio file into a Tensor before saving it to the tar archive.

     $ python create_librispeech.py
  6. Change create_librispeech.py as follows to save the audio in FLAC format. This script reads the raw bytes from the .flac files and saves them under the key waveform.flac.

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     filename = "LibriSpeech.tar"
     root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
    
     sink = wds.TarWriter(filename)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         # Load audio
         fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
         fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    
         with open(fpath, "rb") as file:
             sink.write({
                 "__key__": "sample%06d" % index,
                 "waveform.flac": file.read(),
                 "sample_rate.id": sample_rate,
                 "transcript.text": transcript,
                 "speaker.id": speaker_id,
                 "chapter.id": chapter_id,
                 "utterance.id": utterance_id
             })
     sink.close()
  7. Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a tar archive of about 6GB.

     $ python create_librispeech.py
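The size gap between the two encodings above is easy to estimate: LibriSpeech audio is 16-bit, 16 kHz speech, so a decoded float32 Tensor stores 4 bytes per sample versus 2 bytes for the raw PCM inside a FLAC file, and FLAC compresses that PCM further. A rough back-of-envelope sketch (the exact FLAC compression ratio varies per recording and is not modeled here):

```python
# Back-of-envelope storage estimate for one hour of 16 kHz mono speech.
sample_rate = 16000               # samples per second
seconds = 3600                    # one hour

pcm_16bit = sample_rate * 2 * seconds    # raw 16-bit PCM: 2 bytes/sample
tensor_f32 = sample_rate * 4 * seconds   # decoded float32 Tensor: 4 bytes/sample

print(pcm_16bit // 2**20, "MiB/hour as raw 16-bit PCM")   # 109 MiB
print(tensor_f32 // 2**20, "MiB/hour as float32 Tensor")  # 219 MiB
# FLAC shrinks the PCM further still, so the Tensor archive ends up
# several times larger than the FLAC-based one (22GB vs. 6GB above).
```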

Upload LibriSpeech to the Vultr Object Storage

Run the following command to upload LibriSpeech.tar to Vultr Object Storage:

$ s3cmd put LibriSpeech.tar s3://demo-bucket

Load LibriSpeech with WebDataset

  1. Create a file named load_librispeech.py as follows. The decode(wds.torch_audio) call makes WebDataset use torchaudio to decode the audio.

     import torch
     from torch.utils.data import IterableDataset
     from torchvision import transforms
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech.tar -"
     dataset = wds.WebDataset(url).decode(wds.torch_audio).to_tuple("waveform.flac",
                                                                    "sample_rate.id",
                                                                    "transcript.text",
                                                                    "speaker.id",
                                                                    "chapter.id",
                                                                    "utterance.id")
    
     for sample in islice(dataset, 0, 3):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         for i, item in enumerate(sample):
             print(type(item), item, )
         print()
  2. Run the script load_librispeech.py

     $ python load_librispeech.py
  3. Here is an example result. The values match the previous step; note that the decoded waveform.flac entry is a tuple of the Tensor and its sample rate.

     <class 'tuple'> (tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]]), 16000)
     <class 'int'> 16000
     <class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 0
    
     <class 'tuple'> (tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]]), 16000)
     <class 'int'> 16000
     <class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 1
    
     <class 'tuple'> (tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]]), 16000)
     <class 'int'> 16000
     <class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 2

Use Sharding with WebDataset

WebDataset supports sharding, which splits the dataset into many shards to enable parallel I/O and shard-level shuffling.

Here is an example of creating and loading the LibriSpeech dataset with sharding.

  1. Create a file named create_librispeech_sharding.py as follows:

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
    
     filename = "LibriSpeech-%04d.tar"
     sink = wds.ShardWriter(filename, maxsize=1e9)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         # Load audio
         fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
         fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    
         with open(fpath, "rb") as file:
             sink.write({
                 "__key__": "sample%06d" % index,
                 "waveform.flac": file.read(),
                 "sample_rate.id": sample_rate,
                 "transcript.text": transcript,
                 "speaker.id": speaker_id,
                 "chapter.id": chapter_id,
                 "utterance.id": utterance_id
             })
     sink.close()
  2. Run the script create_librispeech_sharding.py to create multiple shards of about 1GB each.

     $ python create_librispeech_sharding.py
  3. Upload all the shards to Vultr Object Storage:

     $ s3cmd put LibriSpeech-* s3://demo-bucket
  4. Create a file named load_librispeech_sharding.py as follows. The shardshuffle=True argument makes WebDataset shuffle the dataset at the shard level, while the shuffle method shuffles individual samples in an in-memory buffer.

     import torch
     from torch.utils.data import IterableDataset
     from torchvision import transforms
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech-{0000..0006}.tar -"
     dataset = wds.WebDataset(url, shardshuffle=True).shuffle(100).decode(wds.torch_audio).to_tuple("waveform.flac",
                                                                                                    "sample_rate.id",
                                                                                                    "transcript.text",
                                                                                                    "speaker.id",
                                                                                                    "chapter.id",
                                                                                                    "utterance.id")
    
     for sample in islice(dataset, 0, 3):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         for i, item in enumerate(sample):
             print(type(item), item, )
         print()
  5. Run the script load_librispeech_sharding.py

     $ python load_librispeech_sharding.py
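The shard URL above uses brace notation ({0000..0006}), which WebDataset expands into one URL per shard before opening them (it relies on the braceexpand package for this). The following standard-library-only sketch, which handles just a single zero-padded numeric range, shows the idea:

```python
import re

def expand_shards(pattern):
    """Expand one {start..end} numeric range, e.g.
    'LibriSpeech-{0000..0006}.tar' -> seven shard names."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero-padding width
    return [pattern[:match.start()] + str(i).zfill(width) + pattern[match.end():]
            for i in range(int(start), int(end) + 1)]

shards = expand_shards("LibriSpeech-{0000..0006}.tar")
print(shards[0], shards[-1])  # LibriSpeech-0000.tar LibriSpeech-0006.tar
print(len(shards))            # 7
```

Because each expanded URL is an independent tar stream, DataLoader workers can each open a different shard, which is what gives sharding its parallel I/O benefit.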