How to Use WebDataset and PyTorch with Vultr Object Storage

Updated on November 21, 2023

Introduction

WebDataset is a PyTorch Dataset implementation for working efficiently with large-scale datasets. It provides sequential, streaming access to datasets stored as tar archives on a local disk or in cloud object storage, so you can train without unpacking the archives or keeping a local copy of the data. With WebDataset, you can scale the same code from local experiments to hundreds of GPUs. This article explains how to use WebDataset and PyTorch with tar archives stored on Vultr Object Storage.

By the end of this article, you will know:

  • How to set up s3cmd to connect to Vultr Object Storage.
  • How to create an image classification dataset (CIFAR10), upload it to Object Storage, and load the data for training.
  • How to create a speech recognition dataset (LibriSpeech) with different encoding approaches.
  • How to split the dataset into multiple shards to achieve parallel loading and shuffle by shards.

Prerequisites

Install WebDataset and S3cmd

Install webdataset and s3cmd using pip, the Python package manager.

$ pip install webdataset s3cmd

Configure s3cmd with Vultr Object Storage

  1. Run the configuration wizard and enter the details of your Vultr Object Storage.

     $ s3cmd --configure
  2. Here is an example setup:

     Enter new values or accept defaults in brackets with Enter.
     Refer to user manual for detailed description of all options.
    
     Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
     Access Key: WSZ3GHRPA189CVSGRKU6
     Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
     Default Region [US]:
    
     Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
     S3 Endpoint [s3.amazonaws.com]: sgp1.vultrobjects.com
    
     Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
     if the target S3 system supports dns based buckets.
     DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: sgp1.vultrobjects.com
    
     Encryption password is used to protect your files from reading
     by unauthorized persons while in transfer to S3
     Encryption password:
     Path to GPG program [/usr/bin/gpg]:
    
     When using secure HTTPS protocol all communication with Amazon S3
     servers is protected from 3rd party eavesdropping. This method is
     slower than plain HTTP, and can only be proxied with Python 2.7 or newer
     Use HTTPS protocol [Yes]:
    
     On some networks all internet access must go through a HTTP proxy.
     Try setting it here if you can't connect to S3 directly
     HTTP Proxy server name:
    
     New settings:
       Access Key: WSZ3GHRPA189CVSGRKU6
       Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
       Default Region: US
       S3 Endpoint: sgp1.vultrobjects.com
       DNS-style bucket+hostname:port template for accessing a bucket: sgp1.vultrobjects.com
       Encryption password:
       Path to GPG program: /usr/bin/gpg
       Use HTTPS protocol: True
       HTTP Proxy server name:
       HTTP Proxy server port: 0
    
     Test access with supplied credentials? [Y/n]
     Please wait, attempting to list all buckets...
     Success. Your access key and secret key worked fine :-)
    
     Now verifying that encryption works...
     Not configured. Never mind.
    
     Save settings? [y/N] y
     Configuration saved to '/home/ubuntu/.s3cfg'
  3. Make a bucket to store the dataset. Replace demo-bucket with your bucket name.

     $ s3cmd mb s3://demo-bucket
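
If you want to double-check what the wizard saved, the following is a minimal sketch that parses s3cmd-style configuration text with Python's standard configparser module. The embedded text is a hypothetical excerpt with placeholder credentials, and read_endpoint is a helper name made up for this illustration. In practice you would read the text from the path shown in the wizard output.

```python
import configparser

# Hypothetical excerpt of an s3cmd configuration file. The values are
# placeholders for illustration, not real credentials.
SAMPLE_S3CFG = """
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = sgp1.vultrobjects.com
host_bucket = sgp1.vultrobjects.com
use_https = True
"""


def read_endpoint(cfg_text: str) -> dict:
    """Return the endpoint-related settings from s3cmd configuration text."""
    # interpolation=None keeps templates such as "%(bucket)s.s3.amazonaws.com"
    # from being treated as configparser interpolation.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    section = parser["default"]
    return {
        "host_base": section.get("host_base"),
        "host_bucket": section.get("host_bucket"),
        "use_https": section.getboolean("use_https"),
    }


settings = read_endpoint(SAMPLE_S3CFG)
print(settings)
```

If host_base still shows s3.amazonaws.com, the endpoint prompt was probably skipped and the wizard should be run again.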

Create a WebDataset for Image Recognition

This section shows how to create a tar archive for the CIFAR10 dataset, upload it to Vultr Object Storage, and load the data for training.

This section uses the torchvision library to simplify loading the dataset. However, you can use any method that loads each sample into separate variables.

Create CIFAR10 WebDataset

  1. Create a file named create_cifar10.py as follows:

     import torchvision
     import webdataset as wds
     import sys
    
     dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)
    
     for index, (input, output) in enumerate(dataset):
         if index < 3:
             print("input", type(input), input)
             print("output", type(output), output)
             print("")
         else:
             break
  2. Run the script create_cifar10.py

     $ python create_cifar10.py
  3. Here is an example result. Each sample is a tuple of a Python Imaging Library (PIL) Image and an integer number as the category label.

     Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./temp/cifar-10-python.tar.gz
     100.0%
     Extracting ./temp/cifar-10-python.tar.gz to ./temp
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6B0>
     output <class 'int'> 6
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA680>
     output <class 'int'> 9
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6E0>
     output <class 'int'> 9
  4. Change the file create_cifar10.py as follows. For each sample, an instance of wds.TarWriter saves a dictionary of input with key ppm and output with key cls. The key ppm makes the writer save the image with an image encoder. The key cls makes the writer save the label as an integer.

     import torchvision
     import webdataset as wds
     import sys
    
     dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)
     filename = "cifar10.tar"
    
     sink = wds.TarWriter(filename)
     for index, (input, output) in enumerate(dataset):
         if index == 0:
             print("input", type(input), input)
             print("output", type(output), output)
         sink.write({
             "__key__": "sample%06d" % index,
             "ppm": input,
             "cls": output,
         })
     sink.close()
  5. Run the script create_cifar10.py to create a tar archive named cifar10.tar.

     $ python create_cifar10.py
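
To picture what ends up inside cifar10.tar, here is a minimal sketch that imitates the layout with Python's standard tarfile module instead of wds.TarWriter. It assumes that each dictionary entry becomes one file named after the sample key plus the entry's extension, and that the label is stored as ASCII text. The helpers write_sample and fields_by_key are hypothetical names, and the image bytes are a placeholder rather than a real encoded image.

```python
import io
import tarfile
from collections import defaultdict


def write_sample(tar: tarfile.TarFile, key: str, fields: dict) -> None:
    """Store each field of a sample as '<key>.<extension>' inside the tar."""
    for extension, payload in fields.items():
        info = tarfile.TarInfo(f"{key}.{extension}")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


def fields_by_key(tar_bytes: bytes) -> dict:
    """Map every sample key to the set of extensions stored for it."""
    groups = defaultdict(set)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for name in tar.getnames():
            key, _, extension = name.partition(".")
            groups[key].add(extension)
    return dict(groups)


buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for index, label in enumerate([6, 9, 9]):
        write_sample(tar, "sample%06d" % index, {
            "ppm": b"<encoded image bytes>",    # placeholder, not a real image
            "cls": str(label).encode("ascii"),  # assumption: label kept as ASCII text
        })

layout = fields_by_key(buffer.getvalue())
print(layout)
```

A check like fields_by_key is also a quick way to confirm that every sample in an archive carries both an image and a label before you upload it.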

Upload CIFAR10 to the Vultr Object Storage

Run the following command to upload cifar10.tar to Vultr Object Storage:

$ s3cmd put cifar10.tar s3://demo-bucket

Load CIFAR10 with WebDataset

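  1. Create a file named load_cifar10.py as follows. The pipe: prefix in the URL makes WebDataset run the s3cmd command and read the tar archive from its standard output. The call decode("pil") makes WebDataset decode each image into a PIL image instead of returning raw bytes. The call to_tuple("ppm", "cls") makes the dataset return a tuple built from the keys "ppm" and "cls".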

     import torch
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
     dataset = wds.WebDataset(url).decode("pil").to_tuple("ppm", "cls")
    
     for sample in islice(dataset, 0, 3):
         input, output = sample
         print("input", type(input), input)
         print("output", type(output), output)
         print()
  2. Run the script load_cifar10.py

     $ python load_cifar10.py
  3. Here is an example result. The input and output match the samples printed when you created the dataset.

     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE18E20>
     output <class 'int'> 6
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50ACDE230>
     output <class 'int'> 9
    
     input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE19AB0>
     output <class 'int'> 9
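  4. (Optional) Change load_cifar10.py to perform data augmentation and normalization. The mean and standard deviation below are the widely used ImageNet statistics. Substitute values computed on CIFAR10 if you need dataset-specific normalization.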

     import torch
     import webdataset as wds
     from itertools import islice
     from torchvision import transforms
    
     def identity(x):
         return x
    
     normalize = transforms.Normalize(
         mean=[0.485, 0.456, 0.406],
         std=[0.229, 0.224, 0.225])
    
     preprocess = transforms.Compose([
         transforms.RandomResizedCrop(32),
         transforms.RandomHorizontalFlip(),
         transforms.ToTensor(),
         normalize,
     ])
    
     url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
     dataset = (
         wds.WebDataset(url)
         .shuffle(64)
         .decode("pil")
         .to_tuple("ppm", "cls")
         .map_tuple(preprocess, identity)
     )
     for sample in islice(dataset, 0, 3):
         input, output = sample
         print("input", type(input), input)
         print("output", type(output), output)
         print()
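
The pipe: URL used in these scripts streams the archive from a command's standard output rather than from a file on disk. The sketch below illustrates that idea with the standard library only. It is not WebDataset's implementation, and a small Python child process stands in for the s3cmd command so that the example runs without network access.

```python
import io
import os
import subprocess
import sys
import tarfile
import tempfile

# Build a small local tar that stands in for the remote object.
with tempfile.NamedTemporaryFile(suffix=".tar", delete=False) as tmp:
    with tarfile.open(fileobj=tmp, mode="w") as tar:
        payload = b"6"
        info = tarfile.TarInfo("sample000000.cls")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    tar_path = tmp.name

# Stand-in for "s3cmd -q get s3://demo-bucket/cifar10.tar -": any command
# that writes the tar archive to standard output behaves the same way.
command = [
    sys.executable, "-c",
    "import shutil, sys; "
    "shutil.copyfileobj(open(sys.argv[1], 'rb'), sys.stdout.buffer)",
    tar_path,
]

names = []
try:
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        # Mode "r|" reads a non-seekable stream strictly front to back.
        with tarfile.open(fileobj=process.stdout, mode="r|") as stream:
            for member in stream:
                names.append(member.name)
finally:
    os.remove(tar_path)

print(names)
```

Because the archive is consumed as a stream, nothing is written to local storage while reading, which is the property the loading scripts rely on.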

Data Loading and Preprocessing with DataLoader

The dataset from WebDataset is a standard PyTorch IterableDataset instance. WebDataset is fully compatible with the standard PyTorch DataLoader, which replicates the dataset instance across multiple worker processes and performs data loading and preprocessing in parallel.

Here is an example of using the standard PyTorch DataLoader:

loader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=8)

batch = next(iter(loader))
batch[0].shape, batch[1].shape

The authors of WebDataset recommend explicitly batching in the dataset instance as follows:

batch_size = 20
dataloader = torch.utils.data.DataLoader(dataset.batched(batch_size), num_workers=4, batch_size=None)
images, targets = next(iter(dataloader))
images.shape

If you want to change the batch size dynamically, WebDataset provides WebLoader, a wrapper that adds a fluent (chainable) interface to the standard PyTorch DataLoader. Here is an example that uses WebLoader to unbatch, shuffle, and rebatch the loaded samples:

dataset = dataset.batched(16)

loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
loader = loader.unbatched().shuffle(1000).batched(12)

batch = next(iter(loader))
batch[0].shape, batch[1].shape
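
The rebatching idea can be sketched without PyTorch. The generators below are simplified stand-ins for the batched and unbatched stages, written for this illustration only: they use plain lists instead of tensors, omit the shuffle stage, and keep the final partial batch.

```python
from itertools import islice


def batched(samples, batch_size):
    """Group an iterable of (input, label) tuples into batches of batch_size."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        inputs, labels = zip(*chunk)
        yield list(inputs), list(labels)


def unbatched(batches):
    """Flatten batches back into individual (input, label) tuples."""
    for inputs, labels in batches:
        yield from zip(inputs, labels)


# 40 hypothetical samples: first batched by 16, then rebatched by 12.
samples = [(f"image{i}", i % 10) for i in range(40)]
first_stage = batched(samples, 16)
rebatched = list(batched(unbatched(first_stage), 12))
sizes = [len(labels) for _, labels in rebatched]
print(sizes)
```

Batching early keeps the transfer between workers and the main process efficient, while rebatching afterwards lets the training loop pick a different final batch size.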

Create a WebDataset for Speech Recognition

This section shows how to create a tar archive for the LibriSpeech dataset. LibriSpeech is a corpus of about 1000 hours of 16kHz read English speech. Each sample contains audio in the Free Lossless Audio Codec (FLAC) format, an English transcript, and some integer identifiers.

This section uses the torchaudio library to simplify loading the dataset. However, you can use any method that loads each sample into separate variables.

Create LibriSpeech WebDataset

  1. Create a file named create_librispeech.py as follows:

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
    
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         else:
             break
  2. Run the script create_librispeech.py

     $ python create_librispeech.py
  3. Here is an example result. Each sample is a tuple of a PyTorch Tensor (the waveform), the integer sample rate, the transcript string, and three integer IDs for the speaker, chapter, and utterance.

     <class 'torch.Tensor'> tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]])
     <class 'int'> 16000
     <class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 0
    
     <class 'torch.Tensor'> tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]])
     <class 'int'> 16000
     <class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 1
    
     <class 'torch.Tensor'> tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]])
     <class 'int'> 16000
     <class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 2
  4. Change the file create_librispeech.py as follows. For each sample, the key waveform.pth makes the writer save the audio waveform in the PyTorch Tensor format. The key transcript.text makes the writer save the English transcript as text. The remaining keys end with .id so that the writer saves them as integers.

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     filename = "LibriSpeech.tar"
    
     sink = wds.TarWriter(filename)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         sink.write({
             "__key__": "sample%06d" % index,
             "waveform.pth": waveform,
             "sample_rate.id": sample_rate,
             "transcript.text": transcript,
             "speaker.id": speaker_id,
             "chapter.id": chapter_id,
             "utterance.id": utterance_id
         })
     sink.close()
  5. (Optional) Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a large tar archive of about 22GB. It is much bigger than the original dataset (6GB) because the script decodes every audio file into a Tensor and saves the uncompressed Tensor in the tar archive.

     $ python create_librispeech.py
  6. Change create_librispeech.py as follows to save the audio in FLAC format. This script reads the raw bytes from the .flac files and saves them under the key waveform.flac.

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     filename = "LibriSpeech.tar"
     root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
    
     sink = wds.TarWriter(filename)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         # Load audio
         fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
         fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    
         with open(fpath, "rb") as file:
             sink.write({
                 "__key__": "sample%06d" % index,
                 "waveform.flac": file.read(),
                 "sample_rate.id": sample_rate,
                 "transcript.text": transcript,
                 "speaker.id": speaker_id,
                 "chapter.id": chapter_id,
                 "utterance.id": utterance_id
             })
     sink.close()
  7. Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a tar archive with a size of 6GB.

     $ python create_librispeech.py
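
A rough calculation shows why the decoded archive is so much larger than the FLAC one. The sketch below assumes about 100 hours of audio in train-clean-100 (as the subset name suggests) and 4 bytes per value for float32 waveforms. Both figures are assumptions for the estimate, not measurements.

```python
HOURS_OF_AUDIO = 100      # assumption: approximate length of train-clean-100
SAMPLE_RATE_HZ = 16_000   # sample rate printed for each LibriSpeech sample
BYTES_PER_VALUE = 4       # assumption: waveforms are stored as float32

# Every second of mono audio needs SAMPLE_RATE_HZ values of BYTES_PER_VALUE bytes.
raw_bytes = HOURS_OF_AUDIO * 3600 * SAMPLE_RATE_HZ * BYTES_PER_VALUE
raw_gib = raw_bytes / 2**30

print(f"{raw_bytes:,} bytes is about {raw_gib:.1f} GiB of uncompressed audio")
```

The estimate is in the same range as the 22GB archive from the optional step, while FLAC stores the same audio losslessly in a fraction of that space.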

Upload LibriSpeech to the Vultr Object Storage

Run the following command to upload LibriSpeech.tar to Vultr Object Storage:

$ s3cmd put LibriSpeech.tar s3://demo-bucket

Load LibriSpeech with WebDataset

  1. Create a file named load_librispeech.py as follows. The call decode(wds.torch_audio) makes WebDataset use torchaudio to decode the audio.

     import torch
     from torch.utils.data import IterableDataset
     from torchvision import transforms
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech.tar -"
     dataset = (
         wds.WebDataset(url)
         .decode(wds.torch_audio)
         .to_tuple(
             "waveform.flac",
             "sample_rate.id",
             "transcript.text",
             "speaker.id",
             "chapter.id",
             "utterance.id",
         )
     )
    
     for sample in islice(dataset, 0, 3):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         for i, item in enumerate(sample):
             print(type(item), item, )
         print()
  2. Run the script load_librispeech.py

     $ python load_librispeech.py
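  3. Here is an example result. The waveform is now a tuple of the decoded Tensor and its sample rate, as returned by torchaudio. The remaining fields match the output from when you created the dataset.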

     <class 'tuple'> (tensor([[-0.0065, -0.0055, -0.0062,  ...,  0.0033,  0.0005, -0.0095]]), 16000)
     <class 'int'> 16000
     <class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 0
    
     <class 'tuple'> (tensor([[-0.0059, -0.0045, -0.0067,  ...,  0.0007,  0.0034,  0.0047]]), 16000)
     <class 'int'> 16000
     <class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 1
    
     <class 'tuple'> (tensor([[ 0.0052,  0.0074,  0.0113,  ..., -0.0007, -0.0039, -0.0058]]), 16000)
     <class 'int'> 16000
     <class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
     <class 'int'> 103
     <class 'int'> 1240
     <class 'int'> 2
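
The names passed to to_tuple, such as "waveform.flac", contain a dot themselves. The sketch below shows one way to map tar member names back to a sample key and a field name, assuming the split happens at the first dot of the file's basename, which is my understanding of the WebDataset naming convention. The helpers split_member_name and group_fields are hypothetical and exist only for this illustration.

```python
def split_member_name(name: str):
    """Split 'dir/sample000000.waveform.flac' into (key, field)."""
    directory, _, basename = name.rpartition("/")
    stem, _, field = basename.partition(".")  # split at the FIRST dot
    key = f"{directory}/{stem}" if directory else stem
    return key, field


def group_fields(names):
    """Collect the field names that belong to each sample key."""
    samples = {}
    for name in names:
        key, field = split_member_name(name)
        samples.setdefault(key, []).append(field)
    return samples


members = [
    "sample000000.waveform.flac",
    "sample000000.sample_rate.id",
    "sample000000.transcript.text",
    "sample000001.waveform.flac",
]
grouped = group_fields(members)
print(grouped)
```

Under this convention the sample key itself must not contain a dot, which is why the scripts in this article use keys such as sample000000.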

Use Sharding with WebDataset

WebDataset supports sharding, which splits a dataset into many tar files (shards) so that you can read them with parallel I/O and shuffle the data at the shard level.

Here is an example of creating and loading the LibriSpeech dataset with sharding.
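
As a rough illustration of why shards enable parallel I/O, the sketch below expands a numbered shard pattern and deals the shards out to workers so that each shard is read by exactly one worker. Both helpers are simplified stand-ins written for this example. WebDataset has its own mechanisms for expanding patterns and splitting shards, which may behave differently.

```python
import re


def expand_numeric_range(pattern: str) -> list:
    """Expand one '{0000..0006}' range in a pattern into concrete names."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, stop = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding of the pattern
    return [
        pattern[:match.start()] + str(number).zfill(width) + pattern[match.end():]
        for number in range(int(start), int(stop) + 1)
    ]


def shards_for_worker(shards: list, worker_id: int, num_workers: int) -> list:
    """Give each worker every num_workers-th shard so no shard is read twice."""
    return shards[worker_id::num_workers]


shards = expand_numeric_range("LibriSpeech-{0000..0006}.tar")
assignment = {worker: shards_for_worker(shards, worker, 4) for worker in range(4)}
print(assignment)
```

With seven shards and four workers, some workers receive two shards and one receives a single shard, so it helps to create at least as many shards as workers.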

  1. Create a file named create_librispeech_sharding.py as follows:

     import torchaudio
     import webdataset as wds
     import sys
     import os
    
     dataset = torchaudio.datasets.LIBRISPEECH(
         root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
         download=True)
     root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
    
     filename = "LibriSpeech-%04d.tar"
     sink = wds.ShardWriter(filename, maxsize=1e9)
     for index, sample in enumerate(dataset):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         if index < 3:
             for i, item in enumerate(sample):
                 print(type(item), item)
             print()
         # Load audio
         fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
         fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    
         with open(fpath, "rb") as file:
             sink.write({
                 "__key__": "sample%06d" % index,
                 "waveform.flac": file.read(),
                 "sample_rate.id": sample_rate,
                 "transcript.text": transcript,
                 "speaker.id": speaker_id,
                 "chapter.id": chapter_id,
                 "utterance.id": utterance_id
             })
     sink.close()
  2. Run the script create_librispeech_sharding.py to create multiple shards of up to about 1GB each.

     $ python create_librispeech_sharding.py
  3. Upload all the shards to Vultr Object Storage.

     $ s3cmd put LibriSpeech-* s3://demo-bucket
  4. Create a file named load_librispeech_sharding.py as follows. The {0000..0006} range in the URL expands to the seven shard names. The argument shardshuffle=True makes WebDataset shuffle the order of the shards, and the shuffle(100) call then shuffles the individual samples through an in-memory buffer.

     import torch
     from torch.utils.data import IterableDataset
     from torchvision import transforms
     import webdataset as wds
     from itertools import islice
    
     url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech-{0000..0006}.tar -"
     dataset = (
         wds.WebDataset(url, shardshuffle=True)
         .shuffle(100)
         .decode(wds.torch_audio)
         .to_tuple(
             "waveform.flac",
             "sample_rate.id",
             "transcript.text",
             "speaker.id",
             "chapter.id",
             "utterance.id",
         )
     )
    
     for sample in islice(dataset, 0, 3):
         waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
         for i, item in enumerate(sample):
             print(type(item), item, )
         print()
  5. Run the script load_librispeech_sharding.py

     $ python load_librispeech_sharding.py
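
The two levels of shuffling used in the loading script can be sketched with plain Python. The example below first shuffles the order of some hypothetical shards and then passes the resulting sample stream through a bounded shuffle buffer. It is a simplified illustration of the general technique, not WebDataset's own code.

```python
import random


def shuffle_buffer(samples, buffer_size, rng):
    """Yield samples in approximately random order using a bounded buffer."""
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end and emit it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Emit whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer


rng = random.Random(0)
# Three hypothetical shards holding five consecutive samples each.
shards = [list(range(start, start + 5)) for start in (0, 5, 10)]
rng.shuffle(shards)  # shard-level shuffle
stream = (sample for shard in shards for sample in shard)
shuffled = list(shuffle_buffer(stream, buffer_size=4, rng=rng))
print(shuffled)
```

A small buffer only mixes samples that arrive close together, so shard-level shuffling is what spreads samples from different parts of the dataset across an epoch.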