Voice Swap using NVIDIA NeMo on Vultr Cloud GPU

Updated on October 16, 2023


Neural Modules (NeMo) is an open-source toolkit designed for users who work with conversational AI. It's part of the NVIDIA GPU Cloud (NGC) collection, which includes a library of tools and ready-to-use models designed to efficiently handle artificial intelligence and high-performance computing projects.

This article explains how to perform voice swap using the NVIDIA NeMo framework on a Vultr Cloud GPU server. You perform tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) using a PyTorch GPU-accelerated container from the NGC Catalog. In addition, you convert an English male voice audio sample to an English female voice audio sample by running pre-trained NeMo models for Natural Language Processing (NLP) tasks.


Before you begin, be sure to:

Deploy the PyTorch GPU Container and Access Jupyter Notebook

In this section, install and run the PyTorch GPU container with port binding, then access the Jupyter Notebook instance pre-installed in the container.

  1. Install and run the PyTorch GPU container

     $ sudo docker run --gpus all -p 9000:8888 -it nvcr.io/nvidia/pytorch:23.09-py3

    The above command runs the PyTorch GPU-accelerated container with the following values:

    • --gpus all: Allocates all available host server GPU resources to the container
    • -p 9000:8888: Maps the host port 9000 to the container port 8888. This lets you access the container's Jupyter Notebook on a port different from any instance running on the host
    • -it: Interactively starts a new shell session of the container terminal

    When successful, verify that you can access the container shell

  2. Start a new Jupyter Notebook instance

     # jupyter notebook --ip=0.0.0.0

    Your output should look like the one below:

         To access the notebook, open this file in a browser:
             ...
         Or copy and paste this URL:
             http://...:8888/?token=...

    Copy your generated access token to securely access the Jupyter Notebook instance in your web browser

  3. In a web browser such as Chrome, access Jupyter Notebook on the mapped host port 9000 using the generated access token


Run the Pre-Trained Models

In this section, install the required libraries to use the models and necessary NeMo functions. Then, import the NeMo modules, initialize the pre-trained models, and perform voice swap tasks as described in the steps below.

  1. Access your Jupyter Notebook web interface

  2. On the middle right bar, click the New dropdown to reveal the options list

    Create a new Jupyter Notebook

  3. Click Notebook, and select Python 3 (ipykernel) to open a new file

  4. In a new code cell, install dependency packages

     !pip install Cython nemo_toolkit[all] hydra-core transformers sentencepiece webdataset youtokentome pyannote.metrics jiwer ijson sacremoses sacrebleu rouge_score einops unidic-lite mecab-python3 opencc pangu ipadic wandb nemo_text_processing pytorch-lightning

    Below is what each package represents:

    • Cython: A Python superset that allows you to write C extensions for Python often used for performance optimization
    • nemo_toolkit[all]: A framework for building conversational AI models. The [all] flag installs all available NeMo components and dependencies
    • hydra-core: A framework for configuring complex applications. It's used to manage configuration settings in a clean and organized way
    • transformers: Works with pre-trained models in Natural Language Processing (NLP), including models like BERT, GPT-2, and more
    • sentencepiece: Handles text tokenization and segmentation, often used in NLP tasks
    • webdataset: Enables efficient data loading and augmentation, particularly useful in deep learning workflows
    • youtokentome: A subword tokenization library useful in language modeling tasks
    • pyannote.metrics: A toolkit for speaker diarization and audio analysis that contains evaluation metrics for these tasks
    • jiwer: A library for computing the Word Error Rate (WER), commonly used in Automatic Speech Recognition (ASR) and other speech processing tasks
    • ijson: A library for parsing large JSON documents incrementally. It's useful when working with large data files
    • sacremoses: Handles tokenization, detokenization, and various text processing tasks
    • sacrebleu: Evaluates machine translation quality using the BLEU metric
    • rouge_score: A library for computing the ROUGE evaluation metric that is often used in text summarization and machine translation
    • einops: Handles tensor operations and reshaping, which can be helpful in deep learning model development
    • unidic-lite: A lightweight version of the UniDic morphological analysis dictionary for Japanese
    • mecab-python3: Python bindings for MeCab, a Japanese tokenizer and part-of-speech tagger
    • opencc: A simplified and traditional Chinese text conversion library
    • pangu: A text spacing library that inserts spaces between CJK characters and Latin characters or digits
    • ipadic: A Japanese morphological analysis dictionary for MeCab
    • wandb: Tracks and visualizes machine learning experiments
    • nemo_text_processing: Contains text processing utilities specific to the NVIDIA NeMo toolkit
    • pytorch-lightning: A PyTorch lightweight wrapper that simplifies the training of Deep Learning models
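As an aside, the Word Error Rate that jiwer reports can be illustrated with a short, standard-library-only sketch. This is a simplified illustration of the metric, not jiwer's actual implementation:

```python
# Word Error Rate (WER): word-level edit distance divided by reference length.
# Simplified illustration of the metric jiwer computes; not jiwer's own code.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("it is very like the old portrait",
          "it is very like an old portrait"))  # 1 substitution / 7 words ≈ 0.143
```

A WER of 0 means the hypothesis matches the reference exactly; higher values indicate more word-level errors.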
  5. Import the necessary modules

     import nemo
     import nemo.collections.asr as nemo_asr
     import nemo.collections.nlp as nemo_nlp
     import nemo.collections.tts as nemo_tts
     import IPython

    Below is what each of the imported modules represents:

    • nemo: Allows you to access the NeMo functionalities and classes
    • nemo.collections.asr: Enables access to the NeMo ASR-related functionalities and models
    • nemo_nlp: Allows you to use the NeMo NLP-related tools, models, and utilities
    • nemo_tts: Allows you to use the NeMo TTS-related functionalities and models
    • IPython: Allows you to interactively run and experiment with NeMo code
  6. List the pre-trained models available from the NGC NeMo catalog

      nemo_asr.models.EncDecCTCModel.list_available_models()
      nemo_tts.models.FastPitchModel.list_available_models()
      nemo_tts.models.HifiGanModel.list_available_models()

    The above commands output the list of available models from the following catalogs:

    • Automatic Speech Recognition (ASR) models that use the encoder-decoder architecture
    • Text-to-Speech (TTS) models from the HifiGan and FastPitch categories respectively

    From the above available list, use the following models:

    • stt_en_quartznet15x5: To handle speech recognition tasks, specific to only the English language
    • tts_en_fastpitch: Generates a spectrogram from text input for text-to-speech, specific to the English language
    • tts_en_hifigan: Converts spectrograms into speech for TTS, specific to the English language
  7. Download and initialize the models

     quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_en_quartznet15x5').cuda()
     spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
     vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()

    The download and initialization may take up to 15 minutes to complete.
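Conceptually, the three models form a pipeline: QuartzNet transcribes audio into text, FastPitch turns the text into a spectrogram, and HiFi-GAN converts the spectrogram back into audio. The sketch below traces that data flow with stand-in stub functions (hypothetical placeholders, not the NeMo API):

```python
# Data-flow sketch of the voice-swap pipeline. Each stub stands in for one
# NeMo model: quartznet (ASR), spec_generator (FastPitch), vocoder (HiFi-GAN).

def asr_transcribe(audio):
    # Stand-in for quartznet.transcribe(): audio samples -> text
    return "hello world"

def generate_spectrogram(text):
    # Stand-in for spec_generator: text -> time-frequency matrix
    # (here: one frame per word, 80 mel-style frequency bins per frame)
    return [[0.0] * 80 for _ in text.split()]

def vocode(spectrogram):
    # Stand-in for vocoder: spectrogram frames -> audio samples
    return [0.0] * (len(spectrogram) * 256)  # 256 samples per frame

source_audio = [0.0] * 16000          # one second of fake 16 kHz audio
text = asr_transcribe(source_audio)   # speech -> text
spec = generate_spectrogram(text)     # text -> spectrogram (new voice)
swapped_audio = vocode(spec)          # spectrogram -> audio
print(len(spec), len(swapped_audio))  # 2 512
```

The voice swap works because the ASR stage discards the original speaker's voice characteristics, keeping only the words; the TTS stages then re-synthesize those words in the pre-trained female voice.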

Perform Voice Swapping

  1. Import the audio sample. Replace the URL with your desired audio source

     Audio_sample = '2086-149220-0033.wav'
     !wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
     IPython.display.Audio(Audio_sample)

    The above command downloads an English audio .wav file with the male voice from the provided URL. Then, it uses IPython.display.Audio to display and play the audio in your Jupyter Notebook file.

  2. Transcribe the audio sample

     files = [Audio_sample]
     raw_text = ''
     text = ''
     for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
       raw_text = transcription
     text = raw_text

    The above code transcribes the provided audio sample using the QuartzNet model. Your output should look like the one below:


     well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait
  3. Generate the spectrogram

     def text_to_audio(text):
       parsed = spec_generator.parse(text)
       spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
       audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
       return audio.to('cpu').detach().numpy()

    In the above command, the text_to_audio function takes a transcript, parses it, and generates a spectrogram using the text-to-speech model tts_en_fastpitch. The vocoder model tts_en_hifigan then converts the spectrogram into audio. The spectrogram is an intermediate representation in text-to-speech synthesis that captures the spectral characteristics of the generated audio.
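To make the spectrogram idea concrete, the toy sketch below (standard library only, not the FastPitch implementation) splits a signal into frames and measures each frame's energy per frequency with a discrete Fourier transform:

```python
import cmath
import math

def toy_spectrogram(signal, frame_size=64):
    # Split the signal into non-overlapping frames, then take the DFT
    # magnitude of each frame to get its energy per frequency bin.
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    spec = []
    for frame in frames:
        magnitudes = []
        for k in range(frame_size // 2):  # non-negative frequencies only
            bin_sum = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                          for n, x in enumerate(frame))
            magnitudes.append(abs(bin_sum))
        spec.append(magnitudes)
    return spec

# A 256-sample sine wave completing 4 cycles per 64-sample frame
signal = [math.sin(2 * math.pi * 4 * n / 64) for n in range(256)]
spec = toy_spectrogram(signal)
print(len(spec), len(spec[0]))                   # 4 frames x 32 bins
print(max(range(32), key=lambda k: spec[0][k]))  # energy peaks at bin 4
```

Production TTS models use mel-scaled, overlapping frames rather than this raw DFT, but the time-by-frequency structure is the same.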

  4. Generate the swapped audio

      IPython.display.Audio(text_to_audio(text), rate=22050)

    The above command plays the swapped audio sample, converted from a male English voice to a female English voice.


You have built an AI voice swap system using NeMo pre-trained models running in an NGC GPU-accelerated container. You converted an English male voice audio sample to an English female voice audio sample. Using NeMo modules and pre-trained models from the NGC catalog makes the speech recognition and synthesis pipeline more efficient and convenient to use.

More Information

For more information, visit the following resources: