Transcribing and Translating Audio | Generative AI Series

Updated on December 11, 2024

Introduction

Whisper is a foundation model from OpenAI. You can use this model to convert speech to text (transcription) or to translate speech in other languages into English text (translation).

Foundation models are trained on massive amounts of data and form the basis of more advanced or specialized models. For instance, OpenAI trained the Whisper model on an audio dataset containing more than 680,000 hours of multilingual speech. This massive dataset allows Whisper to learn the patterns and relationships it needs to perform natural language processing (NLP) tasks such as transcription and translation.

In this article, you'll deploy a Vultr Cloud GPU instance and install the required libraries to use OpenAI's Whisper model with Python to transcribe audio and translate speech into English text.

Prerequisites

Before you begin:

  • Deploy a Vultr Cloud GPU instance to run the Whisper model.
  • Access the instance using SSH.

Install the FFmpeg Package

The Whisper model requires the FFmpeg package, which provides libraries and tools for processing multimedia content such as audio and video; Whisper uses FFmpeg to decode audio files. Follow the steps below to install FFmpeg:

  1. Install FFmpeg using the command that matches your Linux distribution:

    • Ubuntu or Debian:

      console
      $ sudo apt update 
      $ sudo apt install ffmpeg
      
    • Arch Linux:

      console
      $ sudo pacman -S ffmpeg
      
  2. Use pip to install the openai-whisper package.

    console
    $ pip install openai-whisper
    
  3. Verify that the openai-whisper package installed correctly by checking its version.

    console
    $ pip show openai-whisper
    

    Output:

    Name: openai-whisper
    Version: 20231117
    Summary: Robust Speech Recognition via Large-Scale Weak Supervision
    Home-page: https://github.com/openai/whisper
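
Optionally, you can run a quick check from Python before moving on. The snippet below is a minimal sketch and not part of the article's files; it assumes the environment you just set up, confirms that the ffmpeg binary is on your PATH (Whisper calls it to decode audio), and lists the model sizes that openai-whisper can download.

    python
    import shutil
    import whisper
    
    # Whisper shells out to the ffmpeg binary to decode audio, so it must be on PATH.
    print("ffmpeg found at:", shutil.which("ffmpeg"))
    
    # List the checkpoint sizes openai-whisper can download (tiny, base, small, medium, large, ...).
    print("Available models:", whisper.available_models())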

Transcribe an Audio File With Python

In this section, you'll download a sample audio file of a speech by Steve Jobs, the visionary co-founder of Apple. Then, you'll use the Whisper model with Python to transcribe the audio file to text. Follow the steps below:

  1. Download the sample steve-jobs.mp3 file using the Linux wget command.

    console
    $ wget https://docs.vultr.com/public/doc-assets/new/implementing-audio-transcription-with-translation-genai-series/steve-jobs.mp3
    
  2. Create a new transcribe.py file using a text editor like nano.

    console
    $ nano transcribe.py
    
  3. Enter the following code into the transcribe.py file. The script loads the medium Whisper model, transcribes the Steve Jobs audio sample, and prints the resulting text.

    python
    import whisper
    
    # Path to the sample audio file downloaded in the previous step
    audio_file = "steve-jobs.mp3"
    
    # Load the medium Whisper checkpoint (downloaded automatically on first run)
    model = whisper.load_model("medium")
    
    # Transcribe the audio and print the recognized text
    result = model.transcribe(audio_file)
    
    print(result["text"].strip())
    
  4. Save and close the file.

  5. Run the transcribe.py file.

    console
    $ python3 transcribe.py
    
  6. Verify the following output.

    I'm honored to be with you today for your commencement from one of the finest universities in the world. Truth be told, I never graduated from college, and this is the closest I've ever gotten to a college graduation. Today, I want to tell you three stories from 
    ...
    down the road will give you the confidence to follow your heart even when it leads you off the well-worn path, and that will make all the difference.
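
In addition to the full text, Whisper's transcribe() method returns per-segment timing information. As an optional extension, the minimal sketch below (reusing the same steve-jobs.mp3 file and medium model) prints each segment with its start and end times in seconds:

    python
    import whisper
    
    model = whisper.load_model("medium")
    result = model.transcribe("steve-jobs.mp3")
    
    # Each segment carries start/end timestamps (in seconds) and its recognized text.
    for segment in result["segments"]:
        print(f"[{segment['start']:6.1f}s -> {segment['end']:6.1f}s] {segment['text'].strip()}")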

Translate an Audio File from Spanish to English

In addition to transcribing audio, you can use OpenAI's Whisper model to translate speech from other languages into English text. Follow the steps below:

  1. Download a sample spanish.mp3 file.

    console
    $ wget https://docs.vultr.com/public/doc-assets/new/implementing-audio-transcription-with-translation-genai-series/spanish.mp3
    
  2. Create a new translate.py file.

    console
    $ nano translate.py
    
  3. Enter the following code into the translate.py file. The script loads the Spanish audio sample with the medium Whisper model and sets the translate task, which outputs the transcription as English text.

    python
    import whisper
    
    # Path to the Spanish audio sample downloaded in the previous step
    audio_file = "spanish.mp3"
    
    # Load the medium Whisper checkpoint
    model = whisper.load_model("medium")
    
    # The translate task transcribes the speech and translates it into English
    result = model.transcribe(audio_file, task="translate")
    
    print(result["text"].strip())
    
  4. Run the translate.py file.

    console
    $ python3 translate.py
    
  5. Verify the following output.

    What do you think artificial intelligence is? What do I think it is? I don't know how to describe it. Something that is not natural, obviously. Artificial intelligence is, through data, introducing an algorithm
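
Whisper can also detect the spoken language before you decide whether to transcribe or translate. The sketch below follows the language-detection example documented in the openai-whisper README, reusing the same spanish.mp3 file and medium model; it is an optional addition rather than part of the article's scripts:

    python
    import whisper
    
    model = whisper.load_model("medium")
    
    # Load the audio and pad or trim it to the 30-second window Whisper expects.
    audio = whisper.load_audio("spanish.mp3")
    audio = whisper.pad_or_trim(audio)
    
    # Compute the log-Mel spectrogram and move it to the same device as the model.
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    
    # detect_language returns per-language probabilities; pick the most likely one.
    _, probs = model.detect_language(mel)
    print("Detected language:", max(probs, key=probs.get))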

Conclusion

In this article, you explored how to use OpenAI's Whisper foundation model to transcribe and translate sample audio files. You started by transcribing an English audio file to text, then translated a Spanish audio sample into English text.