Video Classification and Understanding using VideoMAE

Updated on April 26, 2024


Introduction

Video classification is a machine learning task that categorizes video scenes by assigning labels based on the video content. A video classification pipeline takes a video file as input and generates a prediction from a list of predefined classes. The process is commonly used in surveillance systems to identify and classify specific actions or behaviors captured in the input footage.

This article explains how to implement a video classification pipeline on a Vultr Cloud GPU server to identify human actions using the Video Masked Autoencoders (VideoMAE) model pre-trained on the Kinetics 400 dataset. You will load input videos, sample their frames, feed the frames to the VideoMAE processor and classification model, and get the predicted action classes for each input video clip.

Prerequisites

Before you begin:

Set Up the Server

Follow the steps below to set up the server by installing all necessary dependency packages, downloading the sample videos, and moving them to the notebook directory to use as input files for the video classification tasks.

  1. Install all dependency library packages using Pip.

    console
    $ pip install transformers pytorchvideo opencv-python matplotlib ipywidgets gdown
    

    The above command installs the following packages on the server.

    • transformers: Offers APIs and pipelines for downloading and using pretrained transformer-based models. The VideoMAE image processor and classification models are part of the transformers library.
    • pytorchvideo: Provides reusable, modular, and efficient components for running video understanding tasks. VideoMAE internally uses torchvision functions to optimize the video classification process.
    • opencv-python: Provides a wide range of image processing functionalities. You will use OpenCV to load videos and extract frames.
    • matplotlib: Creates static or interactive visualizations.
    • ipywidgets: Provides interactive HTML widgets for Jupyter Notebook sessions using the IPython kernel.
    • gdown: Downloads files from public links such as Google Drive.
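
    Optionally, verify that the installation succeeded by importing the key libraries in a Python shell or in a later notebook cell. The following check is a minimal sketch; the printed version numbers depend on your environment.

    python
    # Optional check: confirm the key libraries import and print their versions.
    import cv2
    import transformers
    import matplotlib

    print("transformers:", transformers.__version__)
    print("opencv-python:", cv2.__version__)
    print("matplotlib:", matplotlib.__version__)
    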
  2. Download a sample input file. For example, use the gdown module to download the sample videos.zip archive from a Google Drive link.

    console
    $ gdown 1lKgQaafhQFIv6db36Mav72pGBK8CVbYo
    

    You can use any other video files for classification in your session. For the purposes of this article, use the sample videos provided in the Google Drive URL.

  3. Extract the sample videos from the downloaded videos.zip archive to the notebooks directory in the Jupyter user home path. By default, JupyterLab runs under the jupyter user profile on a Vultr GPU Stack server.

    console
    $ sudo unzip videos.zip -d /home/jupyter/notebooks/
    
  4. View the JupyterLab logs and verify your unique browser access token.

    console
    $ cat /var/log/jupyterlab/lab.log
    

    Your output should look like the one below:

    To access the server, open this file in a browser:
        file:///home/jupyter/.local/share/jupyter/runtime/jpserver-1544-open.html
    Or copy and paste one of these URLs:
        http://localhost:9998/lab?token=7ab5a88b022
        http://127.0.0.1:9998/lab?token=7ab5a88b022
  5. Using a web browser such as Firefox, access your server IP on the Jupyter port 8888 with your access token. If your log output shows a different port, such as 9998 in the example above, use that port instead.

    https://SERVER-IP:8888/lab?token=7ab5a88b022

Extract Sample Frames from a Video File

  1. Within the JupyterLab interface, select Notebook to create a new file that uses the Python 3 kernel.

  2. In a new notebook code cell, import all required libraries.

    python
    import os
    import cv2
    import math
    import torch
    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image
    from transformers import AutoImageProcessor, VideoMAEForVideoClassification
    
  3. Press Shift + Enter to run the code cell and import libraries.

  4. Load the 4.mp4 sample video from the downloaded files using the OpenCV VideoCapture function.

    python
    video = cv2.VideoCapture("videos/4.mp4")
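    # Optional sanity check (minimal sketch): confirm that OpenCV opened the
    # sample file before extracting frames from it.
    if not video.isOpened():
        raise RuntimeError("Could not open videos/4.mp4")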
    
  5. Define the number of sample frames to use with the VideoMAE model. For its pre-trained Kinetics 400 model, VideoMAE uses 16 frames along the timeline as inputs.

    python
    sample_num = 16
    
  6. Get the total number of frames in the loaded video clip.

    python
    frame_count = video.get(cv2.CAP_PROP_FRAME_COUNT)
    
  7. Get the indices of the sampled frames, evenly spaced across the entire video.

    python
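    # Evenly spaced frame indices; for example, a 300-frame clip yields
    # 0, 19, 39, ..., 299 when sample_num is 16.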
    indices = np.linspace(0, frame_count - 1, sample_num).astype(int)
    
  8. Read the sampled frames into a list to use as the input of the VideoMAE classification model.

    python
    frames = []
    for i in indices:
        video.set(cv2.CAP_PROP_POS_FRAMES, i)
        frame = video.read()[1]
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    

    In the above loop:

    • video.set: Seeks to the specific video frame based on the frame index.
    • video.read()[1]: Returns the raw image data in the BGR channel order used by the OpenCV package. The read() call returns a (success, frame) tuple, and [1] selects the frame data.
    • Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)): Converts the image data from the BGR order to the standard RGB order expected by VideoMAE.
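
    video.read() may fail to decode a frame, for example at the very end of a clip. The following variant is a minimal, optional sketch that skips frames OpenCV cannot read; the rest of this article keeps the simpler loop above.

    python
    frames = []
    for i in indices:
        video.set(cv2.CAP_PROP_POS_FRAMES, i)
        success, frame = video.read()
        # Skip indices that OpenCV fails to decode instead of raising an error.
        if not success:
            continue
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    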
  9. Create a new figure to visualize the sampled frames using Matplotlib.

    python
    plt.figure(figsize=(60, 60))
    
  10. Define the number of frames to display per row (for example, 4) and calculate the total number of rows.

    python
    frame_display_per_row = 4
    frame_display_rows = math.ceil(len(frames) / frame_display_per_row)
    
  11. Visualize the sampled frames.

    python
    for i, frame in enumerate(frames):
        plt.subplot(frame_display_rows, frame_display_per_row, i+1)
        plt.imshow(frame)
        plt.xticks([])
        plt.yticks([])
    plt.tight_layout()
    

    Verify that all sampled images display as a series in your notebook. You will feed these images to the VideoMAE model for classification.

    sample frames

  12. Define a new function that returns all sampled frames to simplify the downstream steps.

    python
    def get_sample_frames(video):
        frame_count = video.get(cv2.CAP_PROP_FRAME_COUNT)
        indices = np.linspace(0, frame_count - 1, sample_num).astype(int)
        frames = []
        for i in indices:
            video.set(cv2.CAP_PROP_POS_FRAMES, i)
            frame = video.read()[1]
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        return frames
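
    # Example usage (reopens the sample clip used earlier in this section):
    # frames = get_sample_frames(cv2.VideoCapture("videos/4.mp4"))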
    

Perform Video Classification using VideoMAE

  1. Load the VideoMAE image processor pretrained on the Kinetics 400 dataset.

    python
    image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
    
  2. Load the VideoMAE video classification model pretrained on the same dataset.

    python
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
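    # Optional (assumes a CUDA-capable GPU is available): move the model to the
    # GPU for faster inference. If you uncomment this line, also move the
    # processed inputs in the next step, for example inputs = inputs.to(model.device).
    # model = model.to("cuda")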
    
  3. Preprocess the frames by resizing, scaling, and normalizing the raw frame data using the VideoMAE image processor.

    python
    inputs = image_processor(frames, return_tensors="pt")
    
  4. Get the classification model output using the preprocessed inputs.

    python
    outputs = model(**inputs)
    
  5. Extract the logits from the model outputs.

    python
    logits = outputs.logits
    
  6. Get the indices of the top 3 predicted classes among all 400 predefined classes in the Kinetics 400 dataset.

    python
    topk = 3
    predicted_indices = logits.topk(topk).indices[0]
    
  7. Get the normalized prediction probability for each class using softmax.

    python
    predicted_prob = torch.nn.functional.softmax(logits, dim=1)[0]
    
  8. Print the top-1 predicted label with its probability.

    python
    pred_idx = predicted_indices[0].item()
    print(f"{model.config.id2label[pred_idx]}: {float(predicted_prob[pred_idx])}")
    

    Output:

    reading book: 0.4024769365787506

    Based on the above prediction output, the classification aligns well with the content of the input video.
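
    To inspect all three predicted classes at once, loop over the predicted indices. This is a short optional check; the exact labels and probabilities depend on your input video.

    python
    # Print each of the top-k predicted labels with its probability.
    for idx in predicted_indices:
        idx = idx.item()
        print(f"{model.config.id2label[idx]}: {float(predicted_prob[idx])}")
    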

  9. Define a new function that performs video classification using the sampled video frames as input and outputs the top-k predicted classes with the highest probabilities.

    python
    def predict_video(frames, topk=3):
        inputs = image_processor(frames, return_tensors="pt")
        outputs = model(**inputs)
        logits = outputs.logits
    
        predicted_indices = logits.topk(topk).indices[0]
        predicted_prob = torch.nn.functional.softmax(logits, dim=1)[0]
    
        labels, probs = [], []
        for idx in predicted_indices:
            idx = idx.item()
            labels.append(model.config.id2label[idx])
            probs.append(predicted_prob[idx])
        return labels, probs
    

    You can also use the above function to scale up the video classification operation to a list of videos in batches.
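
    For example, run a quick check of the function on the frames you sampled earlier. The output values vary with the input video.

    python
    # Classify the previously sampled frames and print the top-3 predictions.
    labels, probs = predict_video(frames, topk=3)
    for label, prob in zip(labels, probs):
        print(f"{label}: {float(prob)}")
    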

Scale Up and Visualize Video Classification on Multiple Videos

  1. Define a new function that displays a video frame alongside its top-k predicted classes and the probability for each.

    python
    def plot_result(cnt, frame, filename, labels, probs):
        plt.subplot(3, 4, 2 * cnt + 1)
        plt.imshow(frame)
        plt.title(filename)
        plt.axis("off")
    
        plt.subplot(3, 4, 2 * cnt + 2)
        plt.grid()
        y = np.arange(topk)
        plt.barh(y, torch.as_tensor(probs))
        plt.gca().invert_yaxis()
        plt.gca().set_axisbelow(True)
        plt.yticks(y, labels)
        plt.xlabel("probability")
    

    The above function displays a video frame next to a horizontal bar chart of its top-k predicted classes. Its parameters carry all the necessary information: the plot index cnt, the raw sampled image data frame, the video filename, and the predicted top-k labels with the probs for each label.

  2. Set up the directory path to store your sample video files. For example, use videos/ in the Jupyter user home directory.

    python
    root = "videos/"
    
  3. Create a new Matplotlib figure to visualize the video classification results for all available videos in the directory.

    python
    plt.figure(figsize=(60, 60))
    

    Output:

    <Figure size 6000x6000 with 0 Axes>
  4. Visualize the results.

    python
    for i, filename in enumerate(sorted(os.listdir(root))):
        file_path = os.path.join(root, filename)
        video = cv2.VideoCapture(file_path) 
        frames = get_sample_frames(video)
        labels, probs = predict_video(frames, topk)
        plot_result(i, frames[0], filename, labels, probs)
    
    plt.subplots_adjust(wspace=0.9)
    plt.show()
    

    Verify that the visualization results display in your session with all videos classified to the most suitable classes.

    video classification results
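
    Optionally, keep a copy of the combined figure by saving it to a file. Add the following line just before the plt.show() call in the previous cell; the results.png filename is an example.

    python
    # Save the combined classification figure to a file before displaying it.
    plt.savefig("results.png", bbox_inches="tight")
    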

Conclusion

You have set up a video classification pipeline using VideoMAE on a Vultr Cloud GPU server. You prepared the input video frames by sampling evenly spaced frames along each clip's timeline. Then, you fed the frames to the VideoMAE image processor and video classification model, and scaled the pipeline to classify a list of video files. For more information, visit the VideoMAE model page.

Download the model pipeline Jupyter Notebook file to view the implementations applied in this article.