Video Classification and Understanding using VideoMAE

Updated on July 25, 2024

Introduction

Video classification is a machine learning task that categorizes video scenes into classes by assigning labels based on the video content. A video classification pipeline takes a video file as input and generates a prediction from a list of predefined classes. The process is commonly used in surveillance systems to identify and classify specific actions or behaviors captured in the input video footage.

This article explains how to implement a video classification pipeline on a Vultr Cloud GPU server to identify human actions using the Video Masked Autoencoder (VideoMAE) model pre-trained on the Kinetics 400 dataset. You will load input videos, sample their frames, feed the frames into the VideoMAE image processor and classification model, and get the predicted action class for each input video clip.

Prerequisites

Before you begin:

Set Up the Server

Follow the steps below to set up the server by installing all necessary dependency packages, downloading sample videos, and moving them to the Notebook directory to use as the input files for the video classification tasks. An optional environment check appears at the end of this section.

  1. Install all dependency library packages using Pip.

    console
    $ pip install transformers pytorchvideo opencv-python matplotlib ipywidgets gdown
    

    The above command installs the following packages on the server.

    • transformers: Offers APIs and pipelines for downloading and using pretrained transformer-based models. The VideoMAE image processor and classification model are part of the transformers library.
    • pytorchvideo: Provides reusable, modular, and efficient components for running video understanding tasks. VideoMAE internally uses torchvision functions to optimize the video classification process.
    • opencv-python: Provides a wide range of image processing functionalities. You will use OpenCV to load videos and extract frames.
    • matplotlib: Creates static or interactive visualizations.
    • ipywidgets: Provides interactive HTML widgets for Jupyter Notebook sessions using the IPython kernel.
    • gdown: Downloads files from public links such as Google Drive.
  2. Download the sample input files. For example, use the gdown module to download the videos.zip sample archive from a Google Drive link.

    console
    $ gdown 1lKgQaafhQFIv6db36Mav72pGBK8CVbYo
    

    You can use any other video files for classification in your session. For the purposes of this article, use the sample videos provided in the Google Drive URL.

  3. Extract sample videos from the videos.zip archive to the notebooks directory using the Jupyter user home path. By default, JupyterLab uses the jupyter user profile on a Vultr GPU Stack server.

    console
    $ sudo unzip videos.zip -d /home/jupyter/notebooks/
    
  4. View the JupyterLab logs and verify your unique browser access token.

    console
    $ cat /var/log/jupyterlab/lab.log
    

    Your output should look like the one below:

    To access the server, open this file in a browser:
        file:///home/jupyter/.local/share/jupyter/runtime/jpserver-1544-open.html
    Or copy and paste one of these URLs:
        http://localhost:9998/lab?token=7ab5a88b022
        http://127.0.0.1:9998/lab?token=7ab5a88b022
  5. Access your server IP on the Jupyter port 8888 with your access token using a web browser such as Firefox.

    https://SERVER-IP:8888/lab?token=7ab5a88b022
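
    After JupyterLab opens, you can optionally verify the environment in a new Notebook cell. The following is a minimal check, assuming PyTorch is already available on the Vultr GPU Stack image, as the later classification steps depend on it.

    python
    # Optional environment check: confirm the key libraries import and a GPU is visible.
    # Assumes PyTorch ships with the GPU Stack image; the classification steps rely on it.
    import torch
    import cv2
    import transformers

    print("transformers version:", transformers.__version__)
    print("OpenCV version:", cv2.__version__)
    print("CUDA available:", torch.cuda.is_available())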

Extract Sample Frames from a Video File

  1. Within the JupyterLab interface, select Notebook to create a new file that uses the Python3 kernel.

  2. In a new notebook code cell, import all required libraries.

    python
    import os
    import cv2
    import math
    import torch
    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image
    from transformers import AutoImageProcessor, VideoMAEForVideoClassification
    
  3. Press Shift + Enter to run the code cell and import libraries.

  4. Load the 4.mp4 sample video from the extracted files using the OpenCV VideoCapture function.

    python
    video = cv2.VideoCapture("videos/4.mp4")
    
  5. Define the number of sample frames to use with the VideoMAE model. The pre-trained Kinetics 400 VideoMAE model takes 16 frames sampled along the timeline as input.

    python
    sample_num = 16
    
  6. Get the total number of frames in the loaded video clip.

    python
    frame_count = video.get(cv2.CAP_PROP_FRAME_COUNT)
    
  7. Compute the indices of the sampled frames, evenly spaced across the entire video.

    python
    indices = np.linspace(0, frame_count - 1, sample_num).astype(int)
    
  8. Read the sampled frames into a list to use as inputs to the VideoMAE classification model.

    python
    frames = []
    for i in indices:
        video.set(cv2.CAP_PROP_POS_FRAMES, i)
        frame = video.read()[1]
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    

    In the above code:

    • video.set: Positions the video at the specific frame based on the frame index.
    • video.read()[1]: Returns the raw image data in the BGR channel order used by the OpenCV package.
    • Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)): Converts the image data from BGR order to the standard RGB order for processing by VideoMAE.
  9. Create a new figure to visualize the sampled frames using Matplotlib.

    python
    plt.figure(figsize=(60, 60))
    
  10. Define the number of frames to display per row, for example 4, and calculate the total number of rows.

    python
    frame_display_per_row = 4
    frame_display_rows = math.ceil(len(frames) / frame_display_per_row)
    
  11. Visualize and view the sample frames.

    python
    for i, frame in enumerate(frames):
        plt.subplot(frame_display_rows, frame_display_per_row, i+1)
        plt.imshow(frame)
        plt.xticks([])
        plt.yticks([])
    plt.tight_layout()
    

    Verify that all sampled images display as a series in your notebook. You will feed these sample frames to the VideoMAE model for classification.

    sample frames

  12. Define a new function that returns all sampled frames to simplify the downstream steps, as shown in the usage sketch after the function.

    python
    def get_sample_frames(video):
        frame_count = video.get(cv2.CAP_PROP_FRAME_COUNT)
        indices = np.linspace(0, frame_count - 1, sample_num).astype(int)
        frames = []
        for i in indices:
            video.set(cv2.CAP_PROP_POS_FRAMES, i)
            frame = video.read()[1]
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        return frames
    
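
    The following is a minimal usage sketch of the helper, reopening the 4.mp4 sample clip loaded earlier; any other file in the videos/ directory works the same way.

    python
    # Example usage: reopen the sample clip and extract 16 evenly spaced frames.
    video = cv2.VideoCapture("videos/4.mp4")
    frames = get_sample_frames(video)
    video.release()

    print(len(frames))     # 16 sampled frames
    print(frames[0].size)  # (width, height) of the first frame as a PIL image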

Perform Video Classification using VideoMAE

  1. Load the VideoMAE image processor pretrained on the Kinetics 400 dataset.

    python
    image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
    
  2. Load the VideoMAE video classification model pretrained on the same dataset.

    python
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
    
  3. Preprocess the frames by resizing, scaling, and normalizing the raw frame data using the VideoMAE image processor.

    python
    inputs = image_processor(frames, return_tensors="pt")
    
  4. Get the classification model output using the preprocessed inputs.

    python
    outputs = model(**inputs)
    
  5. Extract the logits from the model outputs.

    python
    logits = outputs.logits
    
  6. Get the indices of the top 3 predicted classes among all 400 predefined classes in the Kinetics 400 dataset.

    python
    topk = 3
    predicted_indices = logits.topk(topk).indices[0]
    
  7. Get the normalized prediction probability for each class using softmax.

    python
    predicted_prob = torch.nn.functional.softmax(logits, dim=1)[0]
    
  8. Check the top-1 predicted label and its probability.

    python
    pred_idx = predicted_indices[0].item()
    print(f"{model.config.id2label[pred_idx]}: {float(predicted_prob[pred_idx])}")
    

    Output:

    reading book: 0.4024769365787506

    Based on the above prediction output, the classification aligns well with the content of the input video.

  9. Define a new function that performs video classification using the sampled video frames as input and outputs the top-k predicted classes with the highest probabilities.

    python
    def predict_video(frames, topk=3):
        inputs = image_processor(frames, return_tensors="pt")
        outputs = model(**inputs)
        logits = outputs.logits
    
        predicted_indices = logits.topk(topk).indices[0]
        predicted_prob = torch.nn.functional.softmax(logits, dim=1)[0]
    
        labels, probs = [], []
        for idx in predicted_indices:
            idx = idx.item()
            labels.append(model.config.id2label[idx])
            probs.append(predicted_prob[idx])
        return labels, probs
    

    You can also use the above function to scale up the video classification operation to a list of videos, as demonstrated in the next section. The sketch below shows a single call to the function.
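
    The following sketch reuses the frames sampled earlier from the 4.mp4 clip and prints the top 3 predictions with their probabilities.

    python
    # Example usage: classify the previously sampled frames and print the top-k predictions.
    labels, probs = predict_video(frames, topk=3)
    for label, prob in zip(labels, probs):
        print(f"{label}: {float(prob):.4f}")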

Scale Up and Visualize Video Classification on Multiple Videos

  1. Define a new function to display a video frame alongside its top-k predicted classes and the probability of each.

    python
    def plot_result(cnt, frame, filename, labels, probs):
        plt.subplot(3, 4, 2 * cnt + 1)
        plt.imshow(frame)
        plt.title(filename)
        plt.axis("off")
    
        plt.subplot(3, 4, 2 * cnt + 2)
        plt.grid()
        y = np.arange(topk)
        plt.barh(y, torch.as_tensor(probs))
        plt.gca().invert_yaxis()
        plt.gca().set_axisbelow(True)
        plt.yticks(y, labels)
        plt.xlabel("probability")
    

    The above function displays each video frame next to a horizontal bar chart of its top-k predicted classes. Its parameters are the plot index cnt, the raw sampled image data frame, the video filename, and the predicted top-k labels with the probs for each label.

  2. Set up the directory path that stores your sample video files. For example, use videos/ in the Jupyter user home directory.

    python
    root = "videos/"
    
  3. Create a new Matplotlib figure to visualize the video classification results for all available videos in the directory.

    python
    plt.figure(figsize=(60, 60))
    

    Output:

    <Figure size 6000x6000 with 0 Axes>
  4. Visualize the results.

    python
    for i, filename in enumerate(sorted(os.listdir(root))):
        file_path = os.path.join(root, filename)
        video = cv2.VideoCapture(file_path) 
        frames = get_sample_frames(video)
        labels, probs = predict_video(frames, topk)
        plot_result(i, frames[0], filename, labels, probs)
    
    plt.subplots_adjust(wspace=0.9)
    plt.show()
    

    Verify that the visualization results display in your session with all videos classified into the most suitable classes. A batched inference sketch follows the results image below.

    video classification results
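
    If you classify many clips, you can reduce the number of model calls by batching several videos into a single forward pass. The following is a minimal sketch, assuming the VideoMAE image processor accepts a list of videos (each given as a list of PIL frames) and returns a batched pixel_values tensor; it also wraps inference in torch.no_grad() to skip gradient tracking.

    python
    # A batched variant of predict_video (assumption: the image processor accepts a
    # list of videos, each given as a list of PIL frames, and batches them).
    def predict_videos_batched(frame_lists, topk=3):
        inputs = image_processor(frame_lists, return_tensors="pt")
        with torch.no_grad():                # inference only, no gradients needed
            logits = model(**inputs).logits  # shape: (num_videos, num_classes)
        probs = torch.nn.functional.softmax(logits, dim=1)
        top = probs.topk(topk, dim=1)
        results = []
        for prob_row, idx_row in zip(top.values, top.indices):
            results.append([(model.config.id2label[i.item()], float(p))
                            for p, i in zip(prob_row, idx_row)])
        return results

    # Example: classify every video in the directory with one forward pass.
    batch = [get_sample_frames(cv2.VideoCapture(os.path.join(root, f)))
             for f in sorted(os.listdir(root))]
    print(predict_videos_batched(batch))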

Conclusion

You have set up a video classification pipeline using VideoMAE on a Vultr Cloud GPU server. You prepared the sample input by extracting evenly spaced frames from each video's timeline. Then, you applied the input frames to the VideoMAE image processor and video classification model, and scaled the pipeline to a list of video files. For more information, visit the VideoMAE model page.

Download the model pipeline Jupyter Notebook file to view the implementations applied in this article.