Extract Tables from Images on Vultr Cloud GPU

Updated on July 25, 2024

Introduction

Tables are commonly used to represent structured data in print and display documents. Manually detecting, replicating, and recreating tables from graphical files is time-consuming and resource-intensive. However, by leveraging Machine Learning (ML) packages such as YOLO, you can automate the table detection and extraction process using the latest deep learning algorithms.

This article describes how to extract tables from images using the YOLO package on a Vultr Cloud GPU server.

Prerequisites

Before you start:

  • Deploy a Vultr Cloud GPU server.
  • Access a JupyterLab environment on the server to run the notebook steps in this article.

Set Up the Table Detection Model

Follow the steps below to perform table detection and generate candidate data tables with bounding boxes from an existing image or scanned document. To prepare the model, install all necessary dependency packages and import a sample image in a new Jupyter Notebook session.

  1. Click Notebook within the JupyterLab interface and select Python3 to create a new notebook with a Python 3 kernel.

  2. In a new code cell, install the required deep-learning dependency packages.

    python
    !pip install --ignore-installed ultralyticsplus==0.0.28 ultralytics==8.0.43 "paddleocr>=2.0.1" PyMuPDF==1.21.1 paddlepaddle==2.6.0 "numpy<1.24"
    

    The above command installs the ultralyticsplus, ultralytics, and paddleocr packages on your server to enable the YOLO (You Only Look Once) real-time object detection algorithm. Below are the tasks performed by each dependency package:

    • ultralytics: Installs the YOLOv8 object detection package.
    • ultralyticsplus: Imports Hugging Face utilities for use with the Ultralytics/YOLOv8 package.
    • PaddleOCR: Recognizes and extracts text from available images and documents.
    • paddlepaddle: Enables the parallel distributed deep learning framework.
    • PyMuPDF and numpy: Enable additional PaddleOCR recognition functionalities.
  3. Press Shift + Enter to run the notebook cell.

  4. Import all necessary libraries from the dependency packages.

    python
    from ultralyticsplus import YOLO, render_result
    from paddleocr import PPStructure
    from PIL.Image import open  # PIL's Image.open; note that this shadows the built-in open()
    import pandas as pd
    import numpy as np
    import requests
    
  5. Define new image variables and import a sample image that contains table data. Replace the image_url value with your source image URL.

    python
    image_url = 'https://docs.vultr.com/public/doc-assets/1766/91ba4e6c2f211b77.png'
    image_data = requests.get(image_url, stream=True, headers={'User-agent': 'vultr-demo'}).raw
    img = open(image_data)
    

    The above code imports a sample research screenshot image from a public URL and sends a vultr-demo HTTP User-agent header value. To use a local image file instead, skip the requests call and open the file directly from its path, as shown in the sketch below.
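
    A minimal sketch for loading a local image, assuming a hypothetical filename table-scan.png in the notebook's working directory:

    python
    img = open('table-scan.png')  # PIL.Image.open, imported above as open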

  6. View the loaded image.

    python
    display(img)
    

    Based on your source image, verify that the image loads correctly in your session and is ready for detection.

    Display the base table image

  7. Load the YOLO table detection model.

    python
    model = YOLO('keremberke/yolov8s-table-extraction')
    

    To improve the table detection process, adjust the YOLO model parameters, such as the confidence threshold and the IoU threshold, to balance detection accuracy against processing time, as shown in the sketch below.

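    A minimal sketch of common parameter overrides, following the ultralyticsplus convention; the values below are illustrative starting points, not tuned settings:

    python
    model.overrides['conf'] = 0.25           # NMS confidence threshold (illustrative value)
    model.overrides['iou'] = 0.45            # NMS IoU threshold (illustrative value)
    model.overrides['agnostic_nms'] = False  # class-agnostic NMS
    model.overrides['max_det'] = 1000        # maximum detections per image
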
  8. Perform table detection and display the detected table data results.

    python
    results = model.predict(img)
    render = render_result(model=model, image=img, result=results[0])
    render.show()
    

    Based on the imported image structure, the model displays the detection result with a rectangular bounding box around the target table data. To inspect the detections programmatically, see the sketch after the result image below.

    An image with detected table contents
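
    A quick sketch to inspect the raw detections using the ultralytics results API; the boxes attribute holds one entry per detected table:

    python
    print(f"Detected {len(results[0].boxes)} table(s)")
    print(results[0].boxes.conf)  # confidence score for each detection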

Extract the Table Contents to CSV Files

Follow the steps below to extract the table contents detected with the YOLO model to a supported spreadsheet format such as .csv, stored as standalone files on your server.

  1. Load the PaddleOCR PP-Structure engine to recognize your table contents. Replace the lang value with your target data language code. For example, en for English.

    python
    table_engine = PPStructure(lang="en")
    
  2. Generate the bounding boxes in the form of x and y coordinates for the detected top-left and bottom-right points.

    python
    boxes = results[0].boxes.xyxy
    
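    Optionally, print the tensor to verify the detected coordinates; each row holds one table as [x_min, y_min, x_max, y_max]:

    python
    print(boxes)  # one [x_min, y_min, x_max, y_max] row per detected table
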
  3. Crop the table images from the original document and pass them to the OCR engine to extract standalone .csv files using a for loop.

    python
    from io import StringIO  # wrap the HTML string for pandas.read_html

    for i, box in enumerate(boxes):
        # Crop the detected table region from the base image
        crop_img = img.crop(tuple(coord.item() for coord in box))
        # Run PaddleOCR structure recognition on the cropped table
        res = table_engine(np.array(crop_img))
        # Parse the recognized HTML table and save it as a CSV file
        df = pd.read_html(StringIO(res[0]["res"]["html"]))[0]
        df.to_csv(f"table{i}.csv", index=False)
    

    The above for loop crops all detected tables from your base image and creates a new file for each one named after the i variable value, such as table0.csv.

  4. View the generated .csv file contents within your notebook session.

    python
    pd.read_csv("table0.csv")
    

    Based on your input image, the extracted table contents should display in your session. Replace table0.csv with your actual generated filename. Additional files generated by the OCR engine follow a sequential naming scheme, such as table1.csv and table2.csv; to list every generated file, see the sketch after the screenshot below.

    View the converted csv file table data
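
    A quick sketch to list all generated table files in the notebook's working directory, using only the standard library:

    python
    import glob
    print(sorted(glob.glob('table*.csv')))  # for example, ['table0.csv', 'table1.csv']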

  5. Right-click a generated .csv file in the JupyterLab file browser and select Download to save a copy of the file.

    Download the generated table data csv files in Jupyter Notebook

Conclusion

You have extracted table data from images using the YOLO and PPOCR computer vision models on a Vultr Cloud GPU server. You can apply the same workflow to extract tables from large data files, scanned documents, and other image-based documents. For more information about the model parameters, visit the PaddleOCR documentation.