Extract Tables from Images on Vultr Cloud GPU

Updated on April 22, 2024

Introduction

Tables are commonly used to represent structured data in print and digital documents. Manually detecting, replicating, and recreating tables from graphical files is time-consuming and resource-intensive. However, by leveraging machine learning (ML) packages such as YOLO, you can automate the table detection and extraction process using the latest deep learning algorithms.

This article describes how to extract tables from images using the YOLO package on a Vultr Cloud GPU server.

Prerequisites

Before you start:

  • Deploy a Vultr Cloud GPU server.
  • Access the server and open a JupyterLab session in your browser.

Set Up the Table Detection Model

Follow the steps below to perform table detection and generate candidate data tables with bounding boxes from an existing image or scanned document. To enable the model to process images, install all necessary dependency packages and import a sample image in a new Jupyter Notebook session.

  1. Click Notebook within the JupyterLab interface and select Python 3 to create a new notebook that uses the Python 3 kernel.

  2. In a new code cell, install the required deep-learning dependency packages.

    python
    !pip install --ignore-installed ultralyticsplus==0.0.28 ultralytics==8.0.43 "paddleocr>=2.0.1" PyMuPDF==1.21.1 paddlepaddle==2.6.0 "numpy<1.24"
    

    The above command installs the ultralyticsplus, ultralytics, paddleocr, PyMuPDF, and paddlepaddle packages on your server to enable the YOLO (You Only Look Once) real-time object detection algorithm and OCR text recognition. Below is the task performed by each dependency package:

    • ultralytics: Installs the YOLOv8 object detection framework, pinned to version 8.0.43 above.
    • ultralyticsplus: Imports Hugging Face utilities for use with the Ultralytics/YOLOv8 package.
    • PaddleOCR: Recognizes and extracts text from available images and documents.
    • paddlepaddle: Enables the parallel distributed deep learning framework.
    • PyMuPDF and numpy: Enable additional PaddleOCR recognition functionalities.
  3. Press Shift + Enter to run the notebook cell.

  4. Import all necessary libraries from the dependency packages.

    python
    from ultralyticsplus import YOLO, render_result
    from paddleocr import PPStructure
    from PIL import Image
    import pandas as pd
    import numpy as np
    import requests
    
  5. Define the image variables and load a sample image that contains table data. Replace the image_url value with your source image URL.

    python
    image_url = 'https://docs.vultr.com/public/doc-assets/new/extract-tables-from-images-on-vultr-cloud-gpu/5a6f42cf6a573941.png'
    image_data = requests.get(image_url, stream=True, headers={'User-agent': 'vultr-demo'}).raw
    img = Image.open(image_data)
    

    The above code downloads a sample research screenshot image from a public URL, sending a vultr-demo HTTP User-Agent header value. To use a local image file instead, skip the requests call and open the file directly with Pillow, as shown below.
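
    A minimal sketch for loading a local file, assuming a hypothetical path /path/to/table-image.png:

    python
    # Open a local image file directly with Pillow instead of downloading it.
    # Replace /path/to/table-image.png with your actual file path.
    img = Image.open('/path/to/table-image.png')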

  6. View the loaded image.

    python
    display(img)
    

    Based on your source image, verify that the image loads correctly in your session and is ready for detection.

    Display the base table image

  7. Load the YOLO table detection model.

    python
    model = YOLO('keremberke/yolov8s-table-extraction')
    

    To improve the table detection results, you can adjust the YOLO model parameters, such as the confidence threshold and IoU threshold, to balance detection accuracy against processing time, as shown in the sketch below.
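
    For example, a minimal sketch that applies common ultralyticsplus override values. The specific values below are illustrative defaults, not tuned settings:

    python
    model.overrides['conf'] = 0.25           # NMS confidence threshold
    model.overrides['iou'] = 0.45            # NMS IoU threshold
    model.overrides['agnostic_nms'] = False  # class-agnostic NMS
    model.overrides['max_det'] = 1000        # maximum detections per image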

  8. Perform table detection and display the detected table data results.

    python
    results = model.predict(img)
    render = render_result(model=model, image=img, result=results[0])
    render.show()
    

    Based on the imported image structure, the model displays the detection result with a rectangular bounding box around the target table data.

    An image with detected table contents
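
    Optionally, print the detection confidence scores to review how certain the model is about each detected table; a minimal sketch:

    python
    # Review the confidence score of each detected table (one value per box).
    print(results[0].boxes.conf)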

Extract the Table Contents to CSV Files

Follow the steps below to extract the table contents detected by the YOLO model into a supported spreadsheet format, such as a standalone .csv file, for storage on your server.

  1. Load the PP-Structure engine and set your table content language. Replace the lang value with your target data language code. For example, en for English.

    python
    table_engine = PPStructure(lang="en")
    
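    Optionally, you can verify the engine before cropping by running it on the full image and listing the detected region types; a minimal sketch:

    python
    # Each PP-Structure result entry is a dictionary with a 'type' key,
    # such as 'table' or 'text', and a 'bbox' key with region coordinates.
    for region in table_engine(np.array(img)):
        print(region['type'], region['bbox'])
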
  2. Retrieve the bounding boxes as x and y coordinates of each detected table's top-left and bottom-right points.

    python
    boxes = results[0].boxes.xyxy
    
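    Optionally, print the coordinates to confirm the detected regions; a minimal sketch:

    python
    # Each row holds the [x1, y1, x2, y2] pixel coordinates of one detected
    # table: top-left corner (x1, y1) and bottom-right corner (x2, y2).
    print(boxes)
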
  3. Using a for loop, crop the table images from the original document and pass them to the OCR engine to extract each table to a standalone .csv file.

    python
    for i, box in enumerate(boxes):
        # Crop the detected table region from the source image.
        crop_img = img.crop([coord.item() for coord in box])
        # Run the PP-Structure engine on the cropped region to rebuild the table as HTML.
        res = table_engine(np.array(crop_img))
        # Parse the HTML table into a DataFrame and export it to a CSV file.
        df = pd.read_html(res[0]["res"]["html"])[0]
        df.to_csv(f"table{i}.csv", index=False)
    

    The above for loop crops all detected tables from your base image and creates a new .csv file for each, named using the loop index i, such as table0.csv.
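
    You can also save each cropped table as a standalone image for visual review; a minimal sketch that extends the loop above:

    python
    # Save each detected table region as a separate PNG file, such as table0.png.
    for i, box in enumerate(boxes):
        img.crop([coord.item() for coord in box]).save(f"table{i}.png")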

  4. View the generated .csv file contents within your notebook session.

    python
    pd.read_csv("table0.csv")
    

    Based on your input image, the extracted table contents should display in your session. Replace table0.csv with your actual generated filename. Additional files generated by the OCR engine follow a sequential naming scheme, such as table1.csv and table2.csv.

    View the converted csv file table data

  5. Right-click a generated .csv file in the JupyterLab file browser and select Download to save a copy of the file.

    Download the generated table data csv files in Jupyter Notebook

Conclusion

You have extracted table data from images using the YOLO and PP-Structure computer vision models on a Vultr Cloud GPU server. When paired with up-to-date models, this workflow can extract tables from large data files, scanned documents, and image-based documents. For more information about the model parameters, visit the PaddleOCR documentation.