Information extraction has been one of the most prominent technologies developed in the last decade, and enterprises are using it to automate a variety of tasks. Today, businesses use object detection and classification solutions built on deep learning models and frameworks to extract information.
Detecting objects in an image or video is a complex task for computers, unlike for humans. Object detection is the identification of a required, specific object within a complex image or video.
Object classification, on the other hand, is classifying the identified image object based on trained labels. Most of the popular architectures to detect or classify such objects were built using Convolutional Neural Networks.
YOLOv5 model for information extraction
Several solutions have been developed to assist computers in detecting objects. This blog focuses on using YOLOv5 to extract non-textual information. With YOLOv5, we will detect the checkboxes in a predefined template/form (scanned, system generated, etc.) and classify the form according to whether the checkbox is ticked or not ticked. The blog will also highlight how precise the model's classifications are.
YOLO, which stands for You Only Look Once, is one of the most widely used object detection and localization architectures. As the name implies, the model looks at the image only once, predicting the bounding boxes and classes in a single pass.
Jupyter Notebook and labelImg are the tools used to build the YOLOv5 model to detect checkboxes in the forms. Below is the high-level YOLOv5 architecture.
The following Python libraries are required: “torch”, to use the YOLOv5 architecture, which runs on PyTorch; “matplotlib”, to view the detected objects (results) with their bounding boxes; and “cv2” (opencv-python), to open images for locating bounding boxes and detecting the required objects.
- Forms in PDF format were converted to images, and these images were fed to the YOLO model.
- The checkbox in the form is the object that needs to be detected.
- The train and validation image folders each have matching label files that hold the bounding box coordinates.
- Labels for images were created using labelImg.
- To build this model, a total of 696 train images and 231 validation images were used.
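For reference, YOLOv5 reads the location of the train and validation folders and the class names from a small YAML file. The sketch below builds the text of such a file. The folder names, the dataset root `datasets/checkbox`, and the class names `no_tick` and `tick` are assumptions made for illustration, not the values used in this project.

```python
def dataset_yaml(root, class_names):
    """Return the text of a YOLOv5-style dataset config.

    YOLOv5 looks for label files by swapping 'images' for 'labels'
    in each image path, so only the image folders are listed here.
    """
    lines = [
        f"path: {root}",            # dataset root (hypothetical)
        "train: images/train",      # training images, relative to root
        "val: images/val",          # validation images, relative to root
        f"nc: {len(class_names)}",  # number of classes
        f"names: {class_names}",    # class index -> name, in order
    ]
    return "\n".join(lines) + "\n"


# Class 0 = no tick, class 1 = tick, matching the labelling used here.
config_text = dataset_yaml("datasets/checkbox", ["no_tick", "tick"])
print(config_text)
```

Saving this text as a `.yaml` file gives the training script a single place to find the images and labels.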
Steps to perform the task using YOLOv5:
- 1. Clone the YOLOv5 repository from GitHub.
- 2. Import all the required libraries.
- 3. Train and wrap the model.
The sample images above show the area of interest, which is one part of the template. This is the checkbox used to classify the form for further steps. The corresponding label data is also shown. Each label row contains the class of the object (no tick/tick) followed by four bounding box values (listed here as x1, y1, x2, y2; in labelImg's YOLO export these are the normalized box center x, center y, width and height). Class 0 is assigned to checkboxes that have no tick mark and Class 1 to checkboxes that are ticked.
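If an annotation is available as pixel corner coordinates, it has to be converted to the normalized center format that YOLOv5 label files use. The following is a minimal sketch of that conversion; the box position and the 640 x 640 image size are hypothetical values chosen for illustration.

```python
def corners_to_yolo(cls, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel corner box to a YOLO label line.

    Output format: "<class> <x_center> <y_center> <width> <height>",
    with all four box values divided by the image size (range 0 to 1).
    """
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{cls} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A ticked checkbox (class 1) occupying a 40 x 40 pixel square.
label_line = corners_to_yolo(1, 100, 200, 140, 240, img_w=640, img_h=640)
print(label_line)  # 1 0.187500 0.343750 0.062500 0.062500
```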
In this manner, all the images are considered as ‘X’ and their corresponding labels are considered as ‘Y’ by the model. Each pixel of the image is considered a feature.
Each image is sized to 640 x 640 pixels, which results in 640*640 features. The image matrix holds a pixel value in each cell, and a 3*3 filter matrix slides across it to produce a smaller n*n matrix, known as a feature map, in which each value describes a local pattern such as an edge. The feature maps produced by one layer are passed on as input to the next. At this stage, the image annotations are read and augmentations such as mosaic augmentation are applied, and features such as pixel values and edges are extracted.
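To make the filter step concrete, here is a small sketch of a 3*3 filter sliding over an image matrix. It assumes a square image, a stride of 1 and no padding, and it uses a tiny 5 x 5 image with a hand-picked vertical-edge filter rather than anything learned by YOLOv5.

```python
def convolve2d(image, kernel):
    """Slide a k x k filter over a square image (stride 1, no padding)."""
    k = len(kernel)
    out_size = len(image) - k + 1  # a 5 x 5 image and 3 x 3 filter give 3 x 3
    feature_map = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Multiply the filter with the patch under it and sum the result.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Dark columns (0) on the left, bright columns (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Responds strongly where brightness increases from left to right.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The large values mark where the filter sits on the dark-to-bright edge, and the zeros mark the uniform region.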
In the YOLO model, each grid cell holds the probability that it contains an object, the coordinates of the bounding box (x, y, width, and height), and a value for each class.
Considering the checkbox image, the first cell (marked in blue) has no object, so its object probability is 0 and all other information related to bounding boxes is ignored. The cell that contains a checkbox (marked in yellow), on the other hand, has an object probability of 1 and its bounding box values are recorded. In the image above, the checkbox has no tick, so the no tick class is recorded as 1 and the tick class is recorded as 0.
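The per-cell values described above can be sketched as a simple list. This is a simplified illustration of the idea, not YOLOv5's actual output layout (which also involves anchor boxes at several scales), and the box values used are hypothetical.

```python
def encode_cell(has_object, box=None, class_id=None, num_classes=2):
    """Build the target for one grid cell.

    Layout: [object probability, x, y, width, height, class 0, class 1].
    """
    if not has_object:
        # Nothing in this cell: the box and class values are ignored.
        return [0.0] * (5 + num_classes)
    x, y, w, h = box
    one_hot = [1.0 if i == class_id else 0.0 for i in range(num_classes)]
    return [1.0, x, y, w, h] + one_hot


# Blue cell: no checkbox.
empty_cell = encode_cell(False)
# Yellow cell: a checkbox with no tick (class 0).
checkbox_cell = encode_cell(True, box=(0.5, 0.5, 0.2, 0.2), class_id=0)

print(empty_cell)     # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(checkbox_cell)  # [1.0, 0.5, 0.5, 0.2, 0.2, 1.0, 0.0]
```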
Custom weights are required because the released YOLOv5 weights were pre-trained on the COCO dataset, which has no checkbox classes. This approach is called transfer learning: the first few layers have already learned what kind of features to extract and what to look for in an image. The pre-trained weights are fine-tuned on the checkbox dataset, and the best weights from that training give a model customized to checkboxes.
The customized model was trained for 30 epochs, meaning the entire training dataset was passed through the model 30 times, with a batch size of 1. The weights generated at the best epoch (least loss value) are kept as the best weights.
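As a rough sketch of this setup, the snippet below assembles a command line for the `train.py` script in the YOLOv5 repository and counts how many batches the run processes. The 640 image size, 30 epochs, batch size of 1 and 696 training images come from this post; the file name `checkbox.yaml` and the starting weights `yolov5s.pt` are assumptions, since the post does not say which model size was used.

```python
import math


def train_command(data_yaml, weights, img_size=640, batch_size=1, epochs=30):
    """Assemble the arguments for YOLOv5's train.py (not executed here)."""
    return [
        "python", "train.py",
        "--img", str(img_size),
        "--batch", str(batch_size),
        "--epochs", str(epochs),
        "--data", data_yaml,      # dataset config (hypothetical file name)
        "--weights", weights,     # pre-trained starting point (assumed)
    ]


def total_batches(num_train_images, batch_size, epochs):
    """Number of batches processed over the whole training run."""
    return math.ceil(num_train_images / batch_size) * epochs


cmd = train_command("checkbox.yaml", "yolov5s.pt")
print(" ".join(cmd))
print(total_batches(696, 1, 30))  # 20880
```

With a batch size of 1, every training image is its own batch, so 696 images over 30 epochs means 20,880 batches.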
- 4. Model Testing
When a raw image is passed as input, YOLOv5 detects where the checkbox is located, returns its bounding box coordinates, and captures its value. The image above shows the detected checkbox, and the image below shows the data frame of coordinates and the value. Here, the checkbox is ticked.
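The data frame of detections can then be reduced to a single tick/no tick decision. The sketch below assumes each row carries the columns that YOLOv5's `results.pandas().xyxy[0]` provides (xmin, ymin, xmax, ymax, confidence, class, name) and represents the rows as plain dictionaries. The sample rows, the class names and the rule of keeping the most confident detection are illustrative assumptions.

```python
def classify_form(detections, min_confidence=0.5):
    """Return the tick status from the most confident checkbox detection."""
    confident = [d for d in detections if d["confidence"] >= min_confidence]
    if not confident:
        return None  # no checkbox found with enough confidence
    best = max(confident, key=lambda d: d["confidence"])
    return {
        "ticked": best["class"] == 1,  # class 1 = tick, class 0 = no tick
        "confidence": best["confidence"],
        "box": (best["xmin"], best["ymin"], best["xmax"], best["ymax"]),
    }


# Hypothetical detections for one form image.
rows = [
    {"xmin": 102.0, "ymin": 198.5, "xmax": 141.0, "ymax": 239.0,
     "confidence": 0.82, "class": 1, "name": "tick"},
    {"xmin": 300.0, "ymin": 50.0, "xmax": 320.0, "ymax": 70.0,
     "confidence": 0.31, "class": 0, "name": "no_tick"},
]

result = classify_form(rows)
print(result)
```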
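YOLOv5 is evaluated here using metrics/mAP_0.5 and metrics/mAP_0.5:0.95, that is, the mean average precision at an intersection-over-union (IoU) threshold of 0.5, and the mean average precision averaged over IoU thresholds from 0.5 to 0.95.
For the model that was built, at the epoch with the least validation loss:
- 1. the metrics/mAP_0.5 value is 0.995, and
- 2. the metrics/mAP_0.5:0.95 value is 0.74839.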
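Both metrics depend on intersection over union (IoU), the overlap between a predicted box and the labelled box divided by their combined area. A prediction counts as correct at a given threshold only if its IoU reaches that threshold. The sketch below computes IoU for two hypothetical boxes and lists the ten thresholds from 0.5 to 0.95; it illustrates the idea and is not the evaluation code used by YOLOv5.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Thresholds 0.50, 0.55, ..., 0.95 used for mAP_0.5:0.95.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

labelled = (0, 0, 10, 10)
predicted = (1, 0, 11, 10)  # shifted one pixel to the right
overlap = iou(labelled, predicted)
passed = [t for t in thresholds if overlap >= t]

print(round(overlap, 3))  # 0.818
print(passed)             # thresholds this prediction satisfies
```

A box that is only slightly misplaced passes the 0.5 threshold easily but fails the stricter ones, which is why mAP_0.5:0.95 is usually lower than mAP_0.5.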
With a total of 800+ images used in training and validation, around 100 images had their checkbox ticked and the remaining 500+ images had no tick.
The above image is a confusion matrix which shows that each image is predicted with 100% accuracy.
The above graph is the F1 curve. The confidence threshold that gives the best balance between precision and recall (the highest F1 score) is 0.737.
The precision-recall (PR) curve above shows the trade-off between precision and recall. The curve is close to a perfect rectangle, which indicates that precision stays high across nearly all recall levels and that the checkboxes are classified correctly.
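The F1 score is the harmonic mean of precision and recall, and the best confidence threshold is the one where that score peaks. The sketch below shows the calculation on three hypothetical (confidence, precision, recall) points; these are not the values behind the graph above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_confidence(points):
    """Pick the confidence threshold with the highest F1 score.

    points: list of (confidence, precision, recall) tuples.
    """
    return max(points, key=lambda p: f1_score(p[1], p[2]))[0]


# Hypothetical points along an F1 curve.
curve = [
    (0.25, 0.80, 1.00),  # low threshold: everything found, some false alarms
    (0.50, 0.95, 0.98),
    (0.75, 1.00, 0.90),  # high threshold: no false alarms, some misses
]

print(best_confidence(curve))  # 0.5
```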
In the figure above, the image annotated with labelImg (on the left) and the result predicted by YOLOv5 (on the right) show that the model can locate and classify the checkbox appropriately. The model can also predict on live production forms for the trained template. Below are a few consolidated samples that were classified by the model in production.
Overall, these production samples show that the model can predict both the checkbox area and the tick/no tick status of the checkbox in all of these cases.
The model identifies both the tick and no tick scenarios with a probability of around 0.8. This helped to attain more than 96% accuracy in production, as all the production files were predicted with a probability greater than 0.5 and the correct class.
Below is the high-level procedure used to build the YOLO model, using all the available forms of the standard template so as to cover its different variations (scanned, system generated, scanned system-generated, blurred scans, etc.):
- a. Annotated the images with the labelImg tool.
- b. Trained the YOLO model on the training files and tested it on the validation files.
- c. Created an API endpoint to consume the production files and populate the classification result.
We were able to achieve around 96% accuracy with the YOLOv5 model for checkbox classification. Thus, the YOLOv5 model is effective at detecting checkbox tick marks and classifying the forms.
Such YOLO models are beneficial in automating the manual inspection process. This gives a seamless user experience with immediate processing and an expected level of accuracy once the user uploads the documents.