Apply computer vision to a questionnaire image to detect ticked checkboxes

Loichau
9 min read · Jul 28, 2020


Inputting data can be a tedious task that wastes hours of our time. What if we could apply computer vision to detect ticked checkboxes, then convert the results to a DataFrame in the most convenient way? This idea popped up when I was inputting data from a list of 200 different questionnaires, each more than 8 pages per participant.

This article will go through steps including:

  1. Scan document
  2. Detect checkboxes
  3. Mask regions outside the checkboxes
  4. Detect checkboxes belonging to one question (or in a same row)
  5. Adjust size of contours
  6. Pull features of every checkbox for prediction
  7. Predict checkboxes that are ticked

___________________________________________________________________

  1. Scan document

This step is not very challenging, and you can find plenty of references for it. For simplicity, I applied the method from Adrian Rosebrock's article “How to Build a Kick-Ass Mobile Document Scanner in Just 5 Minutes”. If you want to explore it further, please click the link below.

To scan the document, I first found the edges in the image by converting the BGR image to grayscale, blurring it with a Gaussian filter, and then applying Canny edge detection.

Secondly, I looped over all contours, keeping only those that approximate to 4 points (representing the four corners of the paper) and whose area is also large.

Finally, to obtain a “bird's-eye view” image, I applied a perspective transform with a proper threshold. Since Adrian Rosebrock explains this step better than I can, I strongly recommend you have a look at his work.

Edge detection and bird's-eye-view conversion

2. Detect checkboxes

According to my research (mainly on StackOverflow), there are two solutions for detecting a checkbox.

The first is to measure the ratio of length to width (or vice versa) of every detected contour in the image. By doing so, you would obtain a square checkbox, since the four sides of a square are equal. However, it is better to set a threshold for the ratio between 0.8 and 1.2, because if you are too strict, you will end up losing checkboxes. With the threshold between 0.8 and 1.2, I lost 1 checkbox in the image below.

Formula to get box ratio
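The ratio filter can be expressed as a small helper. This is a sketch; only the 0.8–1.2 threshold comes from the article, the function name is mine.

```python
def is_square_box(w, h, lo=0.8, hi=1.2):
    """Aspect-ratio filter: treat a contour as a checkbox candidate
    only if width/height falls between lo and hi."""
    return h > 0 and lo <= w / float(h) <= hi
```

You would apply it to the `(x, y, w, h)` bounding rectangle of each contour returned by `cv2.boundingRect`.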

However, there is a chance that the ratio reflects a square shape while, in reality, the contour is not exactly square, as in the case below.

Bad contour vs good contour

The second approach is to detect all horizontal and vertical lines in the page, then use cv2.connectedComponentsWithStats to form square boxes.

However, the disadvantage is that this method may also pick up other rectangular shapes and rectangle-like noise besides the checkboxes.

Therefore, in my work, I combined the two methods: first detecting all vertical and horizontal lines to form square boxes, then using the aspect ratio of the contours as a filter to make sure each has a square shape.

I carried out the horizontal and vertical line detection by following Sreekiran's guidance in his StackOverflow post; the link to the original post is attached below. One difficulty of this step is finding optimal values for line_width and min_line_length that suit your image. In my case, I chose a line_width of 1 pixel and a min_line_length of 9 pixels, which worked almost best, detecting 95% of all checkboxes.

Don't forget to store the x, y, w, h of every contour for further investigation.

3. Mask regions outside the box

In this step, I applied cv2.fillPoly and cv2.bitwise_and to mask the regions outside the boxes.

The purpose of masking out the regions outside the boxes is mainly to serve the following step, in which we detect checkboxes belonging to one question (in the same row).

4. Detect checkboxes belonging to one question (or in a same row)

The motivation of this step is to take the mean pixel value (color code) row by row across the whole image, then use the find_peaks function from the scipy library to detect the peaks (questions) in the paper.

Before that, I applied Canny, erode, and dilate to the image to avoid the case where the two horizontal lines of a checkbox accidentally create two close peaks.

Yet, my result was too “sharp”, to the extent that the find_peaks function detected two peaks within a small range.

The optimal solution for this is to blur and flatten the row profile several times.
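The row-profile idea can be sketched like this; the window size, number of smoothing passes, and the `distance` parameter are assumptions to tune per image.

```python
import numpy as np
from scipy.signal import find_peaks

def row_peaks(masked, distance=10):
    """Mean pixel value per row of the masked image; after repeated
    box-filter smoothing, each surviving peak should correspond to one
    question row."""
    profile = masked.mean(axis=1)
    kernel = np.ones(5) / 5.0
    for _ in range(3):                      # blur and flatten several times
        profile = np.convolve(profile, kernel, mode="same")
    peaks, _ = find_peaks(profile, distance=distance)
    return peaks
```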

Since we know that our contours were created from four points, and our peak line always cuts through the belly of each contour, we can use the x, y, w, h coordinates of every contour to group different checkboxes into one question.

This method was tipped by Dan Mašek in his response to a StackOverflow question. If you are curious about it, please follow the link below.
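A simple way to express the grouping, assuming each box is an (x, y, w, h) tuple and `peaks` are the row indices found above (the function name is mine):

```python
def group_by_question(boxes, peaks):
    """Assign each box to the question whose peak line cuts through it,
    by taking the peak nearest to the box's vertical centre."""
    questions = {p: [] for p in peaks}
    for x, y, w, h in boxes:
        cy = y + h / 2.0
        nearest = min(peaks, key=lambda p: abs(p - cy))
        questions[nearest].append((x, y, w, h))
    return questions
```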

5. Adjust size of contours

Factors such as the surrounding environment (light, darkness, etc.) when we took the photo of the paper can affect the performance of the horizontal and vertical line detection step. As a result, some detected contours cover more area outside the checkbox than we want. Since this would affect our later prediction of whether a checkbox is ticked or not, we need to address this problem.

From the existing features x, y, w, h of all contours and the peak line, I applied two solutions that fix almost 99% of the erroneous contours.

Firstly, I strictly converted the width and height values of every contour to 13 (this value could differ depending on the size of the checkbox and the distance from the camera to the paper).

Secondly, I compared the distance between the y coordinate of each checkbox in the same question and the peak. If the distance is too large (>15), I set the y coordinate of that checkbox equal to the y coordinate of the checkbox with the smallest distance to the peak line.
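Both fixes can be sketched together; the size 13 and the distance threshold 15 follow the values quoted above, while the function name is mine.

```python
def normalize_boxes(boxes, peak, size=13, max_dist=15):
    """Force every box in one question to the same size, and snap boxes
    whose y coordinate strays too far from the peak line to the y of the
    box closest to that line."""
    best_y = min(boxes, key=lambda b: abs(b[1] - peak))[1]
    fixed = []
    for x, y, w, h in boxes:
        if abs(y - peak) > max_dist:
            y = best_y                      # snap the stray box back in line
        fixed.append((x, y, size, size))    # uniform 13x13 boxes
    return fixed
```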

6. Pull features of every checkbox for prediction

For every contour, I pulled out four main features for the future prediction namely:

  • number of contours
  • highest value of hierarchy
  • ratio of non-zero pixels
  • ratio of non-zero pixels after erode, dilate, subtract

The ratio of non-zero pixels after erode, dilate, subtract was used following the reference below.

Let’s check how different these four features are in the case of a ticked checkbox versus a non-ticked checkbox.

Features of non ticked checkbox
Features of ticked checkbox

However, the above is an ideal case, where the features of a ticked checkbox and a non-ticked checkbox differ like day and night. Not all cases are the same.

7. Predict checkboxes that are ticked

Therefore, I investigated further by checking the correlation between these four features and the actual choice.

Since all features have a positive correlation coefficient with the actual choice, I applied a basic regression to classify which checkboxes are ticked. In the formula below, I multiply each feature of a single checkbox by the found correlation between that feature and the actual choice, then sum the products.

By applying the formula below, the contour that contains the ticked checkbox will always have the highest value.
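The scoring itself reduces to a weighted sum; in this sketch the weights are hypothetical values standing in for the measured correlations.

```python
def score_checkboxes(features, weights):
    """For each checkbox's feature row, multiply every feature by that
    feature's correlation with the actual choice and sum the products;
    the checkbox with the highest score is predicted as ticked."""
    scores = [sum(f * w for f, w in zip(row, weights)) for row in features]
    return scores.index(max(scores)), scores
```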

I filled in the actual choice values to test the correctness of the prediction. Our current DataFrame would look like this:

Over 50 questions, with around 250 different checkboxes, the prediction was 100% correct. In the worst case, the chosen checkbox had a value 0.725 higher than the maximum value among the non-chosen checkboxes in the same question.

In the best case, the chosen checkbox had a value 7.108 higher than the maximum value among the non-chosen checkboxes in the same question.

In fact, with deep learning, we could improve the accuracy by finding optimal parameters so that the prediction value of a chosen checkbox is far higher than the maximum prediction value of the non-chosen checkboxes. However, since deep learning is beyond my current ability and knowledge, I will stop here.

Conclusion:

Despite my effort to find the best way to solve the problem, there are a few disadvantages:

  1. Depending on the quality of the image as well as the brightness of the environment, the scanning step may fail to detect the paper. In my case, I had to find a good environment and use a clamp holding the camera (or a phone facing the table) to make sure all images were consistent and taken under the same conditions.
  2. In the checkbox detection step, the aspect-ratio filter may cost us a few checkboxes. Therefore, in my case, I used Tkinter to show the prediction and the paper at the same time for a final check, so that if a prediction is wrong I can fix it immediately before converting the results into an Excel file.

Here is the link to the code:

Attention:

There are a few drawbacks in my code that you should optimize and adapt to your own case:

  1. Adjust the threshold depending on the light and darkness of your environment (your image).
  2. Adjust the line width and minimum line width in the code to suit your box size (in my code, lineWidth = 1, lineMinWidth = 9).

For any further discussion, you could contact me through email:

loichau997@gmail.com
