Python: How to find a table-like structure in an image

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/50829874/

Time: 2020-08-19 19:36:38  Source: igfitidea

How to find table like structure in image

python image opencv image-processing

Asked by Mohamed Thasin ah

I have different types of invoice files, and I want to find the table in each one. The table's position is not constant from file to file, so I turned to image processing. First I convert the invoice into an image, then I find contours based on the table borders, and finally I get the table position. For this task I used the code below.

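import cv2
import numpy as np
from wand.image import Image  # assumed: Image is Wand's Image class (alpha_channel, make_blob)

# page is one page of the invoice; thresh1 and thresh2 are the minimum
# width and height of a table, set per file
with Image(page) as page_image:
    page_image.alpha_channel = False  # eliminates transparency
    img_buffer = np.asarray(bytearray(page_image.make_blob()), dtype=np.uint8)
    img = cv2.imdecode(img_buffer, cv2.IMREAD_UNCHANGED)

    ret, thresh = cv2.threshold(img, 127, 255, 0)
    # [-2:] works on both OpenCV 3 (three return values) and OpenCV 4 (two)
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
    margin = []
    for contour in contours:
        # get rectangle bounding contour
        x, y, w, h = cv2.boundingRect(contour)
        # skip small false positives that are too small to be a table
        if w > thresh1 and h > thresh2:
            margin.append([x, y, x + w, y + h])
    # data cleanup on margin to extract required position values.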

In this code, I update thresh1 and thresh2 depending on the file.

Using this code I can successfully read the positions of tables in the images, and I then use those positions to work on my invoice PDF file. For example:

Sample 1: [image]

Sample 2: [image]

Sample 3: [image]

Output:

Sample 1: [image]

Sample 2: [image]

Sample 3: [image]

But now I have a new format that is a table without any borders. My entire approach depends on the table borders, and here there are none, so I don't know how to get past this problem. My question is: is there any way to find the table's position based on its structure?

For example, my problem input looks like this: [image]

I would like to find its position like this: [image]

How can I solve this? I would really appreciate any idea for solving the problem.

Thanks in advance.

Answered by Dmytro

Vaibhav is right. You can experiment with different morphological transforms to extract or group pixels into different shapes, lines, etc. For example, the approach can be the following:

  1. Start with dilation to convert the text into solid spots.
  2. Then apply the findContours function to find the text bounding boxes.
  3. Once you have the text bounding boxes, apply a heuristic algorithm to cluster the text boxes into groups by their coordinates. This way you can find groups of text areas aligned into rows and columns.
  4. Then sort the groups by x and y coordinates and/or analyze them to check whether the grouped text boxes can form a table.

I wrote a small sample illustrating the idea. I hope the code is self-explanatory. I've put some comments in it too.

import os
import cv2

# This only works if there's only one table on a page
# Important parameters:
#  - morph_size
#  - min_text_height_limit
#  - max_text_height_limit
#  - cell_threshold
#  - min_columns


def pre_process_image(img, save_in_file, morph_size=(8, 8)):

    # get rid of the color
    pre = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu threshold (with THRESH_OTSU the threshold is computed automatically; the 250 is ignored)
    pre = cv2.threshold(pre, 250, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # dilate the text to make it solid spot
    cpy = pre.copy()
    struct = cv2.getStructuringElement(cv2.MORPH_RECT, morph_size)
    cpy = cv2.dilate(~cpy, struct, anchor=(-1, -1), iterations=1)
    pre = ~cpy

    if save_in_file is not None:
        cv2.imwrite(save_in_file, pre)
    return pre


def find_text_boxes(pre, min_text_height_limit=6, max_text_height_limit=40):
    # Looking for the text spots contours
    # OpenCV 3
    # img, contours, hierarchy = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    # OpenCV 4
    contours, hierarchy = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    # Getting the texts bounding boxes based on the text size assumptions
    boxes = []
    for contour in contours:
        box = cv2.boundingRect(contour)
        h = box[3]

        if min_text_height_limit < h < max_text_height_limit:
            boxes.append(box)

    return boxes


def find_table_in_boxes(boxes, cell_threshold=10, min_columns=2):
    rows = {}
    cols = {}

    # Clustering the bounding boxes by their positions
    for box in boxes:
        (x, y, w, h) = box
        col_key = x // cell_threshold
        row_key = y // cell_threshold
        cols[col_key] = [box] if col_key not in cols else cols[col_key] + [box]
        rows[row_key] = [box] if row_key not in rows else rows[row_key] + [box]

    # Filtering out the row clusters having fewer than min_columns cells
    table_cells = list(filter(lambda r: len(r) >= min_columns, rows.values()))
    # Sorting the row cells by x coord
    table_cells = [list(sorted(tb)) for tb in table_cells]
    # Sorting rows by the y coord
    table_cells = list(sorted(table_cells, key=lambda r: r[0][1]))

    return table_cells


def build_lines(table_cells):
    if table_cells is None or len(table_cells) <= 0:
        return [], []

    max_last_col_width_row = max(table_cells, key=lambda b: b[-1][2])
    max_x = max_last_col_width_row[-1][0] + max_last_col_width_row[-1][2]

    max_last_row_height_box = max(table_cells[-1], key=lambda b: b[3])
    max_y = max_last_row_height_box[1] + max_last_row_height_box[3]

    hor_lines = []
    ver_lines = []

    for box in table_cells:
        x = box[0][0]
        y = box[0][1]
        hor_lines.append((x, y, max_x, y))

    for box in table_cells[0]:
        x = box[0]
        y = box[1]
        ver_lines.append((x, y, x, max_y))

    (x, y, w, h) = table_cells[0][-1]
    ver_lines.append((max_x, y, max_x, max_y))
    (x, y, w, h) = table_cells[0][0]
    hor_lines.append((x, max_y, max_x, max_y))

    return hor_lines, ver_lines


if __name__ == "__main__":
    in_file = os.path.join("data", "page.jpg")
    pre_file = os.path.join("data", "pre.png")
    out_file = os.path.join("data", "out.png")

    img = cv2.imread(in_file)

    pre_processed = pre_process_image(img, pre_file)
    text_boxes = find_text_boxes(pre_processed)
    cells = find_table_in_boxes(text_boxes)
    hor_lines, ver_lines = build_lines(cells)

    # Visualize the result
    vis = img.copy()

    # for box in text_boxes:
    #     (x, y, w, h) = box
    #     cv2.rectangle(vis, (x, y), (x + w - 2, y + h - 2), (0, 255, 0), 1)

    for line in hor_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)

    for line in ver_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)

    cv2.imwrite(out_file, vis)

I got the following output:

[image: Sample table extraction]

Of course, to make the algorithm more robust and applicable to a variety of input images, its parameters have to be adjusted accordingly.
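
One adjustment of this kind, offered only as a sketch and not as part of the original answer: the sample buckets boxes with integer division by cell_threshold, so two boxes on the same text line can land in different buckets when their y values fall on either side of a bucket boundary (for example, 19 // 10 = 1 but 21 // 10 = 2). The hypothetical helper cluster_rows below groups boxes by a tolerance on their top edge instead. The name, the tolerance parameter, and the sample boxes are all assumptions made for illustration.

```python
def cluster_rows(boxes, tolerance=10):
    """Group (x, y, w, h) boxes into rows whose top edges lie within `tolerance` pixels."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Compare against the first (topmost) box of the current row
        if rows and abs(box[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(box)
        else:
            # Gap is too large: this box starts a new row
            rows.append([box])
    # Sort each row's cells left to right
    return [sorted(row, key=lambda b: b[0]) for row in rows]


# Made-up boxes: two text lines, each with two cells whose y values differ slightly
boxes = [(10, 19, 40, 12), (120, 21, 40, 12), (10, 58, 40, 12), (120, 61, 40, 12)]
rows = cluster_rows(boxes)
print([len(row) for row in rows])  # prints [2, 2]
```

The rows returned here have the same shape as the values of the rows dictionary in find_table_in_boxes, so the min_columns filter could be applied to them in the same way.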

Update: I updated the code to match the OpenCV API change for findContours. If you have an older version of OpenCV installed, use the corresponding call.
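
If the same script has to run on both OpenCV 3 and OpenCV 4, one option is to take the last two items of whatever findContours returns. The helper below is a hypothetical sketch (unpack_contours is not an OpenCV function); it is exercised with stand-in tuples so that it runs without OpenCV installed.

```python
def unpack_contours(result):
    """Return (contours, hierarchy) from the tuple returned by cv2.findContours."""
    # OpenCV 3 returns (image, contours, hierarchy); OpenCV 4 returns (contours, hierarchy).
    # In both cases the last two items are the ones we need.
    contours, hierarchy = result[-2:]
    return contours, hierarchy


# Intended use (requires OpenCV, so it is shown as a comment here):
#   contours, hierarchy = unpack_contours(
#       cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE))

# Stand-in tuples mimicking the two return shapes
print(unpack_contours(("image", ["c1"], "h")))
print(unpack_contours((["c1"], "h")))
```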

Answered by Vaibhav Mehrotra

You can try applying some morphological transforms or smoothing (such as dilation, erosion, or Gaussian blur) as a pre-processing step before your findContours call.

For example:

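import cv2
import numpy as np

# g is assumed to be the grayscale input image, e.g. cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(g, (3, 3), 0)
ret, thresh1 = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY)
bitwise = cv2.bitwise_not(thresh1)  # invert so the text becomes white on black
# note: a 1x1 kernel leaves the image unchanged, so this erosion has no effect as written
erosion = cv2.erode(bitwise, np.ones((1, 1), np.uint8), iterations=5)
dilation = cv2.dilate(erosion, np.ones((3, 3), np.uint8), iterations=5)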

The last argument, iterations, sets the degree of dilation/erosion that takes place (in your case, on the text). A small value results in small independent contours, even within a single letter, while a large value merges many nearby elements together. You need to find the value at which only the block of the image you are after ends up as one contour.
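
The effect of iterations can be seen on a toy one-dimensional example. This is a pure-Python illustration rather than OpenCV code: dilate_1d and count_blobs are hypothetical helpers, and the pixel row is made up. Each pass with a 3-wide structuring element grows every blob by one pixel on each side, so more passes merge blobs that are further apart.

```python
def dilate_1d(row, iterations=1):
    """Binary dilation of a 0/1 list with a 3-wide structuring element."""
    for _ in range(iterations):
        padded = [0] + row + [0]
        # Each output pixel is 1 if it or either neighbour was 1
        row = [max(padded[i:i + 3]) for i in range(len(row))]
    return row


def count_blobs(row):
    """Count runs of consecutive 1s, the 1-D analogue of counting contours."""
    return sum(1 for i, v in enumerate(row) if v == 1 and (i == 0 or row[i - 1] == 0))


# Three "letters": the first two are 2 pixels apart, the third is 6 pixels further on
line = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]
for n in (0, 1, 3):
    print(n, count_blobs(dilate_1d(line, n)))
# prints:
# 0 3
# 1 2
# 3 1
```

With one pass the two close letters join into a word-sized blob; with three passes everything on the line merges, which is the trade-off described above.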

Please note that I used 150 as the threshold because I have been working on extracting text from images with varying backgrounds, and this value worked better there. You can keep the value you are already using, since yours is a black-and-white image.

Answered by Devashish Prasad

There are many types of tables in document images, with a great deal of variation in layout. No matter how many rules you write, a table will always turn up for which your rules fail. These types of problems are generally solved using ML (Machine Learning) based solutions. You can find many ready-made implementations on GitHub for detecting tables in images using ML or DL (Deep Learning).

Here is my code along with the deep learning models. The model can detect various types of tables as well as the structure cells within them: https://github.com/DevashishPrasad/CascadeTabNet

As of 10 May 2020, the approach achieves state-of-the-art accuracy on various public datasets.

More details: https://arxiv.org/abs/2004.12629