Original question: http://stackoverflow.com/questions/47533875/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to extract table as text from the PDF using Python?
Asked by venkat
I have a PDF which contains tables, text and some images. I want to extract every table, wherever it appears in the PDF.
Right now I am finding the tables manually, page by page. Once I find one, I capture that page and save it into another PDF.
import PyPDF2

PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object
pg4 = pfr.getPage(126)  # extract pg 127
writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
# add pages
writer.addPage(pg4)
NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF
My goal is to extract the table from the whole PDF document.
Accepted answer by Eric Ihli
This answer is for anyone encountering pdfs with images and needing to use OCR. I could not find a workable off-the-shelf solution; nothing that gave me the accuracy I needed.
Here are the steps I found to work.
1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the pdf into images.
2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
3. Use OpenCV to find and extract tables.
4. Use OpenCV to find and extract each cell from the table.
5. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
6. Use Tesseract to OCR each cell.
7. Combine the extracted text of each cell into the format you need.
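The first two steps only shell out to external tools. Here is a minimal sketch of how they might be driven from Python. The helper names (`pdfimages_cmd`, `osd_cmd`, `parse_osd_rotation`, `mogrify_cmd`, `fix_rotation`) are made up for illustration and are not part of the package mentioned in this answer. It assumes `pdfimages`, `tesseract` and `mogrify` are on the PATH, and that Tesseract's orientation detection (`--psm 0`) prints a `Rotate: N` line.

```python
import re
import subprocess


def pdfimages_cmd(pdf_path, prefix):
    """Command that dumps the images in a PDF as PNG files named <prefix>-NNN.png."""
    return ["pdfimages", "-png", pdf_path, prefix]


def osd_cmd(image_path):
    """Command that asks Tesseract for orientation and script detection only."""
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_osd_rotation(osd_text):
    """Return the rotation in degrees that Tesseract suggests, or 0 if none is reported."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def mogrify_cmd(image_path, degrees):
    """Command that rotates the image in place with ImageMagick."""
    return ["mogrify", "-rotate", str(degrees), image_path]


def fix_rotation(image_path):
    """Detect the page rotation and correct it in place (runs the external tools)."""
    osd = subprocess.run(osd_cmd(image_path), capture_output=True, text=True, check=True)
    degrees = parse_osd_rotation(osd.stdout)
    if degrees:
        subprocess.run(mogrify_cmd(image_path, degrees), check=True)
    return degrees
```

Keeping the command builders and the parser separate from `fix_rotation` means the parsing can be checked without the tools installed.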
I wrote a python package with modules that can help with those steps.
Repo: https://github.com/eihli/image-table-ocr
Docs & Source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
Some of the steps don't require code; they take advantage of external tools like pdfimages and tesseract. I'll provide some brief examples for a couple of the steps that do require code.
- Finding tables:
This link was a good reference while figuring out how to find tables. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/
import cv2

def find_tables(image):
    # `image` must be a single-channel (grayscale) 8-bit image.
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    # Invert and binarize so that dark lines on a light page become white on black.
    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    vertical = horizontal = img_bin.copy()
    SCALE = 5
    # NumPy image shape is (rows, columns), i.e. (height, width).
    image_height, image_width = horizontal.shape

    # Opening with long, thin kernels keeps only long horizontal / vertical lines.
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    # Dilate so that lines which nearly touch get joined.
    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    # Discard small contours; they are unlikely to be tables.
    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images
- Extract cells from table.
This is very similar to finding tables, so I won't include all the code. The part I will show is how the cells are sorted.
We want to identify the cells from left-to-right, top-to-bottom.
We'll find the rectangle with the most top-left corner. Then we'll find all of the rectangles that have a center that is within the top-y and bottom-y values of that top-left rectangle. Then we'll sort those rectangles by the x value of their center. We'll remove those rectangles from the list and repeat.
# Each cell is an (x, y, w, h) bounding rectangle; `cells` is the list of them.
def cell_in_same_row(c1, c2):
    c1_center = c1[1] + c1[3] - c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

orig_cells = [c for c in cells]
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
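To make the sorting idea easy to try without any images, here is a small self-contained sketch that runs on made-up rectangles and then joins the per-cell text into CSV (the last step of the pipeline). The function names, the sample coordinates and the `text_by_cell` mapping are assumptions for illustration. In practice the rectangles would come from OpenCV and the text from Tesseract. Unlike the snippet above, it sorts the cells by their top edge first so that the anchor of each row is always the top-most remaining cell.

```python
import csv
import io


def group_cells_into_rows(cells):
    """Group (x, y, w, h) rectangles into rows, top to bottom, each row left to right."""
    remaining = sorted(cells, key=lambda c: (c[1], c[0]))
    rows = []
    while remaining:
        # The top-most remaining cell anchors the next row.
        anchor, rest = remaining[0], remaining[1:]
        top, bottom = anchor[1], anchor[1] + anchor[3]

        def in_row(cell):
            # A cell joins the row when its vertical center lies inside the anchor's extent.
            return top < cell[1] + cell[3] / 2 < bottom

        rows.append(sorted([anchor] + [c for c in rest if in_row(c)], key=lambda c: c[0]))
        remaining = [c for c in rest if not in_row(c)]
    return rows


def rows_to_csv(rows, text_by_cell):
    """Write the text of each cell, row by row, as CSV and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([text_by_cell.get(cell, "") for cell in row])
    return buffer.getvalue()


# Hypothetical 2x3 table whose cells arrive in arbitrary order with slight vertical jitter.
text_by_cell = {
    (130, 41, 50, 20): "0.25",
    (70, 10, 50, 20): "Qty",
    (10, 40, 50, 20): "Bolt",
    (130, 11, 50, 20): "Price",
    (10, 12, 50, 20): "Name",
    (70, 42, 50, 20): "4",
}
rows = group_cells_into_rows(list(text_by_cell))
print(rows_to_csv(rows, text_by_cell))
```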
Answer by A STEFANI
In my opinion you have 5 possibilities:
- You may extract the table directly using camelot ("PDF Table Extraction for Humans").
- You may process the pdf directly using tabula.
- You may convert the pdf to text using pdftotext, then parse the text with Python.
- You may use an external tool to convert your pdf file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
- You may also convert the pdf to an image file, then use any recent OCR software (which reconstructs the table automatically from the picture) to get the data.
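As a rough illustration of the pdftotext route, the sketch below runs `pdftotext -layout` (which tries to preserve the column layout) and then splits each line on runs of two or more spaces. The function names and the "two or more spaces" heuristic are assumptions, not part of any library. It only works for simple tables whose cells contain no wide gaps, and it needs poppler's `pdftotext` on the PATH.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_table_lines(text, min_columns=2):
    """Split each line on runs of 2+ whitespace characters; keep lines with enough fields."""
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= min_columns:
            rows.append(fields)
    return rows
```

For example, `parse_table_lines(pdf_to_layout_text("report.pdf"))` would give a list of rows, each a list of cell strings, with `report.pdf` standing in for your own file.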
Your question is near similar with:
Regards
Answer by Himanshu Poddar
I would suggest extracting the table using tabula. Pass your pdf as an argument to the tabula API and it will return the tables as dataframes. Each table in your pdf is returned as one dataframe. This is my code for extracting tables from a pdf.
# the tables are returned as a list of dataframes; to work with dataframes you need pandas
import os
import pandas as pd
import tabula

file = "filename.pdf"
# os.path.join inserts the path separator between the directory and the file name
path = os.path.join('enter your directory path here', file)
df = tabula.read_pdf(path, pages='1', multiple_tables=True)
print(df)
Please refer to this repo of mine for more details.
Answer by josem8f
A 2019 update to the question, as I'm always directed here every time I search for "python extract pdf table"
There's a python solution called camelot/excalibur
Answer by Guy
For a project I was doing, the following worked well for me: first, generate an image from each pdf page (pdf2image) and then run OCR on the images (pytesseract).
Here is a simple git project implementing this approach. Enjoy!
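The two-stage flow can be sketched in a few lines. This is not the code from the linked project. `ocr_pdf` is a hypothetical helper, and the `to_images` / `to_text` parameters are an assumption added so the flow can be exercised without the libraries installed. By default it uses `pdf2image.convert_from_path` and `pytesseract.image_to_string`, which in turn need poppler and the tesseract binary.

```python
def ocr_pdf(pdf_path, to_images=None, to_text=None):
    """Return the OCR'd text of each page of a PDF as a list of strings."""
    if to_images is None:
        # Renders every page of the PDF to a PIL image (requires poppler).
        from pdf2image import convert_from_path as to_images
    if to_text is None:
        # Runs Tesseract on one image and returns the recognized text.
        from pytesseract import image_to_string as to_text
    return [to_text(page) for page in to_images(pdf_path)]
```

Calling `ocr_pdf("scan.pdf")` with both libraries installed would return one string per page, with `scan.pdf` standing in for your own file. Note that this gives plain text, so table structure still has to be recovered afterwards.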