Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/28532770/

Time: 2020-08-19 03:23:45  Source: igfitidea

Extract / Identify Tables from PDF python

Tags: python, pdf, scrape, pdf-scraping

Asked by Alexander McFarlane

Are there any open source libraries that support table identification & extraction?

By this I mean:

  1. Identify that a table structure exists
  2. Classify the table from its contents
  3. Extract data from the table in a useful output format e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

  • PDFMiner, which addresses problem 3, but it seems the user is required to tell PDFMiner where a table structure exists for each table (correct me if I'm wrong)
  • pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

Accepted answer by Kurt Pfeifle

You should definitely have a look at this answer of mine:

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.

Answer by Ricky McMaster

I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a Pandas data frame. You can also set the area in x,y co-ordinates which is obviously very handy for irregular data.
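
As a rough sketch of what that can look like (not taken from the answer itself): the snippet below assumes tabula-py and a Java runtime are installed, and that `tabula.read_pdf` accepts `pages` and `area` keyword arguments, with `area` ordered as top, left, bottom, right in PDF points. The helper names `xy_box_to_tabula_area` and `read_tables` and the file name `report.pdf` are made up for illustration.

```python
from typing import List


def xy_box_to_tabula_area(x0: float, y0: float, x1: float, y1: float) -> List[float]:
    """Convert an (x0, y0)-(x1, y1) box into the [top, left, bottom, right]
    order that tabula-py is assumed to expect for its `area` argument.
    Coordinates are in PDF points, measured from the top-left of the page."""
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return [top, left, bottom, right]


def read_tables(pdf_path: str, pages="all", area=None):
    """Return whatever tabula.read_pdf yields for the given pages/area
    (in recent tabula-py versions, a list of pandas DataFrames)."""
    # Imported lazily so the helper above works without tabula-py/Java.
    import tabula

    kwargs = {"pages": pages}
    if area is not None:
        kwargs["area"] = area
    return tabula.read_pdf(pdf_path, **kwargs)


# Hypothetical usage, with a made-up file name and region:
# tables = read_tables("report.pdf", pages=1,
#                      area=xy_box_to_tabula_area(50, 100, 550, 400))
# tables[0].to_csv("table.csv", index=False)
```

Restricting the area is mostly useful when a page mixes prose and tables; without it, Tabula tries to guess the table regions on its own.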

Answer by Ike

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!

I hope you are using Linux:

pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, and it is trivial to format it into a CSV, etc.
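
One way that formatting step might look in Python, as a minimal sketch: it assumes columns in the `pdftotext -layout` output are separated by runs of two or more spaces and that no cell is empty (an empty cell would shift the columns that follow it). The function names `layout_text_to_rows` and `rows_to_csv` are invented for this example.

```python
import csv
import io
import re
from typing import List

# Assumption: two or more consecutive spaces mark a column boundary,
# while single spaces belong to the cell text.
COLUMN_GAP = re.compile(r" {2,}")


def layout_text_to_rows(text: str) -> List[List[str]]:
    """Split layout-preserved text into rows of cell strings."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between rows or pages
        rows.append(COLUMN_GAP.split(stripped))
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical usage on the file written by `pdftotext -layout NAME_OF_PDF.pdf`:
# with open("NAME_OF_PDF.txt", encoding="utf-8") as handle:
#     print(rows_to_csv(layout_text_to_rows(handle.read())))
```

For tables with missing cells, slicing each line at fixed character positions is likely to be more reliable than splitting on gaps.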

It is at times like this that I love Linux: these guys came up with AMAZING solutions to everything, and put them out there for FREE!
