Extracting/identifying tables from PDF in Python

Question

Asked by Alexander McFarlane

Are there any open source libraries that support table identification & extraction?

By this I mean:

1. Identify that a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON or CSV

I have looked through similar questions on this topic and found the following:

PDFMiner, which addresses problem 3, but it seems the user is required to tell PDFMiner where each table is located (correct me if I'm wrong)
pdf-table-extract, which attempts to address problem 1 but, according to its To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem, as all tables in my PDFs are separated by whitespace!

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

Answer 1

Accepted answer by Kurt Pfeifle

You should definitely have a look at this answer of mine:

Extracting table contents from a collection of PDF files

and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool available for PDF scraping.

Answer 2

Answered by Ricky McMaster

I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a Pandas data frame. You can also set the area in x,y co-ordinates which is obviously very handy for irregular data.
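
For illustration, here is a minimal sketch of how tabula-py might be called with an area restriction. It assumes tabula-py and a Java runtime are installed. The helper names `box_to_area` and `read_tables`, and the file name and coordinates in the usage comment, are made up for this example. tabula-py's `read_pdf` takes `area` as top, left, bottom, right in PDF points, so the helper only reorders an x,y box:

```python
def box_to_area(x1, y1, x2, y2):
    """Reorder an (x1, y1, x2, y2) box, in PDF points measured from the
    top-left corner of the page, into tabula-py's (top, left, bottom, right)."""
    return [min(y1, y2), min(x1, x2), max(y1, y2), max(x1, x2)]


def read_tables(pdf_path, box=None, pages="all"):
    """Return a list of pandas DataFrames, one per table tabula detects.

    `box` is an optional (x1, y1, x2, y2) tuple; this wrapper is hypothetical,
    only `tabula.read_pdf` and its `pages`/`area`/`multiple_tables`
    arguments come from tabula-py itself.
    """
    import tabula  # imported lazily: needs tabula-py plus Java at call time

    kwargs = {"pages": pages, "multiple_tables": True}
    if box is not None:
        kwargs["area"] = box_to_area(*box)
    return tabula.read_pdf(pdf_path, **kwargs)


# Usage (hypothetical file and coordinates):
# tables = read_tables("report.pdf", box=(50, 100, 400, 300), pages=1)
# tables[0].to_csv("table.csv", index=False)
```

If the detected tables look wrong, adjusting the box by a few points at a time is usually the first thing to try.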

Answer 3

Answered by Ike

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!

I hope you are using Linux:

pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in neat columns, and it is trivial to format it into a CSV etc.
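
A rough sketch of that last step in Python, assuming the columns in the `pdftotext -layout` output are separated by at least two spaces and that no cell contains a run of two spaces itself. The function names and the sample text are made up for illustration, and real output will likely need some tuning of `min_gap`:

```python
import csv
import io
import re


def layout_text_to_rows(text, min_gap=2):
    """Split each non-empty line into cells on runs of `min_gap` or more spaces."""
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to a CSV string using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample resembling layout-preserved text
sample = (
    "Name           Qty    Price\n"
    "Blue widget    2      3.50\n"
    "\n"
    "Red widget     10     1.25\n"
)
rows = layout_text_to_rows(sample)
print(rows_to_csv(rows))
```

Note that this simple split cannot tell when a cell is empty, so rows with missing values end up with fewer columns than the header. Handling that would need the character offsets of each column rather than a plain split.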

It is at times like this that I love Linux. These guys came up with AMAZING solutions to everything, and put them out there for FREE!
