Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/23284759/

Time: 2020-08-19 02:38:41  Source: igfitidea

Opening a pdf and reading in tables with python pandas

Tags: python, pdf, pandas

Asked by ccsv

Is it possible to open a PDF and read its tables using Python pandas, or do I have to use the pandas clipboard function for this?

Accepted answer by Daniel

This is not possible. PDF is a data format for printing, so the table structure is lost. With some luck, you can extract the text with pypdf and guess the former table columns.
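
A minimal sketch of that approach, assuming the pypdf package is installed and that the gaps between columns survive in the extracted text as runs of two or more spaces (this depends on how the PDF was produced and is not guaranteed). The helper names extract_text, guess_rows and pdf_to_dataframe are made up for illustration.

```python
import re


def extract_text(pdf_path):
    """Return the text of every page, joined by newlines (needs pypdf)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def guess_rows(text, min_gap=2):
    """Split each non-empty line into cells wherever min_gap or more
    whitespace characters occur. This only guesses at the former columns."""
    gap = re.compile(r"\s{%d,}" % min_gap)
    return [gap.split(line.strip()) for line in text.splitlines() if line.strip()]


def pdf_to_dataframe(pdf_path):
    """Build a DataFrame from the guessed rows (needs pandas).

    Assumes the first row is the header and that every other row has the
    same number of cells; real PDFs often break this assumption.
    """
    import pandas as pd  # third-party: pip install pandas

    rows = guess_rows(extract_text(pdf_path))
    return pd.DataFrame(rows[1:], columns=rows[0])
```

Cells that contain single spaces (such as "New York") stay intact, but any row where the extractor collapses the column gap to one space will be split wrongly, so the result needs checking by eye.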

Answer by Matija Han

If it is a one-off, you can copy the data from your PDF table into a text file, format it (using search-and-replace, Notepad++ macros, or a script), save it as a CSV file and load it into Pandas.
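
The "format it with a script" step could look like the sketch below. It assumes the pasted text keeps a tab or a run of two or more spaces between cells, which is an assumption about your PDF viewer rather than a given. Using the csv module instead of a plain search-and-replace means a cell such as 1,234 is quoted rather than split in two. The function names are hypothetical.

```python
import csv
import io
import re


def text_to_csv(raw_text):
    """Convert pasted table text to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in raw_text.splitlines():
        if line.strip():
            # Assumed cell separator: tabs, or two or more spaces.
            cells = re.split(r"\t+| {2,}", line.strip())
            writer.writerow([cell.strip() for cell in cells])
    return buffer.getvalue()


def load_csv_text(csv_text):
    """Load the CSV text into a DataFrame (needs pandas)."""
    import pandas as pd  # third-party: pip install pandas

    return pd.read_csv(io.StringIO(csv_text))
```

To keep the CSV on disk instead, write the string returned by text_to_csv to a file and pass that file's path to pd.read_csv.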

If you need to do this in a scalable way, you might try this product: http://tabula.technology/. I have not used it yet, so I don't know how well it works, but you can explore it if you need it.

Answer by JMM

Copy the table data from a PDF and paste it into an Excel file (it usually gets pasted as a single column rather than multiple columns). Then use Flash Fill (available in Excel 2016, not sure about earlier Excel versions) to separate the data into the columns originally shown in the PDF. The process is fast and easy. Then use Pandas to wrangle the Excel data.

Answer by Isac Junior

You can use tabula: https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

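from tabula import read_pdf  # pip install tabula-py (needs a Java runtime)
# Recent tabula-py versions return a list of DataFrames, one per detected table.
tables = read_pdf('data.pdf')
df = tables[0]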

You can see more in the link!

Answer by joselquin

I have been doing some tests with Camelot (https://camelot-py.readthedocs.io/en/master/), and it works very well in many situations. You can also try adjusting some parameters if the defaults don't work.

It's similar to Tabula, but it uses different algorithms (Tabula uses the vector data in the PDF and rasterizes the lines of the table; Camelot uses the Hough transform), so you can try both to find the better one.

Both have a web version, so you can try them with some examples to decide which is best for your application.
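
As a rough sketch of adjusting parameters in Camelot: camelot.read_pdf accepts a flavor argument ("lattice" for tables drawn with ruling lines, "stream" for tables separated by whitespace), and each extracted table carries a parsing_report dictionary that includes an accuracy figure. The helper names below and the idea of choosing the flavor by mean accuracy are assumptions for illustration, not something this answer prescribes.

```python
def mean_accuracy(reports):
    """Average the 'accuracy' values of a list of parsing-report dicts."""
    if not reports:
        return 0.0
    return sum(report["accuracy"] for report in reports) / len(reports)


def best_flavor(reports_by_flavor):
    """Return the flavor whose tables have the highest mean accuracy."""
    return max(
        reports_by_flavor,
        key=lambda flavor: mean_accuracy(reports_by_flavor[flavor]),
    )


def try_flavors(pdf_path, pages="1"):
    """Run Camelot with both flavors; return (flavor, tables) for the better one."""
    import camelot  # third-party: pip install camelot-py

    results = {
        flavor: camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        for flavor in ("lattice", "stream")
    }
    reports = {
        flavor: [table.parsing_report for table in tables]
        for flavor, tables in results.items()
    }
    winner = best_flavor(reports)
    return winner, results[winner]
```

Each table in the returned list exposes its data as a pandas DataFrame through its df attribute. The reported accuracy is only a heuristic, so it is still worth inspecting the output before trusting it.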

Answer by Mark

There is a new version of tabula called tabula-py:

pip install tabula-py

The .read_pdf method works just like in the old version; the documentation is here: https://pypi.org/project/tabula-py/
