Open a PDF and read in its tables using Python pandas

Question

Asked by ccsv

Is it possible to open a PDF and read its tables in using Python pandas, or do I have to use the pandas clipboard function for this?

Answer 1

Accepted answer by Daniel

This is not possible. PDF is a data format for printing, so the table structure is lost. With some luck you can extract the text with pypdf and guess the former table columns.
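
The "extract the text and guess the columns" idea could be sketched roughly as follows. This is only an illustration under assumptions: it assumes the pypdf package is installed, that cells in the extracted text are separated by whitespace and contain no spaces themselves, and that the first non-empty line is the header. The file name `report.pdf`, the helper names, and the sample values are hypothetical, and real PDFs usually need more cleanup than this.

```python
import pandas as pd


def extract_pdf_text(path):
    """Return the text of every page, joined by newlines (needs pypdf)."""
    from pypdf import PdfReader  # imported lazily so guess_table works without pypdf

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def guess_table(text):
    """Guess a table from raw text; the first non-empty line is the header.

    Assumes cells are separated by whitespace and contain no spaces.
    Lines whose cell count differs from the header are skipped.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return pd.DataFrame()
    header, body = rows[0], rows[1:]
    body = [row for row in body if len(row) == len(header)]
    return pd.DataFrame(body, columns=header)


# Stand-in for extract_pdf_text("report.pdf"); the values are made up.
sample_text = "name qty price\napple 3 1.20\nPage 1 of 1\npear 5 0.80\n"
df = guess_table(sample_text)
print(df)
```

All values come back as strings, so numeric columns still need converting, for example with `pd.to_numeric`.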

Answer 2

Answer by Matija Han

If it is a one-off, you can copy the data from your PDF table into a text file, format it (using search-and-replace, Notepad++ macros, or a script), save it as a CSV file, and load it into pandas.
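
If the formatting step is done with a script, it might look like the sketch below. It assumes the pasted text uses two or more spaces between columns (so a cell may contain single spaces) and that the first line is the header. The sample data and the file name `table.csv` are made up for illustration.

```python
import csv
import os
import re
import tempfile

import pandas as pd

# Text as it might look after pasting from a PDF viewer (made-up data).
pasted = """City         Population  Country
New York     8336817     USA
Sao Paulo    12325232    Brazil
"""


def text_to_csv(text, csv_path):
    """Split each non-empty line on runs of 2+ spaces and write it as a CSV row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for line in text.splitlines():
            if line.strip():
                writer.writerow(re.split(r"\s{2,}", line.strip()))


# Write to the system temp directory to avoid cluttering the working directory.
csv_path = os.path.join(tempfile.gettempdir(), "table.csv")
text_to_csv(pasted, csv_path)

df = pd.read_csv(csv_path)
print(df.dtypes)
```

Going through a CSV file lets `pd.read_csv` infer the column types, so `Population` is read as an integer column rather than text.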

If you need to do this in a scalable way, you might try this product: http://tabula.technology/. I have not used it yet, so I don't know how well it works, but you can explore it if you need it.

Answer 3

Answer by JMM

Copy the table data from a PDF and paste it into an Excel file (it usually gets pasted as a single column rather than multiple columns). Then use Flash Fill (available in Excel 2016, not sure about earlier Excel versions) to separate the data into the columns originally shown in the PDF. The process is fast and easy. Then use pandas to wrangle the Excel data.
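
Once the data is in pandas, a single pasted column can also be split there instead of in Excel. The sketch below assumes every row has the same fields separated by a single space and that no field contains a space. The column names and values are hypothetical. With a real workbook the starting frame would come from `pd.read_excel`, which needs an Excel engine such as openpyxl installed.

```python
import pandas as pd

# Stand-in for a sheet where every table row landed in one column (made-up data).
raw = pd.DataFrame({"pasted": ["2021-01-04 Widget 12", "2021-01-05 Gadget 7"]})

# Split each string into separate columns, then name and type them.
df = raw["pasted"].str.split(" ", expand=True)
df.columns = ["date", "item", "units"]
df["date"] = pd.to_datetime(df["date"])
df["units"] = df["units"].astype(int)
print(df)
```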

Answer 4

Answer by Isac Junior

You can use tabula: https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

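# Requires the tabula-py package and a Java runtime
from tabula import read_pdf
# Recent tabula-py releases return a list of DataFrames, one per table found
df = read_pdf('data.pdf')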

You can see more in the link!
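
To scan every page and combine the results, something like the following may work. It is a minimal sketch, assuming tabula-py and a Java runtime are installed and that the tables found share the same columns. The helper names and the stand-in data are hypothetical, and `data.pdf` is the file name from the snippet above.

```python
import pandas as pd


def read_tables(path, pages="all"):
    """Return every table tabula-py finds in the PDF as a list of DataFrames."""
    import tabula  # lazy import: needs the tabula-py package and Java

    return tabula.read_pdf(path, pages=pages, multiple_tables=True)


def stack_tables(tables):
    """Concatenate tables that share the same columns into one DataFrame."""
    if not tables:
        return pd.DataFrame()
    return pd.concat(tables, ignore_index=True)


# Stand-ins for read_tables("data.pdf"), e.g. one table continued over two pages.
page_1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
page_2 = pd.DataFrame({"id": [3], "score": [0.9]})
df = stack_tables([page_1, page_2])
print(df)
```

With a real file the call would be `stack_tables(read_tables("data.pdf"))`. If the tables have different layouts, it is safer to keep them as separate DataFrames.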

Answer 5

Answer by joselquin

I have been doing some tests with Camelot (https://camelot-py.readthedocs.io/en/master/), and it works very well in many situations. You can try adjusting some parameters if the defaults don't work.

It's similar to Tabula, but it uses different algorithms (Tabula uses the vector data in the PDF and rasterizes the lines of the table; Camelot uses the Hough transform), so you can try both to find the best one.

Both have a web version, so you can try them on a few examples to decide which one is best for your application.
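
One parameter worth trying in Camelot is `flavor`: `"lattice"` targets tables drawn with ruling lines, while `"stream"` targets tables whose columns are separated only by whitespace. The sketch below, assuming camelot-py and its dependencies are installed, runs one flavor and returns each table with the accuracy Camelot reports for it. The helper names and the accuracy values are made up for illustration.

```python
def parse_with_flavor(path, flavor, pages="1"):
    """Run Camelot with one flavor and return (DataFrame, accuracy) per table."""
    import camelot  # lazy import: needs camelot-py and its dependencies

    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    return [(table.df, table.parsing_report["accuracy"]) for table in tables]


def best_flavor(accuracy_by_flavor):
    """Pick the flavor whose reported accuracy is highest."""
    return max(accuracy_by_flavor, key=accuracy_by_flavor.get)


# Made-up accuracies standing in for values read from parsing_report.
scores = {"lattice": 99.0, "stream": 87.5}
print(best_flavor(scores))
```

The reported accuracy is only a rough guide, so it is still worth looking at the resulting DataFrame before settling on a flavor.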

Answer 6

Answer by Mark

There is a new version of tabula called tabula-py:

pip install tabula-py

The .read_pdf method works just like in the old version; the documentation is here: https://pypi.org/project/tabula-py/
