Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/26494211/

Time: 2020-08-19 00:33:27. Source: igfitidea

Extracting text from a PDF file using PDFMiner in python?

Tags: python, python-3.x, python-2.7, text-extraction, pdfminer

Asked by DuckPuncher

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just looking at source-code to see if I can figure it out.

Accepted answer by DuckPuncher

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.

Answer by juan Isaza

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)



    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

Answer by Brault Gilbert

This code was tested with pdfminer for Python 3 (pdfminer-20191125).

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal

def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines
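
A shorter route to the same kind of line list can be sketched, assuming a newer pdfminer.six release that provides `pdfminer.high_level.extract_pages` is installed (the pdfminer-20191125 build mentioned above may predate it). The helper names `lines_from_layout` and `lines_from_pdf` are made up for this example. Unlike the code above, it relies on duck typing rather than `LTTextBoxHorizontal`, so it would also pick up vertical text boxes:

```python
def lines_from_layout(pages):
    """Collect text lines from an iterable of page layouts.

    Hypothetical helper: each page is expected to be iterable and to yield
    layout elements, as pdfminer's page layout objects do.
    """
    lines = []
    for page in pages:
        for element in page:
            # Duck typing: text boxes expose get_text(); lines, rectangles
            # and images do not, so they are skipped.
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                lines.extend(get_text().splitlines())
    return lines


def lines_from_pdf(path):
    """Hypothetical wrapper: feed extract_pages() output to lines_from_layout."""
    # Imported lazily so lines_from_layout can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    return lines_from_layout(extract_pages(path))
```

Calling `lines_from_pdf('report.pdf')` would then return one list entry per text line, with no resource manager or interpreter to set up by hand.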

Answer by Pieter

Full disclosure, I am one of the maintainers of pdfminer.six.

Nowadays, there are multiple APIs for extracting text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

Command line

If you want to extract text just once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text with Python, you can use the high-level API. This approach is the go-to solution if you want to extract text programmatically from many PDFs.

from pdfminer.high_level import extract_text

text = extract_text('samples/simple1.pdf')
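
For the many-PDFs case, a minimal sketch might loop over a folder and call `extract_text` on each file. The helper name `extract_folder` and its `extract` parameter are assumptions made for illustration and are not part of pdfminer.six; the `extract` parameter exists only so the loop can be exercised with a stand-in function.

```python
from pathlib import Path


def extract_folder(folder, extract=None):
    """Return {file name: extracted text} for every *.pdf directly in folder.

    extract_folder and its `extract` parameter are hypothetical; by default
    the real pdfminer.six extract_text function is used.
    """
    if extract is None:
        # Imported lazily so the helper can be tried with a stand-in extractor.
        from pdfminer.high_level import extract_text as extract

    texts = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            texts[pdf_path.name] = extract(str(pdf_path))
        except Exception as exc:
            # One unreadable PDF should not stop the whole batch.
            print(f"skipping {pdf_path.name}: {exc}")
    return texts
```

With this in place, `texts = extract_folder('samples')` would give a dictionary keyed by file name. Catching every exception is a deliberate simplification; narrower handling may suit real workloads better.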

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, you can use it to implement your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize the way pdfminer.six behaves.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('samples/simple1.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Answer by Cornelius Roemer

This works as of May 2020 using pdfminer.six with Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf', 'rb') as f:
    text = extract_text(f)

Using a PDF already in memory

If the PDF is already in memory, for example because it was retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io

import requests
from pdfminer.high_level import extract_text

response = requests.get(url)  # url: address of the PDF to download
text = extract_text(io.BytesIO(response.content))
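
A web request can return something other than a PDF, for instance an HTML error page, so a rough check before wrapping the bytes may save some confusion. The sketch below is an illustration only: `pdf_stream_from_bytes` is a hypothetical helper, and looking for the `%PDF-` marker near the start of the data is a heuristic, not real validation.

```python
import io


def pdf_stream_from_bytes(data):
    """Wrap raw bytes in a BytesIO stream after a rough PDF sanity check.

    Hypothetical helper: it only looks for the b"%PDF-" marker within the
    first 1024 bytes, which is a heuristic rather than full validation.
    """
    if b"%PDF-" not in data[:1024]:
        raise ValueError("data does not look like a PDF")
    return io.BytesIO(data)
```

The result can be passed on in the same way as before, for example `text = extract_text(pdf_stream_from_bytes(response.content))`.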

Performance and Reliability compared with PyPDF2

pdfminer.six works more reliably than PyPDF2, which fails on certain types of PDFs, in particular PDF version 1.7.

However, text extraction with pdfminer.six is about six times slower than with PyPDF2.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) on a 10-page PDF, and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec
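
The timing script itself is not shown in the answer. As a rough idea of how such a measurement might be set up with timeit, here is a sketch; `time_extraction` is a hypothetical helper and not the code that produced the numbers above.

```python
import timeit


def time_extraction(extract, source, repeat=3):
    """Return the best wall-clock time in seconds of one extract(source) call.

    Hypothetical helper: `extract` is any callable, such as
    pdfminer.high_level.extract_text, and `source` is whatever it accepts.
    """
    # number=1 runs the call once per measurement; repeat controls how many
    # measurements are taken. The minimum is the least noisy figure.
    timings = timeit.repeat(lambda: extract(source), number=1, repeat=repeat)
    return min(timings)
```

Passing a file path as `source` would include the time spent opening the file, which the measurement described above deliberately left out.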

pdfminer.six also has a large footprint: it requires pycryptodome, which needs GCC and other packages installed, pushing a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.