Python: How to extract text and text coordinates from a PDF file?

Notice: This page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original URL: http://stackoverflow.com/questions/22898145/

Time: 2020-08-19 01:55:58  Source: igfitidea

How to extract text and text coordinates from a PDF file?

python pdf pdfminer

Asked by pnj

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner.

Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text locations?

Given a PDF file, output should look something like:

   489, 41,  "Signature"
   500, 52,  "b"
   630, 202, "a_g_i_r"

Accepted answer by pnj

Newlines are converted to underscores in final output. This is the minimal working solution that I found.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)
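
For readers on Python 3, the same traversal can be sketched with the higher-level `extract_pages` helper. This is a minimal sketch, not part of the original answer: it assumes the pdfminer.six fork is installed, and the names `format_box` and `iter_text_boxes` are hypothetical helpers introduced here for illustration. Stripping the trailing newline before the underscore replacement is also an assumption, made so the output resembles the format in the question.

```python
def format_box(bbox, text):
    """Format one text box as: x, y, "text", with inner newlines shown as underscores."""
    x0, y0 = bbox[0], bbox[1]
    # Assumption: drop trailing newlines so the output has no trailing underscore.
    cleaned = text.rstrip("\n").replace("\n", "_")
    return '%6d, %6d, "%s"' % (x0, y0, cleaned)


def iter_text_boxes(pdf_path):
    """Yield (bbox, text) for every text box, recursing into figures."""
    # Imported lazily so format_box can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTTextBox

    def walk(objs):
        for obj in objs:
            if isinstance(obj, LTTextBox):
                yield obj.bbox, obj.get_text()
            elif isinstance(obj, LTFigure):
                # Figures are containers, so they can be iterated directly.
                yield from walk(obj)

    for page_layout in extract_pages(pdf_path):
        yield from walk(page_layout)
```

Usage would look something like `for bbox, text in iter_text_boxes('test.pdf'): print(format_box(bbox, text))`, where the file name is a placeholder.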

Answer by Mark Amery

Here's a copy-and-paste-ready example that lists the top-left corners of every block of text in a PDF, and which I think should work for any PDF that doesn't include "Form XObjects" that have text in them:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open('yourpdf.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

The code above is based upon the "Performing Layout Analysis" example in the PDFMiner docs, plus the examples by pnj (https://stackoverflow.com/a/22898159/1709587) and Matt Swain (https://stackoverflow.com/a/25262470/1709587). There are a couple of changes I've made from these previous examples:

  • I use PDFPage.get_pages(), which is a shorthand for creating a document, checking it is_extractable, and passing it to PDFPage.create_pages()
  • I don't bother handling LTFigures, since PDFMiner is currently incapable of cleanly handling text inside them anyway.

LAParams lets you set some parameters that control how individual characters in the PDF get magically grouped into lines and textboxes by PDFMiner. If you're surprised that such grouping is a thing that needs to happen at all, it's justified in the pdf2txt docs:

In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks.

LAParams's parameters are, like most of PDFMiner, undocumented, but you can see them in the source code or by calling help(LAParams) at your Python shell. The meaning of some of the parameters is given at https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py since they can also be passed as arguments to pdf2txt.py at the command line.

The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types...

  • LTTextBox
  • LTFigure
  • LTImage
  • LTLine
  • LTRect

... or their subclasses. (In particular, your textboxes will probably all be LTTextBoxHorizontals.)

More detail of the structure of an LTPage is shown by this image from the docs:

[Image: tree diagram of the structure of an LTPage. Of relevance to this answer: it shows that an LTPage contains the 5 types listed above, that an LTTextBox contains LTTextLines plus unspecified other stuff, and that an LTTextLine contains LTChars, LTAnnos, LTTexts, and unspecified other stuff.]

Each of the types above has a .bbox property that holds a (x0, y0, x1, y1) tuple containing the coordinates of the left, bottom, right, and top of the object respectively. The y-coordinates are given as the distance from the bottom of the page. If it's more convenient for you to work with the y-axis going from top to bottom instead, you can subtract them from the height of the page's .mediabox:

x0, y0_orig, x1, y1_orig = some_lobj.bbox
y0 = page.mediabox[3] - y1_orig
y1 = page.mediabox[3] - y0_orig
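
The same flip can be wrapped in a small function. This is a minimal sketch: the name `to_top_left_origin` is a hypothetical helper introduced here, and `page_top` is assumed to be the value of `page.mediabox[3]` for the page the object came from.

```python
def to_top_left_origin(bbox, page_top):
    """Convert a (x0, y0, x1, y1) bbox measured from the bottom of the page
    into one whose y-coordinates are measured from the top of the page."""
    x0, y0_orig, x1, y1_orig = bbox
    # The original top edge (y1_orig) becomes the smaller y value after the flip.
    y0 = page_top - y1_orig
    y1 = page_top - y0_orig
    return (x0, y0, x1, y1)
```

With this, the top-left corner of a text box in top-down coordinates is simply the first two elements of the returned tuple.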

In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is made up of LTTextLines, which in turn contain LTChars (characters explicitly drawn by the PDF, with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content based upon the characters being drawn a long way apart; these have no bbox).
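
If per-character coordinates are needed, one option is to walk a text box down to its characters and skip anything without a bbox. This is a minimal sketch rather than an official recipe: `char_positions` is a hypothetical helper name, and it relies on duck typing (checking for a `bbox` attribute) instead of `isinstance` checks, on the assumption that iterating a text box yields lines and iterating a line yields character-like objects.

```python
def char_positions(textbox):
    """Yield (character, x0, y0) for each drawn character in a text box.

    Objects without a bbox (such as the inserted LTAnno spaces) are skipped.
    """
    for line in textbox:          # each line of the text box
        for item in line:         # each character-like object in the line
            bbox = getattr(item, "bbox", None)
            if bbox is None:
                continue          # no coordinates: an inserted space or newline
            yield item.get_text(), bbox[0], bbox[1]
```

Called on an LTTextBox from the loop at the start of this answer, this would give one tuple per drawn character.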

The code example at the beginning of this answer combined these two properties to show the coordinates of each block of text.

Finally, it's worth noting that, unlike the other Stack Overflow answers cited above, I don't bother recursing into LTFigures. Although LTFigures can contain text, PDFMiner doesn't seem capable of grouping that text into LTTextBoxes (you can try yourself on the example PDF from https://stackoverflow.com/a/27104504/1709587) and instead produces an LTFigure that directly contains LTChar objects. You could, in principle, figure out how to piece these together into a string, but PDFMiner (as of version 20181108) can't do it for you.
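
As a rough illustration of what piecing the characters together might involve, here is a naive sketch. It is not PDFMiner's layout algorithm: `chars_to_text` and its `line_tolerance` parameter are hypothetical names, the input is assumed to be a list of (text, bbox) pairs collected from the LTChar objects in a figure, and it makes no attempt to insert spaces between words.

```python
def chars_to_text(chars, line_tolerance=2.0):
    """Join (text, (x0, y0, x1, y1)) pairs into a string, one line per row.

    Characters whose bottom edges are within line_tolerance of the first
    character of the current row are treated as being on the same line.
    """
    # Read top to bottom (larger y first), then left to right.
    ordered = sorted(chars, key=lambda c: (-c[1][1], c[1][0]))
    lines = []
    current = []
    current_y = None
    for text, bbox in ordered:
        y = bbox[1]
        if current_y is not None and abs(y - current_y) > line_tolerance:
            lines.append(current)   # baseline moved too far: start a new line
            current = []
            current_y = None
        if current_y is None:
            current_y = y
        current.append((bbox[0], text))
    if current:
        lines.append(current)
    # Re-sort each line by x, since small y differences can disturb the order.
    return "\n".join(
        "".join(t for _, t in sorted(line, key=lambda p: p[0])) for line in lines
    )
```

The input list could be built with something like `[(c.get_text(), c.bbox) for c in figure if isinstance(c, LTChar)]`, where `figure` is an LTFigure; real documents would likely need word spacing and rotation handling on top of this.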

Hopefully, though, the PDFs you need to parse don't use Form XObjects with text in them, and so this caveat won't apply to you.
