Searching text in a PDF using Python?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow CC BY-SA and attribute it to the original authors (not me): StackOverflow, original source: http://stackoverflow.com/questions/17098675/

Tags: python, parsing, pdf, text

Asked by Insarov

Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc) by searching through its text, preferably using python. All PDFs are searchable, but I haven't found a solution to parsing it with python and applying a script to search it (short of converting it to a text file first, but that could be resource-intensive for n documents).

What I've done so far
I've looked into pypdf, pdfminer, adobe pdf documentation, and any questions here I could find (though none seemed to directly solve this issue). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.

Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

Answered by Paulo Scardine

This is called PDF mining, and is very hard because:

  • PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order matters for printing); most of the time the original text structure is lost (letters may not be grouped as words and words may not be grouped into sentences, and the order in which they are placed on the page is often random).
  • There are tons of programs generating PDFs, and many of them are defective.

Tools like PDFminer use heuristics to group letters and words again, based on their position on the page. I agree the interface is pretty low level, but it makes more sense once you know what problem they are trying to solve (in the end, what matters is choosing how close to its neighbors a letter/word/line has to be in order to be considered part of a paragraph).

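For what it's worth, pdfminer.six exposes exactly those neighbor-distance thresholds through its LAParams object. A minimal sketch, assuming pdfminer.six is installed (the file name and keyword are placeholders; the margin values shown are just the library defaults):

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Tune how aggressively letters/words/lines are grouped together
laparams = LAParams(
    char_margin=2.0,   # roughly, the max horizontal gap for characters to stay on one line
    word_margin=0.1,   # the gap at which a space is inserted between words
    line_margin=0.5,   # the max vertical gap for lines to be grouped into one text box
)

text = extract_text('document.pdf', laparams=laparams)
if 'subpoena' in text.lower():
    print('probably a subpoena')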

An expensive alternative (in terms of time/computing power) is generating an image for each page and feeding it to OCR; this may be worth a try if you have a very good OCR.

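If you try that route, here is a rough sketch under my own assumptions (the answer above names no specific tools): it uses pdf2image, which needs the poppler utilities, and pytesseract, which needs the tesseract engine; the file name and DPI are placeholders.

from pdf2image import convert_from_path
import pytesseract

# Rasterize each PDF page to a PIL image, then OCR each image
pages = convert_from_path('document.pdf', dpi=300)
text = '\n'.join(pytesseract.image_to_string(page) for page in pages)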

So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proven wrong.

[update]

The answer has not changed, but recently I was involved in two projects: one of them uses computer vision to extract data from scanned hospital forms; the other extracts data from court records. What I learned is:

  1. Computer vision is within reach of mere mortals in 2018. If you have a good sample of already-classified documents, you can use OpenCV or SciKit-Image to extract features and train a machine learning classifier to determine what type a document is.

  2. If the PDF you are analyzing is "searchable", you can get very far extracting all the text using software like pdftotext plus a Bayesian filter (the same kind of algorithm used to classify SPAM); see the sketch after this list.

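To make point 2 concrete, here is a hedged sketch of that pipeline: the pdftotext binary for extraction plus a naive Bayes classifier from scikit-learn. The file names, labels, and the tiny two-document training set are invented for illustration only.

import subprocess
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pdf_to_text(path):
    # 'pdftotext file.pdf -' writes the extracted text to stdout
    return subprocess.run(['pdftotext', path, '-'],
                          capture_output=True, text=True).stdout

# Train on documents whose type is already known
train_texts = [pdf_to_text(p) for p in ['pleading1.pdf', 'subpoena1.pdf']]
train_labels = ['pleading', 'subpoena']

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# Classify a new, unseen document
print(classifier.predict(vectorizer.transform([pdf_to_text('unknown.pdf')])))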

So there is no reliable and effective method for extracting text from PDF files but you may not need one in order to solve the problem at hand (document type classification).

Answered by qwwqwwq

I agree with @Paulo: PDF data-mining is a huge pain. But you might have success with pdftotext, which is part of the Xpdf suite, freely available here:

http://www.foolabs.com/xpdf/download.html

This should be sufficient for your purpose if you are just looking for single keywords.

pdftotext is a command-line utility, and very straightforward to use. It will give you text files, which you may find easier to work with.

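As a hedged example of that workflow from Python (the file names and the keyword are placeholders, and it assumes the pdftotext binary is on your PATH):

import subprocess

# Convert the PDF to a text file, then scan it line by line for a keyword
subprocess.run(['pdftotext', 'document.pdf', 'document.txt'], check=True)
with open('document.txt') as f:
    for number, line in enumerate(f, start=1):
        if 'subpoena' in line.lower():
            print(number, line.strip())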

Answered by MikeHunter

I've written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, etc.), and @Paulo Scardine is correct: there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert a searchable PDF to a text file, which you can read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text. Some PDFs contain only images with no text at all.

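Building on that caveat, here is a small heuristic sketch of my own (not part of the original answer) for detecting image-only PDFs: if pdftotext extracts almost nothing, the document probably has no text layer and needs OCR instead. The character threshold is arbitrary.

import subprocess

def is_searchable(path, min_chars=20):
    # If extraction yields (almost) no text, the PDF is likely image-only
    out = subprocess.run(['pdftotext', '-layout', path, '-'],
                         capture_output=True, text=True).stdout
    return len(out.strip()) >= min_chars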

Answered by JasTonAChair

I recently started using ScraperWiki to do what you described.

Here's an example of using ScraperWiki to extract PDF data.

The scraperwiki.pdftoxml() function returns an XML structure.

You can then use BeautifulSoup to parse that into a navigable tree.

Here's my code:

import scraperwiki, urllib2
from bs4 import BeautifulSoup

def send_Request(url):
    # Get content, regardless of whether it is an HTML, XML or PDF file
    pageContent = urllib2.urlopen(url)
    return pageContent

def process_PDF(fileLocation):
    # Use this to get the PDF and convert it to XML
    pdfToProcess = send_Request(fileLocation)
    pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
    return pdfToObject

def parse_HTML_tree(contentToParse):
    # Returns a navigable tree, which you can iterate through
    soup = BeautifulSoup(contentToParse)
    return soup

pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
    print line

This code is going to print a whole, big ugly pile of <text> tags. Each page is separated with a </page>, if that's any consolation.

If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents

If you only want each line of text, not including tags, use line.getText()

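For example, a small keyword filter over the lines parsed above (this continues from the soupToArray variable in the code block; the keyword is a placeholder):

for line in soupToArray:
    if 'subpoena' in line.getText().lower():
        print(line.getText())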

It's messy and painful, but it will work for searchable PDF docs. So far I've found it to be accurate, if tedious.

Answered by florin27

Here is the solution that I found comfortable for this issue. The text variable holds the text from the PDF so you can search in it. I have also kept the idea of splitting the text into keywords, as found on this website: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f, from which I took this solution; although setting up nltk was not very straightforward, it might be useful for further purposes:

import PyPDF2 
import textract

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def searchInPDF(filename, key):
    occurrences = 0
    pdfFileObj = open(filename,'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()
    if text == "":
        # No text layer was extracted; fall back to OCR via textract/tesseract
        text = textract.process(filename, method='tesseract', language='eng')
    tokens = word_tokenize(text)
    punctuation = ['(',')',';',':','[',']',',']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if word not in stop_words and word not in punctuation]
    for k in keywords:
        if key == k: occurrences+=1
    return occurrences 

pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print searchInPDF(pdf_filename, search_for)

Answered by Emma Yu

I am a total beginner, but this script works for me:

# import packages
import PyPDF2
import re

# open the pdf file ("object" would shadow a Python builtin, so use another name)
pdfReader = PyPDF2.PdfFileReader("test.pdf")

# get number of pages
NumPages = pdfReader.getNumPages()

# define keyterms
String = "Social"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = pdfReader.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(String, Text)
    print(ResSearch)

Answered by Cory Brickner

Trying to pick through PDFs for keywords is not an easy thing to do. I tried to use the pdfminer library with very limited success. It's basically because PDFs are pandemonium incarnate when it comes to structure. Everything in a PDF can stand on its own or be a part of a horizontal or vertical section, backwards or forwards. Pdfminer was having issues translating one page, not recognizing the font, so I tried another direction — optical character recognition of the document. That worked out almost perfectly.

Wand converts all the separate pages in the PDF into image blobs, then you run OCR over the image blobs. What I have as a BytesIO object is the content of the PDF file from the web request. BytesIO is a streaming object that simulates a file load, as if the object were coming off disk, which Wand requires for its file parameter. This lets you keep the data in memory instead of having to save the file to disk first and then load it.

Here's a very basic code block that should be able to get you going. I can envision various functions that would loop through different URL / files, different keyword searches for each file, and different actions to take, possibly even per keyword and file.

# http://docs.wand-py.org/en/0.5.9/
# http://www.imagemagick.org/script/formats.php
# brew install freetype imagemagick
# brew install PIL
# brew install tesseract
# pip3 install wand
# pip3 install pyocr
import pyocr
import pyocr.builders
import requests
from io import BytesIO
from PIL import Image as PI
from wand.image import Image

if __name__ == '__main__':
    pdf_url = 'https://www.vbgov.com/government/departments/city-clerk/city-council/Documents/CurrentBriefAgenda.pdf'
    req = requests.get(pdf_url)
    content_type = req.headers['Content-Type']
    modified_date = req.headers['Last-Modified']
    content_buffer = BytesIO(req.content)
    search_text = 'tourism investment program'

    if content_type == 'application/pdf':
        tool = pyocr.get_available_tools()[0]
    lang = 'eng' if 'eng' in tool.get_available_languages() else None
        image_pdf = Image(file=content_buffer, format='pdf', resolution=600)
        image_jpeg = image_pdf.convert('jpeg')

        for img in image_jpeg.sequence:
            img_page = Image(image=img)
            txt = tool.image_to_string(
                PI.open(BytesIO(img_page.make_blob('jpeg'))),
                lang=lang,
                builder=pyocr.builders.TextBuilder()
            )
            if search_text in txt.lower():
                print('Alert! {} {} {}'.format(search_text, txt.lower().find(search_text),
                                               modified_date))

    req.close()