Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not the translator): StackOverflow
Original question: http://stackoverflow.com/questions/3637781/
Converting a pdf to text/html in python so I can parse it
Asked by Thomas Jensen
I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if not new_link:  # new_link is a list, so test for emptiness rather than None
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)
            localfile = open(name_pdf, "wb")  # PDFs are binary, so write in "wb" mode
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)
page = range(1, 2)       # can be set to 400 to get every document for a given year
year = range(1999, 2000) # can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name="byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y, p)
        else:
            print "%s %s Writing dossier..." % (y, p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?
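(Once the text has been extracted, the "parse the text for information" step could look something like the following sketch. The procedure-reference format "1999/0123(COD)" is an assumption for illustration, not taken from the question itself.)

```python
import re

# Hypothetical sketch: scan extracted PDF text for legislative
# procedure references of the form "1999/0123(COD)".  The exact
# reference format is an assumption made for this example.
PROCEDURE_RE = re.compile(r"\b(\d{4})/(\d{4})\(([A-Z]{3})\)")

def find_procedures(text):
    """Return (year, number, type) tuples found in extracted PDF text."""
    return PROCEDURE_RE.findall(text)

sample = "Report on proposal 1999/0123(COD), amending 2000/0456(CNS)."
print(find_procedures(sample))
# -> [('1999', '0123', 'COD'), ('2000', '0456', 'CNS')]
```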
Thomas
Accepted answer by loevborg
It's not exactly magic. I suggest
- downloading the PDF file to a temp directory,
- calling out to an external program to extract the text into a (temp) text file,
- reading the text file.
For text extraction command-line utilities you have a number of possibilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
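(The download-then-call-out approach described above could be sketched as follows. This assumes the pdftotext utility from Poppler/Xpdf is installed and on the PATH; the helper names are illustrative, not from the answer.)

```python
import os
import subprocess
import tempfile

def pdftotext_cmd(pdf_path, txt_path):
    """Build the command line for the pdftotext utility (Poppler/Xpdf).

    -layout preserves the physical layout of the page, which often
    makes the extracted text easier to parse.
    """
    return ["pdftotext", "-layout", pdf_path, txt_path]

def extract_text(pdf_path):
    """Run pdftotext on a downloaded PDF and return the extracted text.

    Assumes pdftotext is installed; raises CalledProcessError if the
    conversion fails.
    """
    txt_path = os.path.join(tempfile.mkdtemp(), "out.txt")
    subprocess.check_call(pdftotext_cmd(pdf_path, txt_path))
    with open(txt_path) as f:
        return f.read()
```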
Answered by Hyman Cushman
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would return pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]

