Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not the translator): StackOverflow
Original question: http://stackoverflow.com/questions/3637781/
Converting a pdf to text/html in python so I can parse it
Asked by Thomas Jensen
I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to Adobe's online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *

adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"

def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if not new_link:  # new_link is a list, so test for emptiness rather than None
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y, p)
            localfile = open(name_pdf, "wb")  # PDFs are binary, so write in "wb" mode
            localfile.write(pdf.read())
            localfile.close()

            br.open(adobe)
            br.select_form(name="convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] = ["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)
page = range(1, 2)       # can be set to 400 to get every document for a given year
year = range(1999, 2000) # can be set to 2011 to get documents from all years

for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name="byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y, p)
        else:
            print "%s %s Writing dossier..." % (y, p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
In the get_pdf() function I would like to convert the pdf file to text in python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?
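(Once the text has been extracted, the "parse the text for information" step could look something like the following sketch. The procedure-reference format "1999/0123(COD)" is an assumption for illustration, not taken from the question itself.)

```python
import re

# Hypothetical sketch: scan extracted PDF text for legislative
# procedure references of the form "1999/0123(COD)".  The exact
# reference format is an assumption made for this example.
PROCEDURE_RE = re.compile(r"\b(\d{4})/(\d{4})\(([A-Z]{3})\)")

def find_procedures(text):
    """Return (year, number, type) tuples found in extracted PDF text."""
    return PROCEDURE_RE.findall(text)

sample = "Report on proposal 1999/0123(COD), amending 2000/0456(CNS)."
print(find_procedures(sample))
# -> [('1999', '0123', 'COD'), ('2000', '0456', 'CNS')]
```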
Thomas
Accepted answer by loevborg
It's not exactly magic. I suggest
- downloading the PDF file to a temp directory,
- calling out to an external program to extract the text into a (temp) text file,
- reading the text file.
For text extraction command-line utilities you have a number of possibilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
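(The download-then-call-out approach described above could be sketched as follows. This assumes the pdftotext utility from Poppler/Xpdf is installed and on the PATH; the helper names are illustrative, not from the answer.)

```python
import os
import subprocess
import tempfile

def pdftotext_cmd(pdf_path, txt_path):
    """Build the command line for the pdftotext utility (Poppler/Xpdf).

    -layout preserves the physical layout of the page, which often
    makes the extracted text easier to parse.
    """
    return ["pdftotext", "-layout", pdf_path, txt_path]

def extract_text(pdf_path):
    """Run pdftotext on a downloaded PDF and return the extracted text.

    Assumes pdftotext is installed; raises CalledProcessError if the
    conversion fails.
    """
    txt_path = os.path.join(tempfile.mkdtemp(), "out.txt")
    subprocess.check_call(pdftotext_cmd(pdf_path, txt_path))
    with open(txt_path) as f:
        return f.read()
```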
Answered by Hyman Cushman
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would return pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]

