Linux: Reading data from PDF files into R
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/9185831/
Reading data from PDF files into R
Asked by Justin
Is that even possible!?!
I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF? Or should I leave that to a command line tool?
The reports were made in Excel and then converted to PDF, so they have a regular structure but many blank "cells".
Accepted answer by Carl Witthoft
Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.
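One rough way to spot the "no actual text" case before investing in parsing is to look at what a text extractor returns for each page. The sketch below is only an illustration: it assumes the page text is already in a character vector (for example from pdftools::pdf_text(), which returns one string per page) and that image-only pages come back empty or whitespace-only. The helper name pages_without_text and the sample vector are made up for the example.

```r
# Return the indices of pages that contain no letters or digits at all.
# `pages` is a character vector with one element per page, e.g. the result
# of pdftools::pdf_text("report.pdf") (the file name is a placeholder).
pages_without_text <- function(pages) {
  which(!grepl("[[:alnum:]]", pages))
}

# Hypothetical extractor output: page 1 has text, pages 2 and 3 do not
sample_pages <- c("Table 1: Monthly totals", "  \n ", "")
pages_without_text(sample_pages)
```

Pages flagged this way are candidates for OCR; pages that pass may still come out in an unexpected reading order.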
On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.
Probably better to make a couple grad students transcribe the data for you. They're cheap :-)
Answer by Justin
So... this gets me close even on a fairly complex table.
Download a sample PDF (linked as "bmi pdf" in the original answer) and save it as bmi_tbl.pdf:
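library(tm)
# Build a reader that calls pdftotext with -layout to preserve column alignment
# (PdftotextOptions is the argument name in older tm releases)
pdf <- readPDF(PdftotextOptions = "-layout")
dat <- pdf(elem = list(uri = 'bmi_tbl.pdf'), language = 'en', id = 'id1')
# Collapse runs of spaces into commas so each line parses as CSV
dat <- gsub(' +', ',', dat)
out <- read.csv(textConnection(dat), header = FALSE)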
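One caveat for tables with blank cells, like the ones in the question: collapsing spaces into commas makes a blank cell disappear, so the remaining values shift left. Because -layout keeps columns aligned, reading by character position is one possible workaround. The sketch below uses made-up lines and hand-measured widths (7, 8, 6), so treat both as assumptions rather than something the tools produce for you.

```r
# Hypothetical lines, as pdftotext -layout might emit them for a
# three-column table with one blank cell in the middle column
lines <- c(
  "name   height  weight",
  "ann    160     55",
  "bob            80",
  "cy     175     70"
)

# Splitting on runs of spaces drops the blank cell: bob's row has 2 fields
collapsed <- strsplit(gsub(" +", ",", lines), ",")
field_counts <- lengths(collapsed)

# Reading by column position keeps the blank cell as NA instead.
# The widths are read off the sample above, not detected automatically.
tbl <- read.fwf(textConnection(lines), widths = c(7, 8, 6),
                skip = 1, strip.white = TRUE,
                col.names = c("name", "height", "weight"))
```

Here field_counts shows the short row, while tbl has NA for bob's height and 80 in the weight column where it belongs.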
Answer by Paul McGee
Per zx8754, the following works on Windows 7 with pdftotext.exe in the working directory:
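library(tm)
uri <- 'bmi_tbl.pdf'
# readPDF() returns a reader function, which is then applied to the file;
# "-layout" is the option handed to pdftotext
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en", id = "id1")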
Answer by hrbrmstr
You can also (now) use the new (2015-07) Rpoppler package:
Rpoppler::PDF_text(file)
It includes 3 functions (4, really, but one just gets you a ptr to the PDF object):
PDF_fonts: PDF font information
PDF_info: PDF document information
PDF_text: PDF text extraction
(posting as an answer to help new searchers find the package).
Answer by Ben
The current package du jour for getting text out of PDFs is pdftools (successor to Rpoppler, noted above), which works great on Linux, Windows and OSX:
install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")
# first page text
cat(txt[1])
# second page text
cat(txt[2])
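Since pdf_text() gives back one string per page, getting from there to something you can load into a database still takes a parsing step. Here is a minimal sketch of that step, using a made-up page string in place of a real element of txt and assuming columns are separated by two or more spaces. It only works when no cell is blank, since a blank cell merges with the gaps around it.

```r
# Hypothetical page text standing in for one element of pdf_text() output
page <- "Item    Qty   Price\nbolt     10    0.25\nnut      12    0.10\n"

# Split the page into lines and drop any that are empty
lines <- strsplit(page, "\n")[[1]]
lines <- trimws(lines[nzchar(trimws(lines))])

# Split each line into fields on runs of two or more spaces
fields <- strsplit(lines, " {2,}")

# First line is the header; the rest become rows of a data frame
tbl <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(tbl) <- fields[[1]]
tbl$Qty <- as.integer(tbl$Qty)
tbl$Price <- as.numeric(tbl$Price)
```

Splitting on two or more spaces rather than one keeps multi-word cells such as "hex bolt" together, at the cost of assuming the layout never separates columns by a single space.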