Linux: Reading data from a PDF file into R

Question

Asked by Justin

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDF? Or should I leave that to a command line tool?

The reports were created in Excel and then exported to PDF, so they have a regular structure, but many blank "cells".

Answer 1

Accepted answer by Carl Witthoft

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.

On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)
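
The point about image-only PDFs can be checked cheaply before building a pipeline: if text extraction returns nothing for a page, that page probably needs OCR. A minimal sketch, assuming the page text has already been extracted into a character vector with one element per page (for example with pdf_text() from the pdftools answer further down). The helper name pages_without_text and the sample strings are made up for illustration:

```r
# Hypothetical helper: return the indices of pages whose extracted text is
# empty or whitespace-only. `pages` is a character vector, one element per page.
pages_without_text <- function(pages) {
  which(!nzchar(trimws(pages)))
}

# Stand-in for extracted text (invented data): nothing came out of page 2,
# which is what a scanned, image-only page typically looks like.
pages <- c("Report 2009\nTotal 42", "   \n", "Appendix")

empty <- pages_without_text(pages)
print(empty)
```

A non-empty result does not prove the remaining pages were read in the right order; it only rules out the "no text at all" case.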

Answer 2

Answered by Justin

So... this gets me close even on a fairly complex table.

Download the sample PDF ("bmi pdf"); the code below expects it to be saved as bmi_tbl.pdf.

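library(tm)
# Build a reader that calls pdftotext with -layout to preserve column alignment.
# (PdftotextOptions is the argument name from older tm releases; the next
# answer uses the control = list(text = ...) form.)
pdf <- readPDF(PdftotextOptions = "-layout")
dat <- pdf(elem = list(uri = 'bmi_tbl.pdf'), language = 'en', id = 'id1')
# Turn each run of spaces into a comma, then parse the result as CSV.
dat <- gsub(' +', ',', dat)
out <- read.csv(textConnection(dat), header = FALSE)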
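
One caveat with the gsub step: collapsing runs of spaces also collapses blank cells, so a row with a missing value ends up with fewer fields and the later values shift left. Because -layout output keeps columns aligned by character position, reading by fixed widths is one possible alternative. A minimal sketch, assuming column widths of 8, 8 and 4 characters; the sample lines and column names are invented for illustration and are not taken from the bmi PDF:

```r
# Stand-in for `pdftotext -layout` output (made-up data, not from bmi_tbl.pdf).
# The first data row has a blank "cell" in the middle column.
layout_lines <- c(
  "height  weight  bmi ",
  "150             20.0",
  "160     60      23.4"
)

# Collapsing spaces loses the blank cell: this row now has two fields, not three.
collapsed <- gsub(" +", ",", layout_lines[2])

# Reading by character position keeps the blank cell as NA instead.
# The widths (8, 8, 4) are assumptions read off the sample above.
fixed <- read.fwf(
  textConnection(layout_lines[-1]),   # skip the header line
  widths      = c(8, 8, 4),
  col.names   = c("height", "weight", "bmi"),
  strip.white = TRUE
)
print(fixed)
```

For real reports the widths have to be worked out per layout, for instance by inspecting a few lines of the extracted text, so this only pays off when the reports share one template.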

Answer 3

Answered by Paul McGee

Per zx8754, the following works on Windows 7 with pdftotext.exe in the working directory:

library(tm)
uri = 'bmi_tbl.pdf'
pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                language = "en", id = "id1")
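
Since this approach depends on an external pdftotext executable, it can help to check that R can locate one before calling readPDF. A minimal sketch using base R's Sys.which; whether a pdftotext.exe that sits only in the working directory is found this way is an assumption the answer does not confirm, so treat a negative result as a hint rather than proof:

```r
# Sys.which() returns "" when the program cannot be found on the search path.
pdftotext_path <- unname(Sys.which("pdftotext"))
has_pdftotext  <- nzchar(pdftotext_path)

if (has_pdftotext) {
  message("pdftotext found at: ", pdftotext_path)
} else {
  message("pdftotext not found on the search path")
}
```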

Answer 4

Answered by hrbrmstr

You can also (now) use the new (2015-07) Rpoppler package:

Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one just gets you a pointer to the PDF object):

PDF_fonts: PDF font information
PDF_info: PDF document information
PDF_text: PDF text extraction

(Posting as an answer to help new searchers find the package.)

Answer 5

Answered by Ben

The current package du jour for getting text out of PDFs is pdftools (successor to Rpoppler, noted above). It works great on Linux, Windows and OS X:

install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
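
pdf_text returns one string per page, with the lines of the page separated by newline characters, so getting from there to table rows usually starts with splitting each page into lines. A minimal sketch, using a made-up string in place of a real txt[1] so that it runs without downloading anything:

```r
# Stand-in for a single element of pdf_text() output (invented content).
page <- "Title line\n\nfirst row   1\nsecond row  2\n"

# Split the page into lines and drop the ones that are empty or blank.
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
page_lines <- page_lines[nzchar(trimws(page_lines))]

print(page_lines)
```

From here each line can be parsed further, by position or by pattern, depending on how regular the report layout is.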
