Ruby: Reading PDF files

Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/773193/

Date: 2020-09-02 21:10:30  Source: igfitidea

Tags: ruby-on-rails, ruby, pdf, pdf-parsing

Asked by Javier

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, even though the two libraries provide exactly the functionality I'm looking for.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?

Accepted answer by pw.

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

Answer by Javier

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf command-line utilities.
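
Since the appeal here is that the library is only a thin wrapper, the same idea can be sketched directly with Ruby's standard library. This is a minimal sketch, not PDF-Toolkit's actual code: it assumes the pdftotext binary (from xpdf or poppler) is on the PATH, and the method names pdftotext_command and pdf_text are made up for illustration.

```ruby
require 'open3'

# Build the argument list for pdftotext (hypothetical helper, not part of
# PDF-Toolkit). Passing an array avoids shell quoting problems with paths.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil, layout: false)
  cmd = ['pdftotext', '-enc', 'UTF-8']
  cmd << '-layout' if layout                      # keep the physical page layout
  cmd += ['-f', first_page.to_s] if first_page    # first page to convert
  cmd += ['-l', last_page.to_s] if last_page      # last page to convert
  cmd + [pdf_path, '-']                           # '-' sends the text to stdout
end

# Run pdftotext and return the extracted text as a String.
# Assumes the pdftotext binary is installed; raises if it exits with an error.
def pdf_text(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, **options))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end

# Example call (the file name is a placeholder):
#   puts pdf_text('report.pdf', first_page: 1, last_page: 3)
```

Limiting the page range with first_page and last_page is one way to keep memory use down on large files, since only the requested pages are converted.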

Answer by insane.dreamer

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

Answer by Myst

Did you have a look at the CombinePDF library?

It's a pure-Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.

Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and stamps it onto every page of another PDF file.

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text, number pages, or write simple tables:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

pdf.number_pages # adds page numbers; you can add formatting and placement options

pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"

# you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.

Answer by Alexis Perrier

If you just need to get the text content out of a PDF file, pdftohtml (on SourceForge) is efficient. It is not suited for dealing with images.
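
As a rough illustration of how that could look from Ruby, the sketch below asks pdftohtml for XML on stdout and groups the text fragments by page. It rests on several assumptions: the pdftohtml binary is installed, its -xml output uses page elements with a number attribute and text child elements, and the rexml gem is available. The method names are invented for this example.

```ruby
require 'open3'
require 'rexml/document'

# Command line for text-only XML output: -i skips images, -stdout avoids
# writing files next to the PDF. (Hypothetical helper name.)
def pdftohtml_command(pdf_path)
  ['pdftohtml', '-xml', '-i', '-stdout', pdf_path]
end

# Concatenate the text of an element, including nested tags such as <b> or <i>.
def inner_text(element)
  element.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then inner_text(child)
    else ''
    end
  end.join
end

# Turn pdftohtml's XML into { page_number => [text fragments] }.
def parse_pdftohtml_xml(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//page').each_with_object({}) do |page, pages|
    number = page.attributes['number'].to_i
    pages[number] = page.get_elements('text').map { |text| inner_text(text) }
  end
end

# Run the tool and parse its output; raises if pdftohtml reports an error.
def pdf_pages(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftohtml_command(pdf_path))
  raise "pdftohtml failed: #{stderr.strip}" unless status.success?
  parse_pdftohtml_xml(stdout)
end

# Example call (the file name is a placeholder):
#   pdf_pages('report.pdf').each { |number, lines| puts "#{number}: #{lines.size} fragments" }
```

Keeping the parsing separate from the subprocess call makes it possible to check the parser against a saved XML sample without having the tool installed.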

Answer by Terry

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or you could try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to shell out to do it, like gocr -i whatever.pdf (I think it works with PDFs).
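
Shelling out for OCR might look like the sketch below. Because it is uncertain whether gocr accepts a PDF directly, this version takes the cautious route and first rasterises each page to a grayscale PGM image with pdftoppm, then runs gocr on each image. Both binaries are assumed to be installed, and all method names here are hypothetical.

```ruby
require 'open3'
require 'tmpdir'

# pdftoppm writes one image per page: <prefix>-<page>.pgm when -gray is used.
def rasterize_command(pdf_path, output_prefix, dpi: 300)
  ['pdftoppm', '-r', dpi.to_s, '-gray', pdf_path, output_prefix]
end

# gocr prints the recognised text to stdout.
def gocr_command(image_path)
  ['gocr', '-i', image_path]
end

# Run a command and return its stdout, raising if it exits with an error.
def run!(cmd)
  stdout, stderr, status = Open3.capture3(*cmd)
  raise "#{cmd.first} failed: #{stderr.strip}" unless status.success?
  stdout
end

# Returns an array with the OCR text of each page, in page order.
# Page numbers in the file names are zero-padded, so a plain sort keeps order.
def ocr_pdf(pdf_path, dpi: 300)
  Dir.mktmpdir('pdf-ocr') do |dir|
    run!(rasterize_command(pdf_path, File.join(dir, 'page'), dpi: dpi))
    Dir.glob(File.join(dir, 'page-*.pgm')).sort.map { |image| run!(gocr_command(image)) }
  end
end

# Example call (the file name is a placeholder):
#   ocr_pdf('scanned.pdf').each_with_index { |text, i| puts "--- page #{i + 1} ---", text }
```

A higher dpi value generally gives the OCR step more to work with at the cost of larger temporary files, so it is worth tuning for scanned documents.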

The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.
