Note: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/9951427/

Is there a C++ library to extract text from a PDF file like PDFBox for Java?

c++ pdf

Asked by Adam Smith

Last year, I made an application in Java using PDFBox to get the raw text in some PDF files and I need to port that application to C++ now.

I wanted to know what was the best C++ alternative to accomplish what I need.

I'll give an example in case it helps:

Most files will look like this: http://www.jumbala.net/backup/league.pdf

With PDFBox, using that file, each line read on page 2 and most of page 3 would output all the data of a line, separated by a space instead of keeping it in a grid like it is now.

So the first relevant line in page 2 would look like this:

FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615

or something like that. There are minor changes in the order in which the values appear, but I don't care about that as long as similar lines produce the same output, since I just parse them and put the values I need in different variables.

So, knowing all of that, is there a library that I can use in a C++ program to get similar results?

Edit: After looking at sacredfaith's link at http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file and trying it, I'm getting weird output like this for the example file I mentioned earlier:

http://www.jumbala.net/backup/league.pdf.txt

The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X and using Save As... Text (accessible), I get the following result:

http://www.jumbala.net/backup/league_good.pdf.txt

That is approximately what I get in Java using PDFBox and what I want to get as output in C++.

Accepted answer by Charles Salvia

Xpdf is a C++ application/library that includes tools to extract plain text from a PDF file.

Answered by grifos

Since that's what you're looking for: PoDoFo is a C++ library to parse, read, modify, or create PDF files. The library is cross-platform.

Answered by sacredfaith

I've never used the following, but after some Googling I found this:

http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file
