PDF parsing in C++ (PoDoFo)

Question

Asked by csteifel

Hi, I'm trying to parse some text from some PDFs and I would like to use PoDoFo. I have tried searching for examples of how to use PoDoFo to parse a PDF, but all I can find are examples of how to create and write a PDF file, which is not what I need.

If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or suggestions for a different library I could use, please let me know. I know there is pdftotext on Linux; however, not only can I not use it, but I would much rather do everything I need internally and not rely on outside programs being installed.

Answer 1

Answered by Ferruccio

PoDoFo does not provide a means to easily extract text from a document, but it is not hard to do.

Load a document into a PdfMemDocument:

PoDoFo::PdfMemDocument pdf("mydoc.pdf");

Iterate over each page:

for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PoDoFo::PdfPage* page = pdf.GetPage(pn);

Iterate over all the PDF commands on that page:

    PoDoFo::PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PoDoFo::PdfVariant var;
    PoDoFo::EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        switch (type) {
            case PoDoFo::ePdfContentsType_Keyword:
                // process token: it contains the current command
                //   pop from var stack as necessary
                break;
            case PoDoFo::ePdfContentsType_Variant:
                // process var: push it onto a stack
                break;
            default:
                // should not happen!
                break;
        }
    }
}

The "process token" and "process var" comments are where it gets a little more complex. You are given raw PDF commands to process. Luckily, if you're not actually rendering the page and all you want is the text, you can ignore most of them. The commands you need to process are:

BT, ET, Td, TD, Ts, T*, Tm, Tf, ", ', Tj and TJ

The BT and ET commands mark the beginning and end of a text stream, so you want to ignore anything that's not between a BT/ET pair.
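
As a rough illustration of the BT/ET rule, here is a minimal sketch that does not use PoDoFo at all. The Token struct and the tokensInsideTextObjects function are hypothetical stand-ins for whatever your tokenizer loop produces; the only point is the inText flag that switches on at BT and off at ET.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one item read from a content stream:
// either an operand (a value) or an operator keyword such as "BT" or "Tj".
struct Token {
    bool isKeyword;
    std::string text;
};

// Keep only the tokens that appear between a BT and its matching ET.
// The BT and ET keywords themselves are dropped.
std::vector<Token> tokensInsideTextObjects(const std::vector<Token>& stream) {
    std::vector<Token> kept;
    bool inText = false;
    for (const Token& t : stream) {
        if (t.isKeyword && t.text == "BT") { inText = true; continue; }
        if (t.isKeyword && t.text == "ET") { inText = false; continue; }
        if (inText) kept.push_back(t);
    }
    return kept;
}
```

In a real PoDoFo loop you would probably keep the same flag as a local variable next to the switch statement rather than building a second vector.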

The PDF language is RPN based. A command stream consists of values which are pushed onto a stack and commands which pop values off the stack and process them.
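
To make the RPN idea concrete, here is a small self-contained sketch (again without PoDoFo). Token and pairOperatorsWithOperands are hypothetical names, and the sketch assumes a well-formed content stream in which every operator consumes all operands pushed since the previous operator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token: an operand (value) or an operator (command).
struct Token {
    bool isOperator;
    std::string text;
};

// Push operands onto a stack; when an operator arrives, pop everything
// and record the operator together with the operands it consumed.
std::vector<std::string> pairOperatorsWithOperands(const std::vector<Token>& stream) {
    std::vector<std::string> stack;
    std::vector<std::string> calls;
    for (const Token& t : stream) {
        if (!t.isOperator) {
            stack.push_back(t.text);
            continue;
        }
        std::string call = t.text + "(";
        for (std::size_t i = 0; i < stack.size(); ++i) {
            if (i > 0) call += ",";
            call += stack[i];
        }
        call += ")";
        calls.push_back(call);
        stack.clear();  // assumption: no operands carry over to the next operator
    }
    return calls;
}
```

Feeding it the tokens for `BT /F1 12 Tf 72 700 Td (Hello) Tj ET` pairs Tf with `/F1` and `12`, Td with `72` and `700`, and Tj with `(Hello)`.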

The ", ', Tj and TJ commands are the only ones which actually generate text. ", ' and Tj return a single string. Use var.IsString() and var.GetString() to process it.

TJ returns an array of strings. You can extract each one with:

if (var.IsArray()) {
    PoDoFo::PdfArray& a = var.GetArray();
    for (size_t i = 0; i < a.GetSize(); ++i) {
        if (a[i].IsString()) {
            // do something with a[i].GetString()
        }
    }
}
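
One detail worth knowing: a TJ array can also contain numbers between the strings, which adjust the spacing (in thousandths of a text space unit, with negative values widening the gap). A common heuristic is to treat a large negative number as a word space. The sketch below models that without PoDoFo; TjElement, joinTjArray and the -200 threshold are all assumptions made for illustration, not values from the spec.

```cpp
#include <string>
#include <vector>

// Hypothetical model of one TJ array element: a string or a numeric adjustment.
struct TjElement {
    bool isString;
    std::string text;  // meaningful when isString is true
    double adjust;     // meaningful when isString is false
};

// Concatenate the strings of a TJ array, inserting a single space wherever
// a numeric adjustment is at or below spaceThreshold (a guessed cutoff).
std::string joinTjArray(const std::vector<TjElement>& elems,
                        double spaceThreshold = -200.0) {
    std::string out;
    for (const TjElement& e : elems) {
        if (e.isString) {
            out += e.text;
        } else if (e.adjust <= spaceThreshold && !out.empty() && out.back() != ' ') {
            out += ' ';
        }
    }
    return out;
}
```

The right threshold depends on the font and the producer of the PDF, so expect to tune it against your own documents.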

The other commands are used to determine when to introduce a line break. " and ' also introduce line breaks. Your best bet is to download the PDF spec from Adobe and look up the text processing section. It explains what each command does in more detail.
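
A crude way to turn those commands into line breaks is sketched below. It is deliberately simplistic: it assumes the operations have already been limited to those inside BT/ET, and it treats every positioning command (Td, TD, T*, Tm) as a new line, which is a heuristic and will be wrong for PDFs that reposition within a line. TextOp and extractText are hypothetical names, not PoDoFo types.

```cpp
#include <string>
#include <vector>

// Hypothetical, already-decoded operation: the operator name plus any
// string operands it carried (empty for purely numeric operators).
struct TextOp {
    std::string op;
    std::vector<std::string> strings;
};

// Build plain text, starting a new line on positioning operators and on
// the ' and " operators, and appending text for Tj, TJ, ' and ".
std::string extractText(const std::vector<TextOp>& ops) {
    std::string out;
    for (const TextOp& o : ops) {
        const bool movesLine = o.op == "Td" || o.op == "TD" || o.op == "T*" ||
                               o.op == "Tm" || o.op == "'" || o.op == "\"";
        if (movesLine && !out.empty() && out.back() != '\n') out += '\n';

        const bool showsText = o.op == "Tj" || o.op == "TJ" ||
                               o.op == "'" || o.op == "\"";
        if (showsText) {
            for (const std::string& s : o.strings) out += s;
        }
    }
    return out;
}
```

A more careful version would look at the actual offsets passed to Td and Tm and only break when the vertical position changes.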

I found it very helpful to write a small program which takes a PDF file and dumps out the command stream for each page.

Note: If all you're doing is extracting raw text with no positioning information, you don't actually need to maintain a stack of var values. All the text rendering commands have, at most, one parameter. You can simply assume that the last value in var contains the parameter for the current command.
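
The shortcut in this note can be sketched as follows, once more with a hypothetical Token model instead of PoDoFo types. Only the most recent operand is remembered. As far as I can tell from the PDF spec, the " command takes two spacing numbers before its string, but the string is still the last operand, so the shortcut should hold there too.

```cpp
#include <string>
#include <vector>

// Hypothetical token: a keyword, or an operand whose string content (if any)
// is in `strings` (one entry for a string, several for a TJ array,
// none for a number).
struct Token {
    bool isKeyword;
    std::string keyword;
    std::vector<std::string> strings;
};

// Extract raw text while remembering only the last operand seen.
std::string rawText(const std::vector<Token>& stream) {
    std::string out;
    const Token* last = nullptr;  // most recent operand; no stack is kept
    for (const Token& t : stream) {
        if (!t.isKeyword) {
            last = &t;
            continue;
        }
        const std::string& k = t.keyword;
        if (last && (k == "Tj" || k == "TJ" || k == "'" || k == "\"")) {
            for (const std::string& s : last->strings) out += s;
        }
        last = nullptr;  // operands never carry over past a command
    }
    return out;
}
```

This drops all layout, so the result is one unbroken run of text; combine it with some line-break handling if you need readable output.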

Answer 2

Answered by paddy

I haven't used PoDoFo, but a quick browse through the class hierarchy on their API webpage reveals:

void PoDoFo::PdfMemDocument::Load( const char * pszFilename )

(API doc link)

So I would just hazard a guess here, that you do:

PoDoFo::PdfMemDocument doc;
doc.Load( "somefile.pdf" );

Then I imagine you navigate the document tree by calling doc.GetObjects() and walking through that array (see the PdfDocument class).
