PDF parsing in C++ (PoDoFo)
Original question: http://stackoverflow.com/questions/11715561/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by csteifel
I'm trying to extract text from some PDF files and would like to use PoDoFo. I have searched for examples of parsing a PDF with PoDoFo, but all I can find are examples of creating and writing a PDF file, which is not what I need.
If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or can suggest a different library, please let me know. I know there is pdftotext on Linux, but I can't use it, and I would much rather do everything internally than rely on an external program being installed.
Answered by Ferruccio
PoDoFo does not provide a means to easily extract text from a document, but it is not hard to do.
Load a document into a PdfMemDocument:
    PoDoFo::PdfMemDocument pdf("mydoc.pdf");
Iterate over each page:
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
Iterate over all the PDF commands on that page:
        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = nullptr;   // set when a keyword (command) is read
        PoDoFo::PdfVariant var;        // set when a value (operand) is read
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) {
            switch (type) {
                case PoDoFo::ePdfContentsType_Keyword:
                    // process token: it contains the current command
                    // pop from var stack as necessary
                    break;
                case PoDoFo::ePdfContentsType_Variant:
                    // process var: push it onto a stack
                    break;
                default:
                    // should not happen!
                    break;
            }
        }
    }
The "process token" and "process var" comments are where it gets a little more complex. You are given raw PDF commands to process. Luckily, if you're not actually rendering the page and all you want is the text, you can ignore most of them. The commands you need to process are: BT, ET, Td, TD, Ts, T*, Tm, Tf, ", ', Tj and TJ.
The BT and ET commands mark the beginning and end of a text stream, so you want to ignore anything that's not between a BT/ET pair.
The PDF language is RPN based. A command stream consists of values which are pushed onto a stack and commands which pop values off the stack and process them.
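The BT/ET filtering and the RPN evaluation described above can be sketched without PoDoFo at all. The following is only an illustration: the Token struct and the collectShownText function are hypothetical names made up for this example, and a real program would read its tokens from PoDoFo::PdfContentsTokenizer rather than from a hand-built vector. It handles only Tj, to keep the stack logic visible.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one content-stream token; a real program would
// obtain these from PoDoFo::PdfContentsTokenizer instead.
struct Token {
    bool isKeyword;    // true for a command such as BT or Tj, false for a value
    std::string text;  // the command name, or the value rendered as text
};

// Collects the operand of every Tj command found between BT and ET.
std::vector<std::string> collectShownText(const std::vector<Token>& tokens) {
    std::vector<std::string> shown;
    std::vector<std::string> operands;  // the RPN operand stack
    bool inText = false;                // true while between BT and ET

    for (const Token& t : tokens) {
        if (!t.isKeyword) {
            operands.push_back(t.text);  // values pile up until a command arrives
            continue;
        }
        if (t.text == "BT") {
            inText = true;
        } else if (t.text == "ET") {
            inText = false;
        } else if (t.text == "Tj" && inText && !operands.empty()) {
            shown.push_back(operands.back());  // Tj shows the value on top of the stack
        }
        operands.clear();  // every command consumes the operands before it
    }
    return shown;
}
```

Clearing the stack after each command is a simplification that works here because content-stream commands consume all of the operands that precede them.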
The ", ', Tj and TJ commands are the only ones which actually generate text. ", ' and Tj each show a single string, passed as the last operand. Use var.IsString() and var.GetString() to process it.
TJ takes an array whose elements are strings, possibly interleaved with numeric position adjustments. You can extract each string with:
    if (var.IsArray()) {
        PoDoFo::PdfArray& a = var.GetArray();
        for (size_t i = 0; i < a.GetSize(); ++i) {
            if (a[i].IsString()) {
                // do something with a[i].GetString()
            }
        }
    }
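One detail the loop above skips is what to do with the non-string elements of a TJ array. In the PDF spec these are numbers, in thousandths of a text-space unit, that shift the next glyph; a large negative value widens the gap and is often how a word space is encoded. The sketch below is PoDoFo-independent and heuristic: TjElement, joinTjArray and the -200 threshold are all assumptions chosen for illustration, not values taken from PoDoFo or from the original answer.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one element of a TJ array.
struct TjElement {
    bool isString;      // true: use `text`; false: use `adjustment`
    std::string text;
    double adjustment;  // thousandths of a text-space unit
};

// Joins the strings of a TJ array, inserting a space wherever an adjustment
// is at or below spaceThreshold (an assumed, tunable heuristic).
std::string joinTjArray(const std::vector<TjElement>& elements,
                        double spaceThreshold = -200.0) {
    std::string out;
    for (const TjElement& e : elements) {
        if (e.isString) {
            out += e.text;
        } else if (e.adjustment <= spaceThreshold &&
                   !out.empty() && out.back() != ' ') {
            out += ' ';  // treat a wide gap as a word break
        }
    }
    return out;
}
```

The right threshold depends on the font and on the tool that produced the PDF, so it is worth tuning against real documents.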
The other commands are used to determine when to introduce a line break. " and ' also introduce line breaks. Your best bet is to download the PDF spec from Adobe and look up the text processing section. It explains what each command does in more detail.
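As a rough illustration of the line-break decision, the sketch below classifies a command by name. It is a heuristic under stated assumptions, not something PoDoFo provides: startsNewLine is a hypothetical helper, the numeric operands are assumed to have been collected into a vector already, and Tm (which sets the text matrix directly and can also start a new line) is deliberately left out because handling it means comparing against the previous position.

```cpp
#include <string>
#include <vector>

// Returns true when the command is likely to begin a new line of text.
// `operands` holds the command's numeric operands in stream order
// (a hypothetical representation chosen for this example).
bool startsNewLine(const std::string& op, const std::vector<double>& operands) {
    // T*, ' and " always move to the start of the next line.
    if (op == "T*" || op == "'" || op == "\"") {
        return true;
    }
    // "tx ty Td" and "tx ty TD" move to the next line offset by (tx, ty);
    // a nonzero ty means the baseline actually changed.
    if ((op == "Td" || op == "TD") && operands.size() == 2) {
        return operands[1] != 0.0;
    }
    return false;
}
```

A text extractor could call this for each command inside a BT/ET pair and emit a newline before the next string whenever it returns true.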
I found it very helpful to write a small program which takes a PDF file and dumps out the command stream for each page.
Note: If all you're doing is extracting raw text with no positioning information, you don't actually need to maintain a stack of var values. Every text-showing command takes its string (or array) as its final operand, so you can simply assume that the last value in var contains the text for the current command.
Answered by paddy
I haven't used PoDoFo, but a quick browse through the class hierarchy on their API webpage reveals:
    void PoDoFo::PdfMemDocument::Load( const char * pszFilename )
So I would hazard a guess that you do:
    PoDoFo::PdfMemDocument doc;
    doc.Load( "somefile.pdf" );
Then I imagine you navigate the document tree by calling doc.GetObjects() and walking through that array (see the PdfDocument class).