C# Extract text from PDF using PdfSharp
Source: http://stackoverflow.com/questions/10141143/
License: this content comes from Stack Overflow and is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.
Asked by der_chirurg
Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.
Answered by David Schmitt
PDFSharp provides all the tools needed to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from the TJ/Tj operators.
I've uploaded a simple implementation to GitHub.
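To make the operators mentioned in this answer concrete, here is a minimal sketch that parses a small hand-written content stream and lists the operators in it. The sample stream, the class name ContentStreamDemo and the method OperatorNames are made up for illustration. The sketch also assumes that your PdfSharp version offers a ContentReader.ReadContent overload that accepts a byte array, so check that before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using PdfSharp.Pdf.Content;          // ContentReader
using PdfSharp.Pdf.Content.Objects;  // CObject, COperator

public static class ContentStreamDemo
{
    // Hand-written sample for illustration only, not taken from a real PDF.
    // Tj shows a single string; TJ shows an array of strings and spacing adjustments.
    public const string SampleStream =
        "BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET";

    // Returns the name of every top-level operator, in stream order.
    public static List<string> OperatorNames(byte[] contentStream)
    {
        var names = new List<string>();

        // Assumption: the byte[] overload of ReadContent exists in your PdfSharp version.
        foreach (CObject item in ContentReader.ReadContent(contentStream))
        {
            if (item is COperator op)
                names.Add(op.OpCode.Name);
        }
        return names;
    }

    public static void Main()
    {
        byte[] bytes = Encoding.ASCII.GetBytes(SampleStream);
        Console.WriteLine(string.Join(" ", OperatorNames(bytes)));
    }
}
```

In a real document the stream comes from a page, so ContentReader.ReadContent(page) would take the place of the byte array.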
Answered by Sergio
I implemented it in a way somewhat similar to David's. Here is my code:
{
    // ....
    // Pages is zero-based, so index 1 selects the second page.
    var page = document.Pages[1];
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

// Recursively collects the string operands of the Tj/TJ text-showing operators.
private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator cOperator)
    {
        // Only text-showing operators are of interest; all others are skipped.
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence cSequence)
    {
        // A sequence may hold operators or nested operands; descend into each element.
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString cString)
    {
        textList.Add(cString.Value);
    }
    return textList;
}
Answered by Ronnie Overby
I took Sergio's answer and turned it into extension methods. I also changed the accumulation of strings into an iterator.
public static class PdfSharpExtensions
{
    // Yields the text strings found in the content stream of a single page.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    // Lazily walks the content objects and yields each string shown by Tj/TJ.
    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator cOperator)
        {
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString cString)
        {
            yield return cString.Value;
        }
    }
}
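As a usage illustration, the following sketch opens a document and prints the extracted strings page by page. It assumes the PdfSharpExtensions class above is compiled into the same project. The path "input.pdf" is a placeholder, and ExtractTextDemo and JoinFragments are hypothetical names that are not part of PdfSharp.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class ExtractTextDemo
{
    // Hypothetical helper: concatenates the extracted fragments without a separator,
    // because a TJ array can split a single word into several strings.
    public static string JoinFragments(IEnumerable<string> fragments)
    {
        return string.Concat(fragments);
    }

    public static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        using (PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                PdfPage page = document.Pages[i];
                Console.WriteLine($"--- Page {i + 1} ---");
                // ExtractText is the extension method from PdfSharpExtensions.
                Console.WriteLine(JoinFragments(page.ExtractText()));
            }
        }
    }
}
```

Expect rough output: the strings are returned as stored in the content stream, so word spacing expressed through positioning is lost, and text drawn with fonts that use a custom encoding may not come out readable.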

