Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2116440/

Date: 2020-08-06 23:38:34  Source: igfitidea

Extracting text from PDFs in C#

Tags: c#, pdf, text, extract

Asked by Duncan Tait

Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, huge blocks of them taking up several lines. It all seems a bit random.

Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?

Answered by Tarydon

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as 2 separate calls, as shown in this pseudo-code:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

This is done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing some micro space between characters to get a fully justified line. The end result is that the text fragments actually found in a PDF are very often not full words, but pieces of them.
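One way to cope with such fragments is to rejoin them by position. What follows is only a minimal sketch, not taken from any library mentioned on this page: `TextFragment` and `FragmentMerger` are hypothetical names, and it assumes your PDF parser can report each run's start position and rendered width. The two threshold values are arbitrary defaults that would need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one positioned text run as a PDF parser might report it.
// X and Y mark the start of the run; Width is its rendered width in the same units.
public sealed class TextFragment
{
    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public string Text { get; }

    public TextFragment(double x, double y, double width, string text)
    {
        X = x;
        Y = y;
        Width = width;
        Text = text;
    }
}

public static class FragmentMerger
{
    // Joins fragments into lines of text. A space is inserted only when the
    // horizontal gap between two neighbouring fragments exceeds gapThreshold,
    // so kerning adjustments inside a word do not split it.
    public static string Merge(IEnumerable<TextFragment> fragments,
                               double lineTolerance = 2.0,
                               double gapThreshold = 1.5)
    {
        // Step 1: cluster fragments into lines by Y. This assumes Y grows
        // upwards, so the largest Y is the first line on the page.
        var lines = new List<List<TextFragment>>();
        foreach (var fragment in fragments.OrderByDescending(item => item.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= lineTolerance)
                current.Add(fragment);
            else
                lines.Add(new List<TextFragment> { fragment });
        }

        // Step 2: read each line left to right and decide where spaces belong.
        var output = new List<string>();
        foreach (var line in lines)
        {
            var builder = new StringBuilder();
            TextFragment previous = null;
            foreach (var fragment in line.OrderBy(item => item.X))
            {
                if (previous != null)
                {
                    double gap = fragment.X - (previous.X + previous.Width);
                    if (gap > gapThreshold)
                        builder.Append(' ');
                }
                builder.Append(fragment.Text);
                previous = fragment;
            }
            output.Add(builder.ToString());
        }

        return string.Join("\n", output);
    }
}
```

With the Tap example, a "T" run that ends at x = 15 followed by an "ap" run that starts at x = 14.6 has a slightly negative gap, so the two are joined without a space. A fixed threshold is the simplest possible heuristic; scaling it by font size or average glyph width would probably be more robust.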

Answered by Bobrovsky

You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. The library uses some heuristics to extract clean-looking text without unwanted spaces between the letters of a word.

Please take a look at a sample that shows how to extract text from PDF.

Answered by Tony Qu

You can try Toxy, a text/data extraction framework for .NET. PDF will be supported in Toxy 1.0. For details, please visit http://toxy.codeplex.com

Answered by Jussi Palo

If you're looking for a "free" alternative, check out PDF Clown. I personally have used an IFilter-based approach, and it seems to work fine if you need to support other file types easily. Sample code here.

Answered by David Hammond

Take a look at Tika on DotNet, available through Nuget: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the extremely good Tika Java library, using IKVM. It is very easy to use and handles a wide variety of file types other than PDF, including old and new Office formats. It will auto-select the parser based on the file extension, so it's as easy as:

var text = new TextExtractor().Extract(file.FullName).Text;

Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx

Answered by Eugene

If you are processing PDF files in order to import data into a database, then I suggest considering ByteScout PDF Extractor SDK. Some useful features include:

  • table detection;
  • text extraction as CSV, XML or formatted text (with the optional layout restoration);
  • text search with support for regular expressions;
  • low-level API to access text objects

DISCLAIMER: I'm affiliated with ByteScout
