C# PDFSharp: Examples of how to strip text from PDF?
Source: http://stackoverflow.com/questions/9591992/ (Stack Overflow, licensed under CC BY-SA 4.0; attribute it to the original authors)
Asked by I Z
I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are there examples of how to do it?
Thanks!
Accepted answer by I liked the old Stack Overflow
Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread: https://stackoverflow.com/a/9161732/162529
Answer by Mariusz
Example of using the PDFsharp library to extract images from a .pdf file:
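The code listing from this answer is not included on this page. As a stand-in, here is a minimal sketch of the usual PDFsharp approach: walk each page's /Resources, look up its /XObject dictionary, and save every image whose /Filter is /DCTDecode. The class name `ImageExporter`, the method name `ExportJpegImages`, and the output file naming are assumptions made for illustration. The sketch only handles images with a single /DCTDecode filter; other encodings (for example /FlateDecode) need extra work to turn into a usable image file.

```csharp
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public static class ImageExporter
{
    // Hypothetical helper: writes every JPEG (/DCTDecode) image XObject found
    // on the pages of pdfPath into outputDir and returns the number written.
    public static int ExportJpegImages(string pdfPath, string outputDir)
    {
        int count = 0;
        PdfDocument document = PdfReader.Open(pdfPath);

        for (int i = 0; i < document.Pages.Count; i++)
        {
            PdfPage page = document.Pages[i];
            PdfDictionary resources = page.Elements.GetDictionary("/Resources");
            PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue; // page has no external objects

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                // Image XObjects are streams, so they are stored as indirect objects.
                PdfDictionary xObject = (item as PdfReference)?.Value as PdfDictionary;
                if (xObject == null) continue;
                if (xObject.Elements.GetName("/Subtype") != "/Image") continue;

                // Only a single /DCTDecode filter is handled in this sketch;
                // filter arrays and other encodings are skipped.
                PdfName filter = xObject.Elements["/Filter"] as PdfName;
                if (filter == null || filter.Value != "/DCTDecode") continue;

                // For DCTDecode the stream bytes are already a complete JPEG file,
                // so they are written unchanged, without re-compression.
                string target = Path.Combine(outputDir, "Image" + count + ".jpg");
                File.WriteAllBytes(target, xObject.Stream.Value);
                count++;
            }
        }
        return count;
    }
}
```

Because the JPEG bytes are copied straight out of the PDF, this also covers the wish in the question to avoid re-compressing the images, at least for JPEG-encoded ones.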
EDIT:
Then, if you want to extract text from the images, you have to use an OCR library.
There are two good OCR libraries: tessnet and MODI (link to thread on Stack Overflow).
I can fully recommend MODI, which I am using now. There is some sample code on CodeProject.
EDIT 2:
If you don't want to read text from the extracted images, you should write a new PDF document and put all of the images into it. For writing PDFs I use MigraDoc. That library is not difficult to use.
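A minimal sketch of that step with MigraDoc, assuming the images have already been saved to disk. The class name `ImagePdfWriter`, the method name `WriteImagesToPdf`, the one-image-per-page layout, and the 16 cm width are all choices made for this example, not something the answer prescribes.

```csharp
using System.Collections.Generic;
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;

public static class ImagePdfWriter
{
    // Hypothetical helper: builds a document with one image per page
    // and saves it as a PDF file.
    public static void WriteImagesToPdf(IEnumerable<string> imagePaths, string outputPdfPath)
    {
        var document = new Document();
        Section section = document.AddSection();

        bool first = true;
        foreach (string imagePath in imagePaths)
        {
            // Start a new page before every image except the first one.
            if (!first) section.AddPageBreak();
            first = false;

            var image = section.AddImage(imagePath);
            image.LockAspectRatio = true;           // keep the proportions
            image.Width = Unit.FromCentimeter(16);  // assumed width, fits an A4 page
        }

        // Render the MigraDoc document model to a PDFsharp document and save it.
        var renderer = new PdfDocumentRenderer(true) { Document = document };
        renderer.RenderDocument();
        renderer.PdfDocument.Save(outputPdfPath);
    }
}
```

It could be called as `ImagePdfWriter.WriteImagesToPdf(Directory.GetFiles(dir, "*.jpg"), "images-only.pdf")`, where `dir` is wherever the extracted images were written.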
Answer by Mason
Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
using PdfSharp.Pdf.IO; // PdfReader lives in this namespace
var doc = PdfReader.Open(docPath);
// Raw content stream of the first content element on the first page (text plus PDF operators).
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();
doc.Pages.Count gives you the total number of pages, and you access each one through the doc.Pages array with the index. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary selects which PDF document element to read; the right one may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
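Looping over every page and every content element, as suggested, might look like the sketch below. The class name `RawTextDump` and the method name `ReadAllContentStreams` are made up for this example; the PDFsharp calls are the same ones used in the snippet above, plus the Count property of the element collection.

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class RawTextDump
{
    // Hypothetical helper: concatenates the raw content streams of every
    // content element on every page. The result still contains PDF operators.
    public static string ReadAllContentStreams(string docPath)
    {
        var sb = new StringBuilder();
        PdfDocument doc = PdfReader.Open(docPath);

        // Index-based loops, following the advice to avoid foreach and Linq here.
        for (int p = 0; p < doc.Pages.Count; p++)
        {
            var elements = doc.Pages[p].Contents.Elements;
            for (int e = 0; e < elements.Count; e++)
            {
                PdfDictionary content = elements.GetDictionary(e);
                if (content == null || content.Stream == null) continue; // nothing to read
                sb.AppendLine(content.Stream.ToString());
            }
        }
        return sb.ToString();
    }
}
```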
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
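As a rough illustration of the Regex idea: in a PDF content stream, simple text is usually written as a literal string in parentheses followed by the Tj operator, for example `(Hello) Tj`. The sketch below pulls out just those strings. The class name `ContentStreamStrings` and the method name `ExtractTjStrings` are invented for this example, and it deliberately ignores TJ arrays, hex strings, unescaped nested parentheses, and font-specific encodings, so treat it as a starting point only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamStrings
{
    // Matches a PDF literal string "( ... )" followed by the Tj (show text)
    // operator. Escaped characters such as \( and \) are allowed inside the
    // parentheses; unescaped nested parentheses are not handled.
    private static readonly Regex ShowText =
        new Regex(@"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Singleline);

    // Hypothetical helper: returns the raw string operands of all Tj operators.
    // Escape sequences are returned as written in the stream, not decoded.
    public static List<string> ExtractTjStrings(string contentStream)
    {
        var results = new List<string>();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }
}
```

For the input `BT /F1 12 Tf 72 720 Td (Hello) Tj ET` this returns a list with the single entry `Hello`.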

