Original question: http://stackoverflow.com/questions/567951/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to programmatically search a PDF document in C#
Asked by Nathan
I need to search a PDF file to see whether a certain string is present. The string in question is definitely encoded as text (i.e., it is not an image or anything like that). I have tried searching the file as though it were plain text, but this does not work.
Is it possible to do this? Are there any libraries for .NET 2.0 that will extract/decode all the text out of a PDF file for me?
Accepted answer by volatilsis
There are a few libraries available. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/
It takes a little bit of effort but it's possible.
Answer by Rowan
In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it in Notepad. Even in the minority of cases where it is (depending on how the PDF was constructed), you'll only ever be able to search for individual words because of the way PDF handles text internally.
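To illustrate why a raw search tends to fail even when the text is not compressed, here is a minimal sketch. It assumes a hypothetical, hand-written content stream in which the PDF `TJ` operator shows one phrase as several fragments separated by kerning adjustments. The `TjDemo` class and `JoinTjFragments` helper are made up for this illustration; a real extractor must also handle compressed streams, escaped characters, hex strings, and font encodings, which is what the libraries in these answers do.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TjDemo
{
    // Concatenates the literal strings "( ... )" found in a content stream,
    // ignoring the kerning numbers between them.
    // Simplification (assumption): no escaped or nested parentheses,
    // no hex strings, no font-specific encodings.
    public static string JoinTjFragments(string contentStream)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Match m in Regex.Matches(contentStream, @"\(([^()\\]*)\)"))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical uncompressed content stream: one phrase, four fragments.
        string raw = "BT /F1 12 Tf [(Hel) -20 (lo) 15 ( Wor) -10 (ld)] TJ ET";

        // A plain-text search of the raw stream misses the phrase.
        Console.WriteLine(raw.Contains("Hello World"));

        // Joining the fragments first makes the phrase searchable.
        Console.WriteLine(JoinTjFragments(raw).Contains("Hello World"));
    }
}
```

The first check fails because the phrase never appears contiguously in the raw stream; the second succeeds once the fragments are joined. In real files the content streams are usually compressed as well, so even the individual fragments are typically not visible in a text editor.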
My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code below that demonstrates how to search the text of a PDF file for a particular string.
using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library
            // Create an instance of the class and give it the path to the DLL
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");
            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");
                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;
                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        // Extract the page text as comma-separated rows
                        string PageText = QP.GetPageText(3);
                        // Write the rows to a temp file, then read them back line by line
                        string TempPath = QP.GetTempPath() + "temp" + PageNumber + ".txt";
                        using (StreamWriter TempFile = new StreamWriter(TempPath))
                        {
                            TempFile.Write(PageText);
                        }
                        string[] lines = File.ReadAllLines(TempPath);
                        string[][] grid = new string[lines.Length][];
                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }
                        foreach (string[] line in grid)
                        {
                            // Skip rows too short to hold the field read below (index 11)
                            if (line.Length <= 11)
                            {
                                continue;
                            }
                            string FindMatch = line[11];
                            // Update this string to the word that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").
                            if (FindMatch.Contains("characters"))
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }
                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}
Answer by Bobrovsky
You can use the Docotic.Pdf library to search for text in PDF files.
Here is some sample code:
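// Requires: using System; using BitMiracle.Docotic.Pdf;
static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            // Extract the text of the current page and search it, ignoring case
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                // Pages are indexed from zero, so report a one-based page number
                Console.WriteLine("'{0}' found on page {1}", text, i + 1);
        }
    }
}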
The library can also extract formatted and plain text from the whole document or from any document page.
Disclaimer: I work for Bit Miracle, vendor of the library.