How to programmatically search a PDF document in C#

Question

Asked by Nathan

I need to search a PDF file to see whether a certain string is present. The string in question is definitely encoded as text (i.e. it is not an image or anything). I have tried searching the file as though it were plain text, but this does not work.

Is it possible to do this? Are there any libraries out there for .NET 2.0 that will extract/decode all the text out of a PDF file for me?

Answer 1

Accepted answer by volatilsis

There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a little bit of effort but it's possible.
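
As a rough idea of what that effort looks like, here is a minimal sketch of a page-by-page search. It assumes iTextSharp 5.x, where PdfTextExtractor lives in the iTextSharp.text.pdf.parser namespace; older releases from the SourceForge page above may not include it. The names PdfSearch and FindPages are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfSearch
{
    // Returns the 1-based numbers of the pages whose extracted text
    // contains searchText (case-insensitive).
    // Assumption: iTextSharp 5.x is referenced by the project.
    public static List<int> FindPages(string path, string searchText)
    {
        List<int> hits = new List<int>();
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
                if (pageText.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    hits.Add(page);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return hits;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: PdfSearch <file.pdf> <text>");
            return;
        }

        foreach (int page in FindPages(args[0], args[1]))
        {
            Console.WriteLine("Match on page " + page);
        }
    }
}
```

Extracted text does not always keep the spacing and reading order you see on screen, so a phrase that spans a line break or a column may still be missed.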

Answer 2

Answered by Rowan

In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it up in notepad -- and even in the minority of cases (depending on how the PDF was constructed), you'll only ever be able to search for individual words due to the way that PDF handles text internally.
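
To illustrate that last point, here is a small sketch that uses a made-up, uncompressed page content stream. The PDF TJ operator takes an array of string fragments with spacing adjustments between them, so a phrase is often not stored as one contiguous run of characters. The content string and the JoinStringOperands helper are invented for this example and belong to no library; the helper ignores escape sequences and nested parentheses. In most real files the stream is also compressed, so not even the fragments show up in a text editor.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RawPdfSearchDemo
{
    // Joins the literal string operands "(...)" found in a content stream
    // fragment and drops the kerning numbers between them.
    // Simplification: escapes and nested parentheses are not handled.
    public static string JoinStringOperands(string contentStream)
    {
        StringBuilder joined = new StringBuilder();
        foreach (Match m in Regex.Matches(contentStream, @"\(([^)]*)\)"))
        {
            joined.Append(m.Groups[1].Value);
        }
        return joined.ToString();
    }

    public static void Main()
    {
        // Hypothetical page content: one phrase split into three fragments.
        string content = "BT /F1 12 Tf [(cert) 20 (ain str) -15 (ing)] TJ ET";
        string needle = "certain string";

        // Searching the raw bytes as plain text misses the phrase.
        Console.WriteLine(content.Contains(needle));                     // False

        // After joining the fragments the phrase is found.
        Console.WriteLine(JoinStringOperands(content).Contains(needle)); // True
    }
}
```

A real extractor also has to decompress the stream, map font encodings to characters, and decide where spaces belong, which is why a library is the practical route.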

My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.

using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library
            // Create an instance of the class and give it the path to the DLL
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        using (StreamWriter TempFile = new StreamWriter(QP.GetTempPath() + "temp" + PageNumber + ".txt"))
                        {
                            TempFile.Write(PageText);
                        }

                        string[] lines = File.ReadAllLines(QP.GetTempPath() + "temp" + PageNumber + ".txt");
                        string[][] grid = new string[lines.Length][];

                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
                            // Skip rows with fewer than 12 fields so that line[11] cannot throw.
                            if (line.Length <= 11)
                            {
                                continue;
                            }

                            string FindMatch = line[11];

                            // Update this string to the text that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").

                            if (FindMatch.Contains("characters"))
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}

Answer 3

Answered by Bobrovsky

You can use the Docotic.Pdf library to search for text in PDF files.

Here is some sample code:

static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                // Pages is zero-based, so add 1 to report a human-readable page number.
                Console.WriteLine("'{0}' found on page {1}", text, i + 1);
        }
    }
}

The library can also extract formatted and plain text from the whole document or any document page.

Disclaimer: I work for Bit Miracle, vendor of the library.
