C# 在 .Net 中阅读 PDF 文档

Question

提问by JRoppert

Is there an open source library that will help me with reading/parsing PDF documents in .Net/C#?

是否有一个开源库可以帮助我阅读/解析 .Net/C# 中的 PDF 文档？

Answer 1

采纳答案by Brock Nusser

Since this question was last answered in 2008, iTextSharp has improved their api dramatically. If you download the latest version of their api from http://sourceforge.net/projects/itextsharp/, you can use the following snippet of code to extract all text from a pdf into a string.

自从上次回答这个问题是在 2008 年，iTextSharp 极大地改进了他们的 api。如果您从http://sourceforge.net/projects/itextsharp/下载最新版本的 api ，您可以使用以下代码片段将 pdf 中的所有文本提取为字符串。

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static string pdfText(string path)
        {
            PdfReader reader = new PdfReader(path);
            string text = string.Empty;
            for(int page = 1; page <= reader.NumberOfPages; page++)
            {
                text += PdfTextExtractor.GetTextFromPage(reader,page);
            }
            reader.Close();
            return text;
        }   
    }
}

Answer 2

回答by Alex Fort

You could look into this: http://www.codeproject.com/KB/showcase/pdfrasterizer.aspxIt's not completely free, but it looks very nice.

你可以看看这个： http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx这不是完全免费的，但它看起来很不错。

Alex

亚历克斯

Answer 3

回答by Ilya Kochetov

PDFClownmight help, but I would not recommend it for a big or heavy use application.

PDFClown可能会有所帮助，但我不建议将其用于大型或大量使用的应用程序。

Answer 4

回答by Cetra

There is also LibHaru

还有LibHaru

http://libharu.org/wiki/Main_Page

Answer 5

回答by Cetra

iText is the best library I know. Originally written in Java, there is a .NET port as well.

iText 是我所知道的最好的图书馆。最初是用 Java 编写的，还有一个 .NET 端口。

See http://www.ujihara.jp/iTextdotNET/en/

见http://www.ujihara.jp/iTextdotNET/en/

Answer 6

回答by Ben McEvoy

http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspxis open source and may be a good starting point for you.

http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx是开源的，可能是您一个很好的起点。

Answer 7

回答by ceetheman

iTextSharpis the best bet. Used it to make a spider for lucene.Net so that it could crawl PDF.

iTextSharp是最好的选择。用它为 lucene.Net 制作蜘蛛，以便它可以抓取 PDF。

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace Spider.Utils
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator 
        /// ET = End of a text object operator
        /// Td move to the start of next line
        ///  5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts a text from a PDF file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>the extracted text</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }

                    }
                }

                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object 
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns></returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing if we are we currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal 
                // e.g. '\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep previous chars to get extract numbers etc.:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';


                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                        c = "'".ToCharArray()[0];

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\n\r";
                            }
                            else
                            {
                                if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                                {
                                    resultString += "\n";
                                }
                                else
                                {
                                    if (CheckToken(new string[] { "Tj" }, previousCharacters))
                                    {
                                        resultString += " ";
                                    }
                                }
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {

                            inTextObject = false;
                            resultString += " ";
                        }
                        else
                        {
                            // Start outputting text
                            if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                            {
                                bracketDepth = 1;
                            }
                            else
                            {
                                // Stop outputting text
                                if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                                {
                                    bracketDepth = 0;
                                }
                                else
                                {
                                    // Just a normal text character:
                                    if (bracketDepth == 1)
                                    {
                                        // Only print out next character no matter what. 
                                        // Do not interpret.
                                        if (c == '\' && !nextLiteral)
                                        {
                                            resultString += c.ToString();
                                            nextLiteral = true;
                                        }
                                        else
                                        {
                                            if (((c >= ' ') && (c <= '~')) ||
                                                ((c >= 128) && (c < 255)))
                                            {
                                                resultString += c.ToString();
                                            }

                                            nextLiteral = false;
                                        }
                                    }
                                }
                            }
                        }
                    }

                    // Store the recent characters for 
                    // when we have to go back for a checking
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        private string CleanupContent(string text)
        {
            string[] patterns = { @"\\(", @"\\)", @"\226", @"\222", @"\223", @"\224", @"\340", @"\342", @"\344", @"\300", @"\302", @"\304", @"\351", @"\350", @"\352", @"\353", @"\311", @"\310", @"\312", @"\313", @"\362", @"\364", @"\366", @"\322", @"\324", @"\326", @"\354", @"\356", @"\357", @"\314", @"\316", @"\317", @"\347", @"\307", @"\371", @"\373", @"\374", @"\331", @"\333", @"\334", @"\256", @"\231", @"\253", @"\273", @"\251", @"\221"};
            string[] replace = {   "(",     ")",      "-",     "'",      "\"",      "\"",    "à",      "a",      "?",      "à",      "?",      "?",      "é",      "è",      "ê",      "?",      "é",      "è",      "ê",      "?",      "ò",      "?",      "?",      "ò",      "?",      "?",      "ì",      "?",      "?",      "ì",      "?",      "?",      "?",      "?",      "ù",      "?",      "ü",      "ù",      "?",      "ü",      "?",      "?",      "?",      "?",      "?",      "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }

            return text;
        }

        #endregion

        #region CheckToken
        /// <summary>
        /// Check if a certain 2 character token just came along (e.g. BT)
        /// </summary>
        /// <param name="tokens">the searched token</param>
        /// <param name="recent">the recent character array</param>
        /// <returns></returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            foreach (string token in tokens)
            {
                if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                    (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                    ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                    (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                    (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                    ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                    (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                    (recent[_numberOfCharsToKeep - 4] == 0x0a))
                    )
                {
                    return true;
                }
            }
            return false;
        }
        #endregion
    }
}

Answer 8

回答by Kuvo

aspose pdfworks pretty well. then again, you have to pay for it

aspose pdf效果很好。再说一次，你必须为此付出代价

Answer 9

回答by ShravankumarKumar

public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
    PdfReader reader2 = new PdfReader((string)Filename);
    string strText = string.Empty;

    for (int page = 1; page <= reader2.NumberOfPages; page++)
    {
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
    PdfReader reader = new PdfReader((string)Filename);
    String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

    s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
    strText = strText + s;
    reader.Close();
    }
    return strText;
}

Answer 10

回答by Bobrovsky

Have a look at Docotic.Pdf library. It does not require you to make source code of your application open (like iTextSharp with viral AGPL 3 license, for example).

看看Docotic.Pdf 库。它不需要您打开应用程序的源代码（例如 iTextSharp 与病毒式 AGPL 3 许可证）。

Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Please have a look at the sample that shows how to extract text from PDFs.

Docotic.Pdf 可用于阅读 PDF 文件并提取带或不带格式的文本。请查看显示如何从 PDF 中提取文本的示例。

Disclaimer: I work for Bit Miracle, vendor of the library.

免责声明：我为该库的供应商 Bit Miracle 工作。

C# 在 .Net 中阅读 PDF 文档

提问by JRoppert

采纳答案by Brock Nusser

回答by Alex Fort

回答by Ilya Kochetov

回答by Cetra

回答by Cetra

回答by Ben McEvoy

回答by ceetheman

回答by Kuvo

回答by ShravankumarKumar

回答by Bobrovsky

相关推荐

最近更新

标签

C# 在 .Net 中阅读 PDF 文档

提问by JRoppert

采纳答案by Brock Nusser

回答by Alex Fort

回答by Ilya Kochetov

回答by Cetra

回答by Cetra

回答by Ben McEvoy

回答by ceetheman

回答by Kuvo

回答by ShravankumarKumar

回答by Bobrovsky

相关推荐

C# 泛型有什么好处，为什么要使用它们？

C# .NET WebBrowser 控件中的阻止对话框

C# 从派生类自动调用 base.Dispose()

C# 深度克隆对象

相关推荐

最近更新

标签