How to convert PDF to HTML using C#

Question

Asked by Radhi

I have to read PDF files and create HTML documents from them, for CVs uploaded to my site. I cannot use any shareware. Can anybody suggest the best solution for converting PDF to HTML, or for reading PDF content using C#?

The site is developed in C#, ASP.NET 3.5.

Answer 1

Accepted answer by Radhi

I got some code from here and downloaded the iTextSharp DLL.

The code worked fine, with one problem: most of my files are converted to text, HTML, or whatever format I want, but there is one file I am not able to convert. If anybody can try this code and help me find out what the problem is, I will be thankful.

You can see the code here:

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace PDFReader
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator 
        /// ET = End of a text object operator
        /// Td move to the start of next line
        ///  5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts a text from a PDF file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the text was extracted and written to the output file; otherwise false</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }

                    }
                }

                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object 
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns></returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing if we are we currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal 
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep previous chars to get extract numbers etc.:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';


                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                        c = "'".ToCharArray()[0];

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\r\n";
                            }
                            else
                            {
                                if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                                {
                                    resultString += "\n";
                                }
                                else
                                {
                                    if (CheckToken(new string[] { "Tj" }, previousCharacters))
                                    {
                                        resultString += " ";
                                    }
                                }
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {

                            inTextObject = false;
                            resultString += " ";
                        }
                        else
                        {
                            // Start outputting text
                            if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                            {
                                bracketDepth = 1;
                            }
                            else
                            {
                                // Stop outputting text
                                if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                                {
                                    bracketDepth = 0;
                                }
                                else
                                {
                                    // Just a normal text character:
                                    if (bracketDepth == 1)
                                    {
                                        // Only print out next character no matter what. 
                                        // Do not interpret.
                                        if (c == '\\' && !nextLiteral)
                                        {
                                            resultString += c.ToString();
                                            nextLiteral = true;
                                        }
                                        else
                                        {
                                            if (((c >= ' ') && (c <= '~')) ||
                                                ((c >= 128) && (c < 255)))
                                            {
                                                resultString += c.ToString();
                                            }

                                            nextLiteral = false;
                                        }
                                    }
                                }
                            }
                        }
                    }

                    // Store the recent characters for 
                    // when we have to go back for a checking
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        private string CleanupContent(string text)
        {
            // Each pattern matches a literal backslash followed by '(' , ')' or an octal character code,
            // because the extraction loop copies PDF escape sequences into the text unchanged.
            string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
            string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }

            return text;
        }

        #endregion

        #region CheckToken
        /// <summary>
        /// Check if a certain 2 character token just came along (e.g. BT)
        /// </summary>
        /// <param name="tokens">the searched token</param>
        /// <param name="recent">the recent character array</param>
        /// <returns></returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            try
            {
                foreach (string token in tokens)
                {
                    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                        (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
            }
            catch (Exception)
            {
              return true;
            }
            return false;
        }
        #endregion
    }
}

I am getting an "Index out of range" error in the function CheckToken.
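A likely cause, offered as a guess rather than a confirmed diagnosis: CheckToken always reads token[1], but two of the tokens passed to it ("'" and "\"") are only one character long. The second index is evaluated only when the first character happens to match, which would explain why only some files fail. Below is a minimal sketch of a length-aware check. The class name TokenChecker and the helpers MatchesToken and IsDelimiter are hypothetical and not part of the original code.

```csharp
using System;

public static class TokenChecker
{
    // Returns true when `recent` ends with: delimiter, token, delimiter.
    // The token may be any non-zero length, so one-character operators are safe.
    public static bool CheckToken(string[] tokens, char[] recent)
    {
        foreach (string token in tokens)
        {
            if (MatchesToken(token, recent)) return true;
        }
        return false;
    }

    private static bool MatchesToken(string token, char[] recent)
    {
        // The buffer must hold: leading delimiter + token + trailing delimiter.
        if (token.Length == 0 || recent.Length < token.Length + 2) return false;

        int last = recent.Length - 1;      // index of the trailing delimiter
        int start = last - token.Length;   // index of the first token character

        if (!IsDelimiter(recent[last]) || !IsDelimiter(recent[start - 1])) return false;

        for (int k = 0; k < token.Length; k++)
        {
            if (recent[start + k] != token[k]) return false;
        }
        return true;
    }

    private static bool IsDelimiter(char c)
    {
        return c == ' ' || c == (char)0x0d || c == (char)0x0a;
    }
}
```

For two-character tokens such as BT or ET this checks the same buffer positions as the original. For one-character tokens the leading delimiter is expected one position later, which is a design choice of this sketch.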

Answer 2

Answered by Galwegian

You could use something like pdftotext, as outlined in the following article: How To Convert Pdf file to text in asp.net
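As a rough illustration of calling pdftotext from C#, the sketch below starts it as an external process. It assumes the pdftotext executable is installed and on the PATH; the class name PdfToTextRunner and its methods are hypothetical. The -enc and -layout options are standard pdftotext options, but whether they suit your files is something to check.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the command line: UTF-8 output, approximate physical layout preserved.
    public static string BuildArguments(string pdfPath, string textPath)
    {
        return "-enc UTF-8 -layout " + Quote(pdfPath) + " " + Quote(textPath);
    }

    // Wraps a path in double quotes so that spaces survive argument parsing.
    private static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Runs pdftotext and returns true when it exits with code 0.
    // Process.Start throws if the executable cannot be found.
    public static bool Convert(string pdfPath, string textPath)
    {
        ProcessStartInfo info = new ProcessStartInfo("pdftotext", BuildArguments(pdfPath, textPath));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

A call such as PdfToTextRunner.Convert(uploadedPath, uploadedPath + ".txt") would then leave the extracted text next to the uploaded file, where both path values are placeholders for your own.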

Answer 3

Answered by Thorsten79

It depends on what you want to do. Converting PDFs to plain text without formatting can be done with pdftotext or similar tools. Converting a PDF layout to an HTML layout is very hard, because the PDF design philosophy is very different from how HTML layout works. Google has some sort of solution for it, but it usually breaks the layout.

Regarding your CV concept: since CV layout is highly important for customers using a site, I would not want to auto-convert PDF CVs to HTML CVs. What pdftotext can offer you is plain text in which a CV search engine can find parts of the CV.
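If an HTML page is still needed for display, the extracted plain text can be wrapped in a minimal page without attempting to reproduce the layout. The sketch below is one possible way to do that; the class name TextToHtml is hypothetical. It uses System.Net.WebUtility.HtmlEncode, which requires .NET 4.0 or later; on ASP.NET 3.5 the equivalent call would be System.Web.HttpUtility.HtmlEncode.

```csharp
using System;
using System.Net;
using System.Text;

public static class TextToHtml
{
    // Wraps plain text in a minimal HTML page.
    // The text is HTML-encoded and placed in a <pre> block so line breaks are kept.
    public static string Wrap(string title, string plainText)
    {
        StringBuilder html = new StringBuilder();
        html.Append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"><title>");
        html.Append(WebUtility.HtmlEncode(title));
        html.Append("</title></head>\n<body>\n<pre>");
        html.Append(WebUtility.HtmlEncode(plainText));
        html.Append("</pre>\n</body>\n</html>\n");
        return html.ToString();
    }
}
```

Encoding the text matters here because a CV may contain characters such as < or & that would otherwise be read as markup.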

Answer 4

Answered by necrostaz

You can convert the PDF to an image with Ghostscript and C#, and then publish the image on the site. For details, see this article.
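The article itself is not reproduced here, so the following is only a sketch of one common approach: running the Ghostscript command-line program from C# to render each page as a PNG. It assumes Ghostscript is installed; the executable is typically gswin32c.exe or gswin64c.exe on Windows and gs elsewhere. The class name GhostscriptRunner and its methods are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptRunner
{
    // Builds the command line for rendering every page to a 24-bit PNG.
    // outputPattern may contain %03d, which Ghostscript replaces with the page number.
    public static string BuildArguments(string pdfPath, string outputPattern, int dpi)
    {
        return "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r" + dpi +
               " -sOutputFile=\"" + outputPattern + "\" \"" + pdfPath + "\"";
    }

    // Runs Ghostscript and returns true when it exits with code 0.
    // Process.Start throws if the executable cannot be found.
    public static bool Convert(string ghostscriptExe, string pdfPath, string outputPattern, int dpi)
    {
        ProcessStartInfo info = new ProcessStartInfo(ghostscriptExe, BuildArguments(pdfPath, outputPattern, dpi));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

For example, GhostscriptRunner.Convert("gswin32c.exe", "cv.pdf", "page-%03d.png", 150) would be expected to write page-001.png, page-002.png and so on. Images keep the layout intact but are not searchable, so this complements rather than replaces text extraction.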
