How to convert PDF to HTML using C#

Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2295555/

Time: 2020-08-07 01:11:14  Source: igfitidea

How to convert PDF into HTML using C#

Tags: c#, asp.net, pdf

Asked by Radhi

I have to read PDF files and create HTML documents from them, for CVs uploaded to my site. I cannot use any shareware. Can anybody suggest the best solution for converting PDF to HTML, or for reading PDF content, using C#?

The site is developed in C#, ASP.NET 3.5.

Accepted answer by Radhi

I got some code from here and downloaded the iTextSharp DLL.

This code worked fine, with one problem: most of my files are converted to text, HTML, or whatever format I want, but there is one file I cannot convert. If anybody can try this code and help me find out what the problem is, I will be thankful.

You can see the code here:

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;

namespace PDFReader
{
    /// <summary>
    /// Parses a PDF file and extracts the text from it.
    /// </summary>
    public class PDFParser
    {
        /// BT = Beginning of a text object operator 
        /// ET = End of a text object operator
        /// Td move to the start of next line
        ///  5 Ts = superscript
        /// -5 Ts = subscript

        #region Fields

        #region _numberOfCharsToKeep
        /// <summary>
        /// The number of characters to keep, when extracting text.
        /// </summary>
        private static int _numberOfCharsToKeep = 15;
        #endregion

        #endregion

        #region ExtractText
        /// <summary>
        /// Extracts a text from a PDF file.
        /// </summary>
        /// <param name="inFileName">the full path to the pdf file.</param>
        /// <param name="outFileName">the output file name.</param>
        /// <returns>true if the text was extracted and written; false on any error</returns>
        public bool ExtractText(string inFileName, string outFileName)
        {
            StreamWriter outFile = null;
            try
            {
                // Create a reader for the given PDF file
                PdfReader reader = new PdfReader(inFileName);
                //outFile = File.CreateText(outFileName);
                outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);

                Console.Write("Processing: ");

                int totalLen = 68;
                float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
                int totalWritten = 0;
                float curUnit = 0;

                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");

                    // Write the progress.
                    if (charUnit >= 1.0f)
                    {
                        for (int i = 0; i < (int)charUnit; i++)
                        {
                            Console.Write("#");
                            totalWritten++;
                        }
                    }
                    else
                    {
                        curUnit += charUnit;
                        if (curUnit >= 1.0f)
                        {
                            for (int i = 0; i < (int)curUnit; i++)
                            {
                                Console.Write("#");
                                totalWritten++;
                            }
                            curUnit = 0;
                        }

                    }
                }

                if (totalWritten < totalLen)
                {
                    for (int i = 0; i < (totalLen - totalWritten); i++)
                    {
                        Console.Write("#");
                    }
                }
                return true;
            }
            catch
            {
                return false;
            }
            finally
            {
                if (outFile != null) outFile.Close();
            }
        }
        #endregion

        #region ExtractTextFromPDFBytes
        /// <summary>
        /// This method processes an uncompressed Adobe (text) object 
        /// and extracts text.
        /// </summary>
        /// <param name="input">uncompressed</param>
        /// <returns></returns>
        public string ExtractTextFromPDFBytes(byte[] input)
        {
            if (input == null || input.Length == 0) return "";

            try
            {
                string resultString = "";

                // Flag showing whether we are currently inside a text object
                bool inTextObject = false;

                // Flag showing if the next character is literal 
                // e.g. '\\' to get a '\' character or '\(' to get '('
                bool nextLiteral = false;

                // () Bracket nesting level. Text appears inside ()
                int bracketDepth = 0;

                // Keep the most recent characters so operators (BT, ET, Td, ...) can be detected:
                char[] previousCharacters = new char[_numberOfCharsToKeep];
                for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';


                for (int i = 0; i < input.Length; i++)
                {
                    char c = (char)input[i];
                    if (input[i] == 213)
                        c = "'".ToCharArray()[0];

                    if (inTextObject)
                    {
                        // Position the text
                        if (bracketDepth == 0)
                        {
                            if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            {
                                resultString += "\n\r";
                            }
                            else
                            {
                                if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                                {
                                    resultString += "\n";
                                }
                                else
                                {
                                    if (CheckToken(new string[] { "Tj" }, previousCharacters))
                                    {
                                        resultString += " ";
                                    }
                                }
                            }
                        }

                        // End of a text object, also go to a new line.
                        if (bracketDepth == 0 &&
                            CheckToken(new string[] { "ET" }, previousCharacters))
                        {

                            inTextObject = false;
                            resultString += " ";
                        }
                        else
                        {
                            // Start outputting text
                            if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                            {
                                bracketDepth = 1;
                            }
                            else
                            {
                                // Stop outputting text
                                if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                                {
                                    bracketDepth = 0;
                                }
                                else
                                {
                                    // Just a normal text character:
                                    if (bracketDepth == 1)
                                    {
                                        // Only print out next character no matter what. 
                                        // Do not interpret.
                                        if (c == '\\' && !nextLiteral)
                                        {
                                            resultString += c.ToString();
                                            nextLiteral = true;
                                        }
                                        else
                                        {
                                            if (((c >= ' ') && (c <= '~')) ||
                                                ((c >= 128) && (c < 255)))
                                            {
                                                resultString += c.ToString();
                                            }

                                            nextLiteral = false;
                                        }
                                    }
                                }
                            }
                        }
                    }

                    // Store the recent characters so that
                    // we can look back when checking for operator tokens
                    for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    {
                        previousCharacters[j] = previousCharacters[j + 1];
                    }
                    previousCharacters[_numberOfCharsToKeep - 1] = c;

                    // Start of a text object
                    if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    {
                        inTextObject = true;
                    }
                }

                return CleanupContent(resultString);
            }
            catch
            {
                return "";
            }
        }

        private string CleanupContent(string text)
        {
            // Escaped parentheses and octal escapes (Latin-1 / Windows-1252 codes) left in the extracted text.
            string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
            string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

            for (int i = 0; i < patterns.Length; i++)
            {
                string regExPattern = patterns[i];
                Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
                text = regex.Replace(text, replace[i]);
            }

            return text;
        }

        #endregion

        #region CheckToken
        /// <summary>
        /// Check if a certain 2 character token just came along (e.g. BT)
        /// </summary>
        /// <param name="tokens">the searched token</param>
        /// <param name="recent">the recent character array</param>
        /// <returns></returns>
        private bool CheckToken(string[] tokens, char[] recent)
        {
            try
            {
                foreach (string token in tokens)
                {
                    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                        (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
            }
            catch (Exception)
            {
              return true;
            }
            return false;
        }
        #endregion
    }
}

I am getting an "Index out of range" error in the function CheckToken.
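One likely cause, judging only from the code above: CheckToken always reads `token[1]`, but ExtractTextFromPDFBytes also passes the single-character tokens `'` and `"`. Whenever the buffer character being compared equals such a token, evaluating `token[1]` goes past the end of the string. The following is a minimal sketch of a length-aware check, assuming the intent is "a token surrounded by whitespace at the end of the buffer". The names `TokenCheck`, `EndsWithToken` and `IsDelimiter` are hypothetical and not part of the original code.

```csharp
using System;

public static class TokenCheck
{
    // Whitespace characters treated as delimiters, as in the original CheckToken.
    private static bool IsDelimiter(char c)
    {
        return c == ' ' || c == (char)0x0d || c == (char)0x0a;
    }

    // Returns true when the buffer ends with: delimiter, token, delimiter.
    // Works for tokens of any length, including one-character operators.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        int last = recent.Length - 1;
        foreach (string token in tokens)
        {
            int start = last - token.Length;   // index of the token's first character
            if (token.Length == 0 || start < 1) continue;
            if (!IsDelimiter(recent[last]) || !IsDelimiter(recent[start - 1])) continue;

            bool match = true;
            for (int k = 0; k < token.Length; k++)
            {
                if (recent[start + k] != token[k]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }
}
```

For two-character tokens this inspects the same buffer positions as the original method. For one-character tokens it looks one position later than the original did, which is a design choice of this sketch rather than something the source specifies.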

Answer by Galwegian

You could use something like pdftotext, as outlined in the following article: How To Convert Pdf file to text in asp.net
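pdftotext is a command-line tool, so from C# it is typically started as an external process. The sketch below is not taken from the linked article. It assumes the executable is installed on the server and that the paths contain no double quotes. `PdfToTextRunner`, `BuildArguments` and `Convert` are hypothetical names; `-layout` and `-enc UTF-8` are pdftotext options.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the argument string: options first, then the input PDF, then the output text file.
    public static string BuildArguments(string pdfPath, string txtPath)
    {
        return "-layout -enc UTF-8 \"" + pdfPath + "\" \"" + txtPath + "\"";
    }

    // Runs pdftotext and returns true when it exits with code 0.
    // exePath is the full path to the pdftotext executable (an assumption of this sketch).
    public static bool Convert(string exePath, string pdfPath, string txtPath)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath, BuildArguments(pdfPath, txtPath));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

In an ASP.NET site the account running the application pool would need permission to execute the tool and to write the output file.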

Answer by Thorsten79

It depends on what you want to do. Converting PDFs to plain text without formatting can be done with pdftotext or similar tools. Converting a PDF layout to an HTML layout is very hard, because the PDF design philosophy is very different from how HTML layout works. Google has some sort of solution for it, but it will usually break the layout.

Regarding your CV concept: as CV layout is highly important for customers using a site, I would not want to auto-convert PDF CVs to HTML CVs. What pdftotext can offer you is plain text in which a CV search engine can find parts of the CV.
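If the extracted plain text still has to be shown as a web page, one simple option is to HTML-encode it and wrap it in a minimal document. This is only a sketch of that idea and does not preserve the PDF layout. `TextToHtml.Wrap` is a hypothetical helper, and `System.Net.WebUtility.HtmlEncode` assumes .NET 4.0 or later; on ASP.NET 3.5, `System.Web.HttpUtility.HtmlEncode` plays the same role.

```csharp
using System;
using System.Net;
using System.Text;

public static class TextToHtml
{
    // Wraps plain text in a minimal HTML page, encoding characters such as < and &.
    // The <pre> element keeps the line breaks produced by the text extractor.
    public static string Wrap(string title, string plainText)
    {
        StringBuilder html = new StringBuilder();
        html.Append("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>");
        html.Append(WebUtility.HtmlEncode(title));
        html.Append("</title></head><body><pre>");
        html.Append(WebUtility.HtmlEncode(plainText));
        html.Append("</pre></body></html>");
        return html.ToString();
    }
}
```

Encoding the text before embedding it matters for uploaded documents, since their content is untrusted input.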

Answer by necrostaz

You may convert the PDF to images with Ghostscript and C#, and then publish the images on the site. For details, see this article.
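The linked article is not reproduced here, so the following is only a rough sketch of one way to do this: calling the Ghostscript console executable as an external process. It assumes Ghostscript is installed (for example `gswin32c.exe` on Windows) and that the paths contain no double quotes. `GhostscriptRunner`, `BuildArguments` and `RenderPages` are hypothetical names; the switches shown are standard Ghostscript options.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptRunner
{
    // Builds arguments that render every page to a 24-bit PNG.
    // A pattern such as page-%03d.png is expanded by Ghostscript to one file per page.
    public static string BuildArguments(string pdfPath, string outputPattern, int dpi)
    {
        return "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r" + dpi +
               " \"-sOutputFile=" + outputPattern + "\" \"" + pdfPath + "\"";
    }

    // Runs Ghostscript and returns true when it exits with code 0.
    public static bool RenderPages(string gsExePath, string pdfPath, string outputPattern, int dpi)
    {
        ProcessStartInfo info = new ProcessStartInfo(gsExePath, BuildArguments(pdfPath, outputPattern, dpi));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

Images keep the CV layout intact, at the cost of making the text unsearchable unless the plain text is extracted separately.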