How to convert PDF into HTML using C#
Original source: http://stackoverflow.com/questions/2295555/
Note: this page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Radhi
I have to read PDFs and create HTML documents from them, for CVs uploaded to my site. I cannot use any shareware. Can anybody suggest the best solution for converting PDF to HTML, or for reading PDF content, using C#?
The site is developed in C#, ASP.NET 3.5.
Accepted answer by Radhi
I got some code from here and downloaded the iTextSharp DLL. The code worked fine, with one problem: most of my files are converted into text, HTML, or whatever format I want, but there is one file I am not able to convert. If anybody can try this code and help me find out what the problem is, I'll be thankful.
You can see the code here:
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;
namespace PDFReader
{
/// <summary>
/// Parses a PDF file and extracts the text from it.
/// </summary>
public class PDFParser
{
/// BT = Beginning of a text object operator
/// ET = End of a text object operator
/// Td move to the start of next line
/// 5 Ts = superscript
/// -5 Ts = subscript
#region Fields
#region _numberOfCharsToKeep
/// <summary>
/// The number of characters to keep, when extracting text.
/// </summary>
private static int _numberOfCharsToKeep = 15;
#endregion
#endregion
#region ExtractText
/// <summary>
/// Extracts a text from a PDF file.
/// </summary>
/// <param name="inFileName">the full path to the pdf file.</param>
/// <param name="outFileName">the output file name.</param>
/// <returns>true if the text was extracted and written to the output file; otherwise false</returns>
public bool ExtractText(string inFileName, string outFileName)
{
StreamWriter outFile = null;
try
{
// Create a reader for the given PDF file
PdfReader reader = new PdfReader(inFileName);
//outFile = File.CreateText(outFileName);
outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);
Console.Write("Processing: ");
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");
// Write the progress.
if (charUnit >= 1.0f)
{
for (int i = 0; i < (int)charUnit; i++)
{
Console.Write("#");
totalWritten++;
}
}
else
{
curUnit += charUnit;
if (curUnit >= 1.0f)
{
for (int i = 0; i < (int)curUnit; i++)
{
Console.Write("#");
totalWritten++;
}
curUnit = 0;
}
}
}
if (totalWritten < totalLen)
{
for (int i = 0; i < (totalLen - totalWritten); i++)
{
Console.Write("#");
}
}
return true;
}
catch
{
return false;
}
finally
{
if (outFile != null) outFile.Close();
}
}
#endregion
#region ExtractTextFromPDFBytes
/// <summary>
/// This method processes an uncompressed Adobe (text) object
/// and extracts text.
/// </summary>
/// <param name="input">uncompressed</param>
/// <returns></returns>
public string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing whether we are currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep the previous chars so that operators, numbers etc. can be recognized:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (input[i] == 213)
c = "'".ToCharArray()[0];
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\n\r";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
{
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 &&
CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
resultString += c.ToString();
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) ||
((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return CleanupContent(resultString);
}
catch
{
return "";
}
}
private string CleanupContent(string text)
{
// Each pattern matches a literal backslash escape left in the text, e.g. \( or \226.
// The octal codes are mapped to the characters they denote in Windows-1252.
string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };
for (int i = 0; i < patterns.Length; i++)
{
string regExPattern = patterns[i];
Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
text = regex.Replace(text, replace[i]);
}
return text;
}
#endregion
#region CheckToken
/// <summary>
/// Check if a certain 2 character token just came along (e.g. BT)
/// </summary>
/// <param name="tokens">the searched token</param>
/// <param name="recent">the recent character array</param>
/// <returns></returns>
private bool CheckToken(string[] tokens, char[] recent)
{
try
{
foreach (string token in tokens)
{
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
(recent[_numberOfCharsToKeep - 2] == token[1]) &&
((recent[_numberOfCharsToKeep - 1] == ' ') ||
(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
((recent[_numberOfCharsToKeep - 4] == ' ') ||
(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
(recent[_numberOfCharsToKeep - 4] == 0x0a))
)
{
return true;
}
}
}
catch (Exception)
{
return true;
}
return false;
}
#endregion
}
}
I am getting an "Index out of range" error in the function CheckToken.
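One likely cause, judging only from the code above: CheckToken always reads token[0] and token[1], but ExtractTextFromPDFBytes also passes the one-character tokens "'" and "\"", so token[1] is out of range whenever the first character matches. The exception is swallowed by the catch block (which then returns true), but it still shows up in a debugger. A minimal sketch of a length-aware check follows. The names TokenCheck, EndsWithToken and IsDelimiter are hypothetical and not part of the original code, and unlike the original this version returns false instead of true when a token cannot fit.

```csharp
using System;

public static class TokenCheck
{
    // Hypothetical replacement for CheckToken: returns true when `recent`
    // ends with <delimiter><token><delimiter>, for a token of any length.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        int last = recent.Length - 1;
        // The most recent character must be a delimiter.
        if (last < 0 || !IsDelimiter(recent[last])) return false;

        foreach (string token in tokens)
        {
            // Index where the token would have to start in the buffer.
            int start = last - token.Length;
            // Skip empty tokens and tokens with no room for a leading delimiter.
            if (token.Length == 0 || start < 1) continue;
            if (!IsDelimiter(recent[start - 1])) continue;

            bool match = true;
            for (int k = 0; k < token.Length; k++)
            {
                if (recent[start + k] != token[k]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    // Same delimiters as the original check: space, CR and LF.
    private static bool IsDelimiter(char c)
    {
        return c == ' ' || c == '\r' || c == '\n';
    }
}
```

For two-character tokens such as "BT" this inspects the same four buffer positions as the original, so it could be called from the existing code as EndsWithToken(new string[] { "BT" }, previousCharacters).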
Answer by Galwegian
You could use something like pdftotext, as outlined in the following article: How To Convert Pdf file to text in asp.net
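As a rough illustration of the pdftotext approach, the tool can be started from C# and its output read back. This is only a sketch: it assumes the pdftotext executable is installed and reachable on the PATH, and the class and method names (PdfToTextRunner, BuildArguments, Extract) are made up for this example. The -enc and -layout options and the use of "-" to write to standard output are standard pdftotext options.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PdfToTextRunner
{
    // Builds the command line: UTF-8 output, original layout kept,
    // and "-" as the output file so the text goes to standard output.
    public static string BuildArguments(string pdfPath)
    {
        return "-enc UTF-8 -layout \"" + pdfPath + "\" -";
    }

    // Runs pdftotext on the given file and returns the extracted text.
    public static string Extract(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",              // assumption: found via PATH
            Arguments = BuildArguments(pdfPath),
            RedirectStandardOutput = true,
            UseShellExecute = false,             // required for redirection
            CreateNoWindow = true,
            StandardOutputEncoding = Encoding.UTF8
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read everything before waiting, so a full pipe cannot block the tool.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

On a web server, the account running the site needs permission to start the process and to read the uploaded file.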
Answer by Thorsten79
It depends on what you want to do. Converting PDFs to plain text without formatting can be done with pdftotext or similar solutions. Converting a PDF layout to an HTML layout is very hard, because the PDF design philosophy is very different from how HTML layout works. Google has some sort of solution for it, but it will usually break the layout.
Regarding your CV concept: as CV layout is highly important for customers using a site, I would not want to auto-convert PDF CVs to HTML CVs. What pdftotext can offer you is plain text in which a CV search engine can find parts of the CV.
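If plain text is acceptable and it only needs to be shown on a page, one simple option is to HTML-encode the extracted text and wrap each paragraph in a p element. The sketch below assumes paragraphs are separated by blank lines. PlainTextToHtml and Wrap are hypothetical names, and no attempt is made to reproduce the PDF layout.

```csharp
using System;
using System.Text;
using System.Web;

public static class PlainTextToHtml
{
    // Wraps already-extracted plain text in a minimal HTML page.
    public static string Wrap(string title, string text)
    {
        var html = new StringBuilder();
        html.Append("<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<title>");
        html.Append(HttpUtility.HtmlEncode(title));
        html.Append("</title>\n</head>\n<body>\n");

        // Normalize line endings, then treat blank lines as paragraph breaks.
        string normalized = text.Replace("\r\n", "\n");
        string[] paragraphs = normalized.Split(
            new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string paragraph in paragraphs)
        {
            // Encode first so that '<' or '&' in a CV cannot inject markup,
            // then turn the remaining single line breaks into <br>.
            string encoded = HttpUtility.HtmlEncode(paragraph.Trim())
                                        .Replace("\n", "<br>");
            html.Append("<p>").Append(encoded).Append("</p>\n");
        }

        html.Append("</body>\n</html>\n");
        return html.ToString();
    }
}
```

In an ASP.NET 3.5 project, HttpUtility comes from the System.Web assembly, which web projects already reference.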