How do I convert HTML to RTF (Rich Text) in .NET without paying for a component?
Source: http://stackoverflow.com/questions/150208/ (Stack Overflow, licensed under CC BY-SA 4.0; attribute it to the original authors)
Asked by Josh Kodroff
Is there a free third-party or .NET class that will convert HTML to RTF (for use in a rich-text enabled Windows Forms control)?
The "free" requirement comes from the fact that I'm only working on a prototype: I can just load the BrowserControl and render the HTML if need be (even if it is slow), and Developer Express is going to release its own such control soon-ish.
I don't want to learn to write RTF by hand, and I already know HTML, so I figure this is the quickest way to get some demonstrable code out the door.
Answered by Spartaco
Actually there is a simple and free solution: use your browser. This is the trick I used:
    var webBrowser = new WebBrowser();
    webBrowser.CreateControl(); // only if needed
    webBrowser.DocumentText = yourHtmlString;
    // Pump messages until the control reports the document it was given
    while (webBrowser.DocumentText != yourHtmlString)
        Application.DoEvents();
    // Copy the rendered document to the clipboard, then paste it as rich text
    webBrowser.Document.ExecCommand("SelectAll", false, null);
    webBrowser.Document.ExecCommand("Copy", false, null);
    yourRichTextControl.Paste();
This could be slower than other methods but at least it's free and works!
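The polling loop above depends on the control echoing back exactly the string it was given. A minimal sketch of the same clipboard trick that waits for the WebBrowser.DocumentCompleted event instead, and runs on its own STA thread so it can be called from any thread. The class and method names are hypothetical, and the sketch assumes DocumentCompleted fires once the HTML assigned to DocumentText has finished loading. Like the original trick, it overwrites whatever is on the clipboard.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of any library.
public static class ClipboardHtmlToRtf
{
    public static string Convert(string html)
    {
        string rtf = null;
        var worker = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            using (var box = new RichTextBox())
            {
                browser.ScriptErrorsSuppressed = true;
                browser.DocumentCompleted += (sender, e) =>
                {
                    // Same steps as the answer above: select, copy, paste.
                    browser.Document.ExecCommand("SelectAll", false, null);
                    browser.Document.ExecCommand("Copy", false, null);
                    box.Paste();
                    rtf = box.Rtf;
                    // End the message loop started by Application.Run() below.
                    Application.ExitThread();
                };
                browser.DocumentText = html;
                // The control needs a message loop to load the document.
                Application.Run();
            }
        });
        // WebBrowser and the clipboard both require a single-threaded apartment.
        worker.SetApartmentState(ApartmentState.STA);
        worker.Start();
        worker.Join();
        return rtf;
    }
}
```

Calling ClipboardHtmlToRtf.Convert(html) blocks until the page has loaded; there is no timeout in this sketch, so a document that never completes would block the caller.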
Answered by Jonathan Parker
Check out this CodeProject article on XHTML2RTF.
Answered by cjbarth
Expanding on Spartaco's answer, I implemented the following, which works great!
    Using reportWebBrowser As New WebBrowser
        reportWebBrowser.CreateControl()
        reportWebBrowser.DocumentText = sbHTMLDoc.ToString
        While reportWebBrowser.DocumentText <> sbHTMLDoc.ToString
            Application.DoEvents()
        End While
        reportWebBrowser.Document.ExecCommand("SelectAll", False, Nothing)
        reportWebBrowser.Document.ExecCommand("Copy", False, Nothing)
        Using reportRichTextBox As New RichTextBox
            reportRichTextBox.Paste()
            reportRichTextBox.SaveFile(DocumentFileName)
        End Using
    End Using
Answered by Andrew
It is not perfect, of course, but here is the code I use to convert HTML to plain text.
(I was not the original author; I adapted it from code found on the web.)
    using System.Text.RegularExpressions;

    public static string ConvertHtmlToText(string source)
    {
        string result;

        // Remove HTML source formatting.
        // Replace line breaks with spaces because browsers insert a space.
        result = source.Replace("\r", " ");
        result = result.Replace("\n", " ");
        // Remove tab indentation
        result = result.Replace("\t", string.Empty);
        // Collapse repeated spaces because browsers ignore them
        result = Regex.Replace(result, @"( )+", " ");

        // Remove the header (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*head([^>])*>", "<head>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*head( )*>)", "</head>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(<head>).*(</head>)", string.Empty, RegexOptions.IgnoreCase);

        // Remove all scripts (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*script( )*>)", "</script>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<script>).*(</script>)", string.Empty, RegexOptions.IgnoreCase);

        // Remove all styles (prepare first by clearing attributes)
        result = Regex.Replace(result, @"<( )*style([^>])*>", "<style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"(<( )*(/)( )*style( )*>)", "</style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(<style>).*(</style>)", string.Empty, RegexOptions.IgnoreCase);

        // Insert tabs in place of <td> tags
        result = Regex.Replace(result, @"<( )*td([^>])*>", "\t", RegexOptions.IgnoreCase);

        // Insert line breaks in place of <br> and <li> tags
        result = Regex.Replace(result, @"<( )*br( )*>", "\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*li( )*>", "\r", RegexOptions.IgnoreCase);

        // Insert paragraphs (double line breaks) in place of <p>, <div> and <tr> tags
        result = Regex.Replace(result, @"<( )*div([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*tr([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"<( )*p([^>])*>", "\r\r", RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc.: anything that's enclosed inside < >
        result = Regex.Replace(result, @"<[^>]*>", string.Empty, RegexOptions.IgnoreCase);

        // Replace special characters (HTML entities)
        result = Regex.Replace(result, @"&nbsp;", " ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&bull;", " * ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lsaquo;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&rsaquo;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&trade;", "(tm)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&frasl;", "/", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lt;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&gt;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&copy;", "(c)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&reg;", "(r)", RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = Regex.Replace(result, @"&(.{2,6});", string.Empty, RegexOptions.IgnoreCase);

        // Make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // First remove any whitespace between the escaped characters
        // and redundant tabs between line breaks.
        result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\t)", "\t\t", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\r)", "\t\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\r)( )+(\t)", "\r\t", RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", RegexOptions.IgnoreCase);
        // Replace multiple tabs following a line break with just one tab
        result = Regex.Replace(result, "(\r)(\t)+", "\r\t", RegexOptions.IgnoreCase);

        // Initial replacement target string for line breaks
        string breaks = "\r\r\r";
        // Initial replacement target string for tabs
        string tabs = "\t\t\t\t\t";
        for (int index = 0; index < result.Length; index++)
        {
            result = result.Replace(breaks, "\r\r");
            result = result.Replace(tabs, "\t\t\t\t");
            breaks = breaks + "\r";
            tabs = tabs + "\t";
        }

        // That's it.
        return result;
    }
Answered by NtFreX
TL;DR: I recommend using the OpenXml format and the HtmlToOpenXml NuGet package if possible.
Microsoft Word COM
I haven't really looked much into this option, as my use case is to run the conversion on a server, which makes COM components a poor choice.
XHTML2RTF
As @JonathanParker mentioned, you can use this CodeProject library.
Disadvantages are:
- Limited HTML and CSS support
- Not really .NET
- ...
Windows Forms Web Browser
As @Spartaco mentioned, you can use the Windows Forms WebBrowser control.
Disadvantages are:
- Reference to System.Windows.Forms
- Uses copy & paste (problematic for multithreading)
- Only works in an STA thread
Not supported features include:
- Fonts
- Colors
- Numbered lists
- Strikethrough (del element)
- ...
DevExpress
Code sample by "Paul V" from the DevExpress support center (03.02.2015).
    public String ConvertRTFToHTML(String RTF)
    {
        // Write the RTF into a stream the importer can read from the start
        MemoryStream ms = new MemoryStream();
        StreamWriter writer = new StreamWriter(ms);
        writer.Write(RTF);
        writer.Flush();
        ms.Position = 0;
        String output = "";
        HtmlEditorExtension.Import(HtmlEditorImportFormat.Rtf, ms, (s, enumerable) => output = s);
        return output;
    }

    public String ConvertHTMLToRTF(String html)
    {
        MemoryStream ms = new MemoryStream();
        var editor = new ASPxHtmlEditor { Html = html };
        editor.Export(HtmlEditorExportFormat.Rtf, ms);
        // Rewind before reading the exported RTF back as text
        ms.Position = 0;
        StreamReader reader = new StreamReader(ms);
        return reader.ReadToEnd();
    }
Or you could use the RichEditDocumentServer type as shown in this example.
- A DevExpress license can cost from around 1500 USD to 2200 USD.
It is unknown what is actually supported.
Disadvantages are:
- Price
- Quite a lot of references for one small thing
- More?
Not supported features include:
- Strikethrough (del element)
Sautinsoft
    public string ConvertHTMLToRTF(string html)
    {
        SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
        return h.ConvertString(html);
    }

    public string ConvertRTFToHTML(string rtf)
    {
        SautinSoft.RtfToHtml r = new SautinSoft.RtfToHtml();
        byte[] bytes = Encoding.ASCII.GetBytes(rtf);
        r.OpenDocx(bytes);
        return r.ToHtml();
    }
More examples and configuration options can be found here and here.
- A licence for this component can cost from 400 USD to 2000 USD.
- HTML 3.2
- HTML 4.01
- HTML 5
- CSS
- XHTML
Disadvantages are:
- I'm not sure how active the development is
- Price
Usage knowledgebase:
- Converting numbered lists from the trix Angular editor destroys the indent
DIY
If you only want to support limited functionality, you could write your own converter. I would not recommend this if the supported feature set is too large.
I have a small sample project here, but in its current state it is only for educational purposes.
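To give an idea of what such a do-it-yourself converter involves, here is a minimal sketch. It is not the sample project mentioned above: the class name is hypothetical, and it assumes only <b>, <strong>, <i>, <em>, <u>, <br> and <p> need to be supported. Every other tag is silently dropped, and the font table is hard-coded to Arial.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; handles only a tiny subset of HTML.
public static class TinyHtmlToRtf
{
    public static string Convert(string html)
    {
        var rtf = new StringBuilder(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0 ");
        // Split the input into tags and text runs.
        foreach (Match token in Regex.Matches(html, @"<[^>]+>|[^<]+"))
        {
            string value = token.Value;
            if (value[0] != '<')
            {
                // Text run: decode entities such as &amp;, then escape for RTF.
                AppendText(rtf, WebUtility.HtmlDecode(value));
                continue;
            }
            string name = Regex.Match(value, @"^</?\s*(\w+)").Groups[1].Value.ToLowerInvariant();
            bool closing = value.StartsWith("</", StringComparison.Ordinal);
            switch (name)
            {
                case "b": case "strong": rtf.Append(closing ? @"\b0 " : @"\b "); break;
                case "i": case "em": rtf.Append(closing ? @"\i0 " : @"\i "); break;
                case "u": rtf.Append(closing ? @"\ulnone " : @"\ul "); break;
                case "br": rtf.Append(@"\line "); break;
                case "p": if (closing) rtf.Append(@"\par "); break;
                // Any other tag is ignored.
            }
        }
        return rtf.Append('}').ToString();
    }

    private static void AppendText(StringBuilder rtf, string text)
    {
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}') rtf.Append('\\').Append(c);
            else if (c == '\r' || c == '\n') rtf.Append(' '); // HTML treats these as spaces
            else if (c > 127) rtf.Append(@"\u").Append(unchecked((short)c)).Append('?');
            else rtf.Append(c);
        }
    }
}
```

For example, TinyHtmlToRtf.Convert("<p>Hello <b>world</b></p>") returns {\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0 Hello \b world\b0 \par }, which can be assigned to RichTextBox.Rtf. Lists, tables, fonts and colors each need their own RTF control words, which is why this approach stops being practical once the feature set grows.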
OpenXml
If the OpenXml format is also OK for your use case, you can use the HtmlToOpenXml NuGet package. It's free and supported all the features I tested the other solutions against.
The project is based on the Open XML SDK by Microsoft and seems active.
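    using System.IO;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;
    using HtmlToOpenXml;

    public static byte[] ConvertHtmlToOpenXml(string html)
    {
        using (var generatedDocument = new MemoryStream())
        {
            using (var package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
            {
                // A freshly created package has no main part yet, so add an empty document
                var mainPart = package.MainDocumentPart;
                if (mainPart == null)
                {
                    mainPart = package.AddMainDocumentPart();
                    new Document(new Body()).Save(mainPart);
                }

                var converter = new HtmlConverter(mainPart);
                converter.ParseHtml(html);
                mainPart.Document.Save();
            }

            // Read the bytes only after the package has been disposed and flushed
            return generatedDocument.ToArray();
        }
    }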
Answered by GvS
Maybe what you need is a control to edit the HTML?
Answered by Jacek Krawczyk
I recommend a console tool named Pandoc. It is not a component; it is rather a huge conversion pack. I am using it to convert between HTML and LaTeX. It is just awesome.
You can find the full list of supported formats on the program page.
To convert an HTML document to RTF format, run this on the console:
    pandoc filename.html -f html -t rtf -s -o filename.rtf
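Since the question is about .NET, the same command can be launched from code. A minimal sketch, assuming pandoc is installed and on the PATH; the class and method names are hypothetical, and paths that themselves contain double quotes are not handled.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around the pandoc command line shown above.
public static class PandocRunner
{
    // Builds the same invocation as the console command.
    public static ProcessStartInfo BuildStartInfo(string inputPath, string outputPath)
    {
        return new ProcessStartInfo
        {
            FileName = "pandoc", // assumes pandoc is on the PATH
            Arguments = $"\"{inputPath}\" -f html -t rtf -s -o \"{outputPath}\"",
            UseShellExecute = false,      // required to redirect streams
            RedirectStandardError = true, // capture pandoc's error messages
            CreateNoWindow = true
        };
    }

    public static void ConvertFile(string inputPath, string outputPath)
    {
        using (var process = Process.Start(BuildStartInfo(inputPath, outputPath)))
        {
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("pandoc failed: " + errors);
        }
    }
}
```

PandocRunner.ConvertFile("filename.html", "filename.rtf") then writes the RTF file next to the input, and the result can be loaded into a RichTextBox with LoadFile.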