How do I convert HTML to text in C#?

Question

Asked by Matt Crouch

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
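For readers who want to try the Lynx route, here is a minimal sketch of such a wrapper class. It assumes lynx.exe is installed and reachable on the PATH (or that a full path is passed in); the class name LynxDump and its method names are made up for illustration, and the argument quoting is deliberately simplistic.

```csharp
using System;
using System.Diagnostics;

public static class LynxDump
{
    // Builds the start info described in the edit above:
    // no shell execution, standard output redirected to this process.
    // The default "lynx.exe" is an assumption; pass a full path if needed.
    public static ProcessStartInfo BuildStartInfo(string htmlPathOrUrl, string lynxPath = "lynx.exe")
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // "-dump" makes Lynx render the page and write it to stdout.
            Arguments = "-dump \"" + htmlPathOrUrl + "\"",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns the rendered plain text.
    public static string Convert(string htmlPathOrUrl, string lynxPath = "lynx.exe")
    {
        using (Process process = Process.Start(BuildStartInfo(htmlPathOrUrl, lynxPath)))
        {
            // Read all output before waiting, so a full pipe buffer
            // cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Usage would be something like `string text = LynxDump.Convert(@"C:\temp\page.html");`. Reading stdout before calling WaitForExit matters for large pages, because the child can stall once the redirected pipe fills up.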

Answer 1

Accepted answer by FlySwat

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

Answer 2

Answered by EricSchaefer

The easiest approach would probably be tag stripping combined with replacing some tags with text layout elements, such as dashes for list items (li) and line breaks for br and p elements. It shouldn't be too hard to extend this to tables.
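A rough sketch of that idea, using only regular expressions and System.Net.WebUtility. The class name SimpleHtmlToText and the exact set of tags handled are assumptions for illustration; tables are not handled, and regexes will misbehave on sufficiently malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        string text = html;
        // Drop script and style blocks together with their contents.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Source whitespace is not significant in HTML: collapse it to single spaces.
        text = Regex.Replace(text, @"\s+", " ");
        // <br> becomes a line break.
        text = Regex.Replace(text, @" *<br\s*/?> *", "\n", RegexOptions.IgnoreCase);
        // Each list item starts on its own line with a dash.
        text = Regex.Replace(text, @" *<li\b[^>]*> *", "\n- ", RegexOptions.IgnoreCase);
        // Block-level boundaries become blank lines.
        text = Regex.Replace(text, @" *</?(p|div|ul|ol|h[1-6])\b[^>]*> *", "\n\n",
            RegexOptions.IgnoreCase);
        // Strip whatever tags are left.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);
        // Never keep more than one blank line in a row.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

The order matters: whitespace is collapsed before any line breaks are inserted, so the only newlines in the result are the ones the replacements put there.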

Answer 3

Answered by inspite

Have you tried http://www.aaronsw.com/2002/html2text/ ? It's Python, but open source.

Answer 4

Answered by jw.

I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/

Answer 5

Answered by madcolor

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

A sample on SO:

HTML Agility pack - parsing tables

Answer 6

Answered by crb

Another post suggests the HTML Agility Pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Answer 7

Answered by Richard

You could use this:

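    // Requires: using System.Text.RegularExpressions; using System.Web;
    public static string StripHTML(string HTMLText, bool decode = true)
    {
        // Remove anything that looks like a tag.
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        var stripped = reg.Replace(HTMLText, "");
        // Optionally turn entities such as &amp; back into characters.
        return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
    }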

Updated

Thanks for the comments; I have updated this function to improve it.

Answer 8

Answered by Brian Genisio

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

Answer 9

Answered by Maxim

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

Answer 10

Answered by ProNotion

I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
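To show the general shape of an XSLT-based conversion in .NET, here is a small sketch using XslCompiledTransform. The inline stylesheet is a tiny hypothetical stand-in, not the Markdown XSLT from the blog post, and the class name XsltHtmlToText is made up. It assumes the input is well-formed XML with no DOCTYPE and no default XHTML namespace.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltHtmlToText
{
    // Hypothetical minimal stylesheet: headings, paragraphs and list items only.
    // Elements without a template fall through to XSLT's built-in rules,
    // which simply output their text content.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:strip-space elements='*'/>
  <xsl:template match='h1'># <xsl:apply-templates/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
  <xsl:template match='p'><xsl:apply-templates/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
  <xsl:template match='li'>- <xsl:apply-templates/><xsl:text>&#10;</xsl:text></xsl:template>
</xsl:stylesheet>";

    public static string Convert(string xhtml)
    {
        var transform = new XslCompiledTransform();
        // Compile the stylesheet from the inline string.
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(styleReader);
        }
        // Run the transform and collect the text output in memory.
        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

In practice the stylesheet would be loaded from a file with `transform.Load(path)`, and real-world HTML would first have to be cleaned up into valid XML, as the answer points out.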
