C#: How do you convert HTML to plain text?

Note: This page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/286813/

Date: 2020-08-03 21:43:32  Source: igfitidea

How do you convert Html to plain text?

c# asp.net html

Asked by Stuart Helwig

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that Html as text only, no formatting, on a given page (actually just the first 30 - 50 characters but that's the easy bit).

How do I place the "text" within that Html into a string as straight text?

So this piece of code.

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Becomes:

Hello World. Is there anyone out there?

Accepted answer by vfilby

If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.
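
As a rough sketch of the non-greedy approach mentioned above (not a full HTML parser, and not taken from the original answer): first drop <script> and <style> blocks together with their contents, then strip the remaining tags. The class and method names are hypothetical, and the sketch assumes the blocks are not nested or malformed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper for illustration only.
    public static string StripTagsAndScripts(string html)
    {
        // Remove <script>...</script> and <style>...</style> including their contents.
        // ".*?" is non-greedy, so matching stops at the first matching closing tag;
        // Singleline lets "." match newlines inside the block.
        string withoutBlocks = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Strip whatever tags remain, using the pattern from the answer.
        return Regex.Replace(withoutBlocks, "<[^>]*>", string.Empty);
    }

    public static void Main()
    {
        // The script body contains a '<b>' literal that plain tag stripping would mishandle.
        Console.WriteLine(StripTagsAndScripts("<p>Hi<script>var a = '<b>';</script> there</p>"));
        // prints "Hi there"
    }
}
```

This still breaks on input such as a script whose string literal contains "</script>", which is where a real parser becomes the safer choice.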

If you can use regular expressions, there are many web pages out there with good info.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately, I don't know of a good one to recommend.

Answer by José Leal

public static string StripTags2(string html) { return html.Replace("<", "&lt;").Replace(">", "&gt;"); }

By this you escape all "<" and ">" in a string. Is this what you want?

Answer by Corey Trager

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

Answer by mpez0

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

Answer by George Stocker

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
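
The usage above relies on the Server object, which is only available inside an ASP.NET page or handler. As a hedged alternative for code running elsewhere (for example a console application), the sketch below uses System.Net.WebUtility, which provides equivalent HtmlEncode and HtmlDecode methods. The EncodeDemo class and its Encode wrapper are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Thin wrapper so the behaviour is easy to call and check.
    public static string Encode(string text)
    {
        // Converts characters such as < and > into &lt; and &gt;.
        return WebUtility.HtmlEncode(text);
    }

    public static void Main()
    {
        string encoded = Encode("This is a <Test String>.");
        Console.WriteLine(encoded);                        // This is a &lt;Test String&gt;.
        Console.WriteLine(WebUtility.HtmlDecode(encoded)); // This is a <Test String>.
    }
}
```

Note that encoding keeps the tags visible to the reader; it does not remove them, so it answers a different need than tag stripping.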

Answer by Judah Gabriel Himango

The free and open source HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html); // html is a string holding the markup

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello world!

Answer by WEFX

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. I'm posting this in case other newbies like myself stumble upon this question.

using System.Text.RegularExpressions;

Then...

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}
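
One thing to be aware of: replacing tags with an empty string glues adjacent text together, so the sample from the question would come out as "Hello World.Is there anyone out there?". A minimal sketch of one way around that, assuming it is acceptable to treat every tag as a word boundary, is to replace tags with a space, decode entities, and then collapse whitespace. The PlainTextPreview class and its method names are hypothetical; Truncate covers the "first 30 - 50 characters" part of the question.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextPreview
{
    // Strips tags, treating each tag as a word boundary.
    public static string ToPlainText(string html)
    {
        // Replace every tag with a single space instead of removing it outright.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; or &nbsp; back into characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns at most maxLength characters of the text.
    public static string Truncate(string text, int maxLength)
    {
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }

    public static void Main()
    {
        string html = "<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>";
        string plain = ToPlainText(html);
        Console.WriteLine(plain);               // Hello World. Is there anyone out there?
        Console.WriteLine(Truncate(plain, 12)); // Hello World.
    }
}
```

The trade-off is that inline tags inside a word (for example "wo<b>rl</b>d") would also gain spaces, so this suits block-level snippets better than heavily inline-formatted text.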

Answer by mikhail-t

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they are displayed correctly on a Crystal Report, and it works perfectly!

Answer by Amine

There is no method named 'ConvertToPlainText' in the HtmlAgilityPack, but you can convert an HTML string to a clean string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
// Assign the result back; Regex.Replace and string.Replace return new strings.
textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me, but I could not find a method named 'ConvertToPlainText' in 'HtmlAgilityPack'.

Answer by Ben Anderson

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text); 
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}