Converting HTML entities to Unicode characters in C#

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13492497/

Time: 2020-08-10 08:46:12  Source: igfitidea

Tags: c#, windows-runtime, html-entities, html-encode

Asked by Remy

I found similar questions and answers for Python and JavaScript, but not for C# or any other WinRT-compatible language.

The reason I think I need this is that I'm displaying text I get from websites in a Windows 8 store app. E.g. &eacute; should become é.

Or is there a better way? I'm not displaying websites or RSS feeds, just a list of websites and their titles.

Accepted answer by Blachshma

I recommend using System.Net.WebUtility.HtmlDecode and NOT HttpUtility.HtmlDecode.

This is because the System.Web reference does not exist in WinForms/WPF/Console applications, and you can get exactly the same result with this class (which is already referenced in all those project types).

Usage:

string s = System.Net.WebUtility.HtmlDecode("&eacute;"); // Returns é
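
For the scenario in the question (a list of website titles), a minimal sketch might look like the following. The class name, the DecodeTitle helper, and the sample titles are assumptions made up for illustration; only System.Net.WebUtility.HtmlDecode comes from the answer.

```csharp
using System;
using System.Net;

public static class TitleDecodingDemo
{
    // Hypothetical helper: decodes named, decimal, and hexadecimal
    // character references in a single call.
    public static string DecodeTitle(string rawTitle)
    {
        return WebUtility.HtmlDecode(rawTitle);
    }

    public static void Main()
    {
        // Made-up titles as they might arrive from a web page.
        string[] rawTitles = { "Caf&eacute; Central", "Tom &amp; Jerry", "&#233;t&#xE9;" };
        foreach (string raw in rawTitles)
        {
            Console.WriteLine(DecodeTitle(raw));
        }
        // Prints:
        // Café Central
        // Tom & Jerry
        // été
    }
}
```

Entities that the decoder does not recognise are left in the text unchanged, so decoding a title that contains no entities is harmless.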

Answer by Mudassir Hasan

Use HttpUtility.HtmlDecode(). It is documented on MSDN.

decodedString = HttpUtility.HtmlDecode(myEncodedString);

Answer by user1954682

HTML entities and HTML numbers are encoded and decoded differently in a Metro app and a WP8 app.

With Windows Runtime Metro App

{
    string inStr = "ó";
    string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
    // auxStr == &#243;
    string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
    // outStr == ó
    string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
    // outStr2 == ó
}

With Windows Phone 8.0

{
    string inStr = "ó";
    string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
    // auxStr == ó
    string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
    // outStr == ó
    string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
    // outStr2 == ó
}

To solve this in WP8, I implemented the table from the HTML ISO-8859-1 Reference and apply it before calling System.Net.WebUtility.HtmlDecode().
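
The answer does not show its table, so the following is only a rough sketch of what such a pre-pass could look like. The class name, the regex, and the four table entries are assumptions chosen for illustration; a real ISO-8859-1 table has many more entries.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class Latin1EntityFallback
{
    // Illustrative subset only (assumption): entity name -> character.
    private static readonly Dictionary<string, string> Latin1 = new Dictionary<string, string>
    {
        { "eacute", "\u00E9" },
        { "oacute", "\u00F3" },
        { "Agrave", "\u00C0" },
        { "nbsp", "\u00A0" }
    };

    private static readonly Regex NamedEntity = new Regex("&([A-Za-z][A-Za-z0-9]*);");

    public static string Decode(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // Pass 1: replace the named entities found in the table,
        // leaving every other match exactly as it was.
        string preprocessed = NamedEntity.Replace(html, m =>
        {
            string ch;
            return Latin1.TryGetValue(m.Groups[1].Value, out ch) ? ch : m.Value;
        });

        // Pass 2: let the framework decode whatever remains.
        return WebUtility.HtmlDecode(preprocessed);
    }
}
```

Because the table lookup is case-sensitive, &Agrave; and &agrave; would need separate entries, which matches how HTML defines them.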

Answer by zumey

This might be useful: it replaces all named entities (as far as my requirements go) with their numeric Unicode equivalents.

    public string EntityToUnicode(string html) {
        var replacements = new Dictionary<string, string>();
        var regex = new Regex("(&[a-z]{2,5};)");
        foreach (Match match in regex.Matches(html)) {
            if (!replacements.ContainsKey(match.Value)) { 
                var unicode = HttpUtility.HtmlDecode(match.Value);
                if (unicode.Length == 1) {
                    replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                }
            }
        }
        foreach (var replacement in replacements) {
            html = html.Replace(replacement.Key, replacement.Value);
        }
        return html;
    }

Answer by hcoverlambda

This worked for me; it replaces both common (named) and Unicode (numeric) entities.

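// Requires: using System.Text.RegularExpressions; using System.Web;
private static readonly Regex HtmlEntityRegex = new Regex("&(#)?([a-zA-Z0-9]*);");

public static string HtmlDecode(this string html)
{
    // string.IsNullOrEmpty stands in for the custom IsNullOrEmpty() extension.
    if (string.IsNullOrEmpty(html)) return html;
    // Decimal references (&#39;) are converted directly. Anything that does not
    // parse as a decimal number (named or hex entities) goes to HttpUtility,
    // which avoids a FormatException on input such as &#x27;.
    return HtmlEntityRegex.Replace(html, x =>
        x.Groups[1].Value == "#" && int.TryParse(x.Groups[2].Value, out var code)
            ? ((char)code).ToString()
            : HttpUtility.HtmlDecode(x.Groups[0].Value));
}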

[Test]
[TestCase(null, null)]
[TestCase("", "")]
[TestCase("&#39;fark&#39;", "'fark'")]
[TestCase("&quot;fark&quot;", "\"fark\"")]
public void should_remove_html_entities(string html, string expected)
{
    html.HtmlDecode().ShouldEqual(expected);
}
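
One limitation of casting to char is that a single char cannot hold a code point above U+FFFF. The sketch below is an alternative for numeric references only, not the answer's method: it uses char.ConvertFromUtf32 so that such code points become a surrogate pair. The class name and regex are assumptions for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumericEntityDecoder
{
    // Group 1: decimal digits (&#39;). Group 2: hex digits (&#x27;).
    private static readonly Regex NumericRef =
        new Regex("&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));");

    public static string Decode(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        return NumericRef.Replace(html, m =>
        {
            int code;
            bool parsed = m.Groups[1].Success
                ? int.TryParse(m.Groups[1].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out code)
                : int.TryParse(m.Groups[2].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out code);

            // Reject values outside the Unicode range and lone surrogates,
            // because char.ConvertFromUtf32 throws on them.
            bool valid = parsed && code >= 0 && code <= 0x10FFFF
                         && (code < 0xD800 || code > 0xDFFF);

            // Invalid references are left in the text unchanged.
            return valid ? char.ConvertFromUtf32(code) : m.Value;
        });
    }
}
```

Named entities such as &quot; are deliberately not handled here and would still need HttpUtility.HtmlDecode or WebUtility.HtmlDecode.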

Answer by EminST

An improved version of zumey's method (I can't comment there). The longest entity name is &exclamation; (11 characters). Upper-case letters are also possible in entity names, e.g. &Agrave; (source: Wikipedia).

    public string EntityToUnicode(string html) {
        var replacements = new Dictionary<string, string>();
        var regex = new Regex("(&[a-zA-Z]{2,11};)");
        foreach (Match match in regex.Matches(html)) {
            if (!replacements.ContainsKey(match.Value)) { 
                var unicode = HttpUtility.HtmlDecode(match.Value);
                if (unicode.Length == 1) {
                    replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                }
            }
        }
        foreach (var replacement in replacements) {
            html = html.Replace(replacement.Key, replacement.Value);
        }
        return html;
    }
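
For projects without a System.Web reference, the same idea can be sketched with System.Net.WebUtility and a single Regex.Replace pass instead of a replacement dictionary. This is a variation, not the answer's code; the class and method names are made up, and the pattern is copied from the answer, so entity names containing digits (for example &frac12;) are not matched.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityNormalizer
{
    // Same pattern as the answer above: 2 to 11 ASCII letters.
    private static readonly Regex NamedEntity = new Regex("&[a-zA-Z]{2,11};");

    // Rewrites named entities as decimal numeric references.
    public static string EntityToNumeric(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        return NamedEntity.Replace(html, m =>
        {
            string decoded = WebUtility.HtmlDecode(m.Value);
            // Unknown names come back unchanged (longer than one char),
            // so they are left as they were.
            return decoded.Length == 1
                ? "&#" + (int)decoded[0] + ";"
                : m.Value;
        });
    }
}
```

Usage: EntityNormalizer.EntityToNumeric("Caf&eacute; &amp; &Agrave;") returns "Caf&#233; &#38; &#192;".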