What is the best way to search through HTML in a C# string for specific text and mark the text?
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/456508/
Asked by Yttrium
What would be the best way to search through HTML inside a C# string variable to find a specific word or phrase and mark (or wrap) it with a highlight?
Thanks,
Jeff
Answer by Eddie Parker
A regular expression would be my way. ;)
Answer by Greg Leaver
For searching strings, you'll want to look up regular expressions. As for marking the match, once you have the position of the substring it should be simple enough to use that position to insert something that wraps around the phrase.
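The position-based idea above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name SubstringHighlighter, the method Wrap, and the <mark> wrapper used in the usage note are all assumptions. It treats the input as plain text, so a phrase that also appears inside a tag or attribute would be wrapped too.

```csharp
using System;
using System.Text;

public static class SubstringHighlighter
{
    // Hypothetical helper: wraps every occurrence of phrase in text
    // with the given opening and closing markup.
    public static string Wrap(string text, string phrase, string open, string close)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(phrase))
            return text;

        var result = new StringBuilder();
        int start = 0;
        while (true)
        {
            // Position of the next match, ignoring case
            int index = text.IndexOf(phrase, start, StringComparison.OrdinalIgnoreCase);
            if (index < 0)
                break;

            result.Append(text, start, index - start); // text before the match
            result.Append(open);
            result.Append(text, index, phrase.Length); // the match, original casing kept
            result.Append(close);
            start = index + phrase.Length;
        }

        result.Append(text, start, text.Length - start); // whatever follows the last match
        return result.ToString();
    }
}
```

Usage would look something like SubstringHighlighter.Wrap(html, "book", "<mark>", "</mark>").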
Answer by MrTelly
If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?
An approach I used in the past is to use HTML Tidy to convert messy HTML to XHTML, and then use XSL/XPath to screen-scrape content into a database, to create a reverse content management system.
Regular expressions would do it, but they could get complicated once you try stripping out tags, image names, etc. to remove false positives.
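As a rough sketch of the "load it as XML" idea, assuming the input is already well-formed XHTML: the example below uses LINQ to XML (XDocument) rather than the XPath/XSL the answer mentions, and the class name XhtmlHighlighter, the method Highlight, and the choice of a <mark> element are all assumptions for illustration. Because only text nodes are touched, attribute values such as image names are left alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlHighlighter
{
    // Hypothetical helper: wraps each occurrence of phrase found in
    // text nodes in a <mark> element. Attributes are never modified.
    public static string Highlight(string xhtml, string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
            return xhtml;

        // Throws XmlException if the input is not well-formed XML
        XDocument doc = XDocument.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Snapshot the text nodes first, because the tree is modified below
        List<XText> textNodes = doc.DescendantNodes().OfType<XText>().ToList();

        foreach (XText node in textNodes)
        {
            if (node.Parent == null)
                continue; // whitespace directly under the document

            string value = node.Value;
            // Reuse the parent's namespace so no xmlns="" is emitted
            XName markName = node.Parent.Name.Namespace + "mark";
            var parts = new List<object>();
            int start = 0;
            int index;
            while ((index = value.IndexOf(phrase, start, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                if (index > start)
                    parts.Add(new XText(value.Substring(start, index - start)));
                parts.Add(new XElement(markName, value.Substring(index, phrase.Length)));
                start = index + phrase.Length;
            }

            if (start == 0)
                continue; // no match in this text node

            if (start < value.Length)
                parts.Add(new XText(value.Substring(start)));
            node.ReplaceWith(parts.ToArray());
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Messy real-world HTML would first need converting to XHTML (for example with HTML Tidy, as mentioned above) before this could parse it.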
Answer by Gorkem Pacaci
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to the final rendered text, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags, leaving only the text content, and then do the usual replace, but it doesn't feel right.
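The "remove all tags, then replace" simplification could look roughly like this. It is a minimal sketch under stated assumptions: the class and method names are made up for illustration, and the tag-stripping pattern is deliberately crude (it would mishandle a ">" inside an attribute value, comments, or script content). It also shows why the approach "doesn't feel right": the original markup is thrown away.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RenderedTextHighlighter
{
    // Hypothetical helper: strips anything that looks like a tag, then
    // wraps each occurrence of phrase in <strong>. The original markup is lost.
    public static string StripTagsAndHighlight(string html, string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
            return html;

        // Crude tag removal: drops every <...> run
        string textOnly = Regex.Replace(html, "<[^>]+>", string.Empty);

        // Regex.Escape makes the phrase match literally, even if it
        // contains characters such as '.' or '+'
        string pattern = Regex.Escape(phrase);

        // $0 stands for the whole match, so the original casing is kept
        return Regex.Replace(textOnly, pattern, "<strong>$0</strong>", RegexOptions.IgnoreCase);
    }
}
```

With the example above, the span around the "B" disappears, the text becomes "Book", and the word is then wrapped as a whole.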
Answer by Matthew Dresser
Answer by Zen
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
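// Requires the HtmlAgilityPack package (using System; using HtmlAgilityPack;)
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// SelectNodes returns null when nothing matches, so check before iterating
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}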
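The same library could also be pointed at the original highlighting question. The following is only a sketch, not something from the answer: the class name AgilityPackHighlighter, the method Highlight, and the <mark> wrapper are assumptions. It relies on HtmlTextNode.Text being settable and written out as raw markup, which is my understanding of Html Agility Pack's default behaviour; check it against the version you use. Text nodes hold the HTML-encoded source, so a phrase containing characters like & would need encoding first.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class AgilityPackHighlighter
{
    // Hypothetical helper: wraps each occurrence of phrase in <mark>,
    // looking only at text nodes so tags and attributes are not touched.
    public static string Highlight(string html, string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
            return html;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there are no matching nodes
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return html;

        string pattern = Regex.Escape(phrase);
        foreach (HtmlNode node in textNodes)
        {
            // Leave script and style content alone
            string parentName = node.ParentNode?.Name;
            if (parentName == "script" || parentName == "style")
                continue;

            if (node is HtmlTextNode textNode)
            {
                // Assumption: the new Text value is emitted as-is in OuterHtml
                textNode.Text = Regex.Replace(
                    textNode.Text, pattern, "<mark>$0</mark>", RegexOptions.IgnoreCase);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the XML route, this does not require the input to be well-formed XHTML, but a phrase split across tags (such as the 'Book' example earlier in the thread) would still be missed.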