C#: RegEx matching HTML tags and extracting text
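Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use it, you must also follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/299942/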
RegEx matching HTML tags and extracting text
Asked by Jon Tackabury
I have a string of text like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
Accepted answer by Tjofras
I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
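A minimal sketch of how this pattern could be combined with the MatchEvaluator mentioned in the question. The class name `CustomTagRewriter`, the method `AppendToTagText`, and the `suffix` parameter are hypothetical names chosen for illustration; they are not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomTagRewriter
{
    // The pattern from the answer: lazily capture the text between the tags.
    private static readonly Regex TagPattern =
        new Regex("<customtag>(.+?)</customtag>");

    // Hypothetical helper: appends `suffix` to the text inside every <customtag>.
    public static string AppendToTagText(string input, string suffix)
    {
        // The lambda is converted to a MatchEvaluator and invoked once per match.
        MatchEvaluator evaluator = match =>
            "<customtag>" + match.Groups[1].Value + suffix + "</customtag>";
        return TagPattern.Replace(input, evaluator);
    }
}
```

Calling `CustomTagRewriter.AppendToTagText("<customtag>hey</customtag>", ", this is changed!")` should produce the target string from the question. Note that `.` does not match a newline unless `RegexOptions.Singleline` is passed, so this sketch assumes the tag content sits on one line.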
Answer by Bill Karwin
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
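To make the nesting problem concrete, here is a small sketch that applies the lazy pattern from the accepted answer to nested input. The class `NestedTagDemo` and its method are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Returns what the lazy pattern captures for the first <customtag> it finds,
    // or an empty string when nothing matches.
    public static string FirstCapture(string input)
    {
        Match match = Regex.Match(input, "<customtag>(.+?)</customtag>");
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

For the input `<customtag>a<customtag>b</customtag>c</customtag>`, the capture is `a<customtag>b`: the match stops at the first closing tag, so the outer element is cut in half rather than matched as a whole.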
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tag attributes or content.
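As a rough sketch of the DOM approach, the following uses `System.Xml.XmlDocument` to append text inside every `customtag` element. It assumes the input is well-formed XML or XHTML (real-world HTML often is not), and the names `DomTagRewriter` and `AppendToTagText` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class DomTagRewriter
{
    // Hypothetical helper: appends `suffix` as a text node at the end of every <customtag>.
    public static string AppendToTagText(string xml, string suffix)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the markup is not well-formed

        // Copy the matches into a list first: the node list returned by
        // GetElementsByTagName is live and should not be enumerated while the tree changes.
        List<XmlNode> tags = doc.GetElementsByTagName("customtag")
                                .Cast<XmlNode>()
                                .ToList();

        foreach (XmlNode tag in tags)
        {
            // Appending a text node leaves any nested child elements intact.
            tag.AppendChild(doc.CreateTextNode(suffix));
        }

        return doc.OuterXml;
    }
}
```

Unlike the regular expression, this keeps working when the element contains other tags, because the parser tracks the nesting for you.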
Here are a couple of articles on how to use XSL with C#:
- http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
- http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
- http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
- http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
Answer by Jan Goyvaerts
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
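A sketch of how this stricter pattern might be used for the replacement. The capture group around `[^<>]*` is an addition made here so the existing text can be reused; it is not part of the pattern as posted. `StrictTagRewriter` and `AppendToTagText` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class StrictTagRewriter
{
    // The answer's pattern with a capture group added (an assumption of this sketch).
    private static readonly Regex TagPattern =
        new Regex("<customtag>([^<>]*)</customtag>");

    // Hypothetical helper: appends `suffix` only where the tag holds plain text.
    public static string AppendToTagText(string input, string suffix)
    {
        return TagPattern.Replace(input, match =>
            "<customtag>" + match.Groups[1].Value + suffix + "</customtag>");
    }
}
```

Because `[^<>]*` cannot cross an angle bracket, an element such as `<customtag>a<b>c</b></customtag>` is left untouched instead of being partially matched.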
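Answer by sajoshi
// Note: this snippet is JavaScript, not C#. Content is assumed to hold the input HTML string.
// Strip all HTML tags
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re, "");
// Remove all non-breaking spaces (U+00A0)
var x3 = x2.replace(/\u00a0/g, '');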
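Since the question is about C#, here is a rough equivalent of the snippet above using `Regex.Replace`. The class `TagStripper` and method `StripTags` are hypothetical names, and the same caveats about regex-based HTML handling from the other answers apply.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes anything that looks like a tag, then
    // removes non-breaking spaces (U+00A0), mirroring the two steps above.
    public static string StripTags(string content)
    {
        string withoutTags = Regex.Replace(content, "<[^>]*>", "");
        return withoutTags.Replace("\u00a0", "");
    }
}
```

This removes the tags but keeps their text content, so it suits text extraction rather than the in-place modification asked about in the question.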
Answer by Jake Drew
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little too heavyweight and complicated for my own needs. Instead, I create a web browser control in memory, load the page, and copy the text from it. (See the examples linked below.)
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/