C# regular expression to extract the HTML body
Original URL: http://stackoverflow.com/questions/982510/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Regex Extract html Body
Asked by Bruce Adams
How would I use Regex to extract the body from an HTML doc, taking into account that the html and body tags might be in uppercase, lowercase, or might not exist?
Accepted answer by Andrew Hare
Don't use a regular expression for this - use something like the Html Agility Pack.
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Then you can extract the body with an XPath expression.
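As a rough illustration of this answer, here is a minimal sketch that loads a string with Html Agility Pack and selects the body node by XPath. It assumes the HtmlAgilityPack NuGet package is referenced; the class name BodyExtractor and the sample markup are made up for the example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class BodyExtractor
{
    // Returns the inner HTML of the first body element, or null when the
    // document has no body tag at all.
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Html Agility Pack exposes element names in lowercase, so this
        // XPath also finds <BODY> or <Body>.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        return body?.InnerHtml;
    }

    public static void Main()
    {
        // Hypothetical sample input with uppercase tags.
        string html = "<HTML><BODY><p>Hello</p></BODY></HTML>";
        Console.WriteLine(ExtractBody(html) ?? "(no body tag found)");
    }
}
```

Returning null for a missing body is only one possible choice; a caller that wants the whole fragment in that case could fall back to doc.DocumentNode.InnerHtml instead.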
Answer by Jeremy Stein
This should get you pretty close:
(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)
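To show how a pattern like this might be used from C#, here is a small sketch. The attribute group is written as optional so that a bare body tag also matches; the class name RegexBodyDemo and the sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBodyDemo
{
    // (?is) turns on case-insensitive matching and lets . match newlines.
    // The attribute group is optional here so that a bare <body> matches too.
    private static readonly Regex BodyPattern = new Regex(
        @"(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)");

    // Returns the captured body text, or null when no opening body tag exists.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical inputs: mixed-case tags with an attribute, and a missing closing tag.
        Console.WriteLine(ExtractBody("<HTML><BODY class=\"x\">Hello</BODY></HTML>"));
        Console.WriteLine(ExtractBody("<body>unterminated"));
    }
}
```

The lazy (.*?) stops at the first closing body tag, closing html tag, or end of input, which is how the pattern copes with documents that never close the body.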
Answer by Darryl
How about something like this?
It captures everything between <body></body> tags (case insensitive due to RegexOptions.IgnoreCase) into a group named theBody.
RegexOptions.Singleline makes . match newline characters as well, which allows us to handle multiline HTML as a single string.
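The effect of Singleline can be seen with a short experiment. This is only a sketch: the Matches helper and the sample markup are hypothetical, and the pattern is the one used in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: does the body pattern match under the given options?
    public static bool Matches(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, "<body>(?<theBody>.*)</body>", options);
    }

    public static void Main()
    {
        // Hypothetical sample: the body spans several lines.
        string html = "<body>\nline one\nline two\n</body>";

        // Without Singleline, . stops at each \n, so .* cannot reach </body>.
        Console.WriteLine(Matches(html, RegexOptions.IgnoreCase));
        // With Singleline, . also matches \n and the whole body is captured.
        Console.WriteLine(Matches(html, RegexOptions.IgnoreCase | RegexOptions.Singleline));
    }
}
```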
If the HTML does not contain <body></body> tags, the Success property of the match will be false.
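using System.Text.RegularExpressions;

string html = ""; // Populate the html string here
// IgnoreCase handles <BODY> vs <body>; Singleline lets . match newlines
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
// Note: this matches only a bare <body> tag; attributes on the tag
// would need something like <body[^>]*> instead
Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );
Match match = regx.Match( html );
if ( match.Success ) {
    // Everything between the opening and closing body tags
    string theBody = match.Groups["theBody"].Value;
}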
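For a quick check of the Success behaviour described in this answer, the snippet can be wrapped in a method and run against inputs with and without body tags. This is a sketch only; the TryGetBody name and the sample strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical wrapper around the snippet above: returns the text captured
    // by the theBody group, or null when Match.Success is false.
    public static string TryGetBody(string html)
    {
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        Match match = Regex.Match(html, "<body>(?<theBody>.*)</body>", options);
        return match.Success ? match.Groups["theBody"].Value : null;
    }

    public static void Main()
    {
        // Uppercase tags still match because of IgnoreCase.
        Console.WriteLine(TryGetBody("<BODY>Hello</BODY>") ?? "(no match)");
        // No body tags at all, so Success is false.
        Console.WriteLine(TryGetBody("<p>fragment only</p>") ?? "(no match)");
    }
}
```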