C#: extracting the HTML body with a regular expression

Question

Asked by Bruce Adams

How can I use a regex to extract the body from an HTML document, given that the html and body tags might be uppercase, lowercase, or missing entirely?

Answer 1

Accepted answer by Andrew Hare

Don't use a regular expression for this. Use something like the Html Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one in System.Xml, but for HTML documents (or streams).

Then you can extract the body with an XPath expression.
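The parser-based approach could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HapBodyExtractor, ExtractBody) are made up for illustration and are not part of the library. As far as I know, the Html Agility Pack exposes element names in lowercase, so the lowercase XPath should also find a tag written as BODY.

```csharp
using HtmlAgilityPack;

public static class HapBodyExtractor
{
    // Hypothetical helper: returns the inner HTML of the first <body> element,
    // or null when the document has no body element at all.
    public static string ExtractBody(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string; malformed markup is tolerated

        // "//body" selects the first body element anywhere in the document.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

        // SelectSingleNode returns null when nothing matches.
        return body?.InnerHtml;
    }
}
```

A null result signals that no body element was found, so the caller can decide whether to fall back to the whole document.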

Answer 2

Answered by Jeremy Stein

This should get you pretty close:

(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)
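To show how a pattern like this might be used from C#, here is a minimal sketch. It uses a variant of the pattern above in which the attribute group is optional, so that a bare body tag with no attributes also matches. The BodyExtractor class and ExtractBody method are hypothetical names, and returning the whole input when no body tag exists is an assumption of this sketch, not something the answer prescribes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // (?is) turns on IgnoreCase and Singleline inline.
    // The lazy (.*?) stops at the first </body>, </html>, or the end of the input.
    private static readonly Regex BodyPattern = new Regex(
        @"(?is)<body(?:\s[^>]*)?>(.*?)(?:</\s*body\s*>|</\s*html\s*>|$)");

    // Returns the text captured after the opening body tag.
    // Assumption: when there is no body tag, the input is returned unchanged.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : html;
    }
}
```

With this sketch, ExtractBody("<HTML><BODY class=\"x\">Hello</BODY></HTML>") yields Hello, ExtractBody("<body>No closing tag") yields No closing tag, and input without any body tag comes back unchanged.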

Answer 3

Answered by Darryl

How about something like this?

It captures everything between the <body> and </body> tags (case-insensitively, thanks to RegexOptions.IgnoreCase) into a group named theBody.

RegexOptions.Singleline makes the dot match newlines as well, which lets us handle multiline HTML as a single string.

If the HTML does not contain <body></body> tags, the Success property of the match will be false.

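        using System.Text.RegularExpressions;

        // Populate the html string here (sample value so the snippet compiles)
        string html = "<html><BODY>Hello</BODY></html>";

        // IgnoreCase matches <BODY> as well as <body>; Singleline lets '.' span newlines
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        // Note: this pattern only matches a bare <body> tag with no attributes
        Regex regx = new Regex( "<body>(?<theBody>.*)</body>", options );

        Match match = regx.Match( html );

        if ( match.Success ) {
            // The named group holds everything between the two tags
            string theBody = match.Groups["theBody"].Value;
        }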
