C#: Regular Expression to Extract HTML Body Content
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/356340/
Regular Expression to Extract HTML Body Content
Asked by Matthew Ruston
I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.
The XHTML that I need to parse will be very simple files; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.
Below is the expected structure of the HTML file that I have to parse. Since I know exactly what content the HTML files I will be working with contain, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
</p>
<p>
<br />
</p>
<h1>Header 1</h1>
</body>
</html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
Accepted answer by VonC
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
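A minimal C# sketch of how this last pattern could be fed to Regex.Split(), as the question intended. The class and method names are hypothetical. Two options are assumptions on top of the answer: RegexOptions.Singleline, so that . also matches line breaks, and RegexOptions.ExplicitCapture, because Regex.Split() otherwise adds the text captured by parenthesized groups to the array it returns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitBodyExample
{
    // The pattern from the answer above, unchanged.
    private const string Pattern = @"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)";

    // Returns the text between the body tags, or null if the split
    // does not produce the expected three pieces.
    public static string ExtractBody(string xhtml)
    {
        // Singleline: '.' also matches '\n'.
        // ExplicitCapture: unnamed groups do not capture, so Split()
        // returns only the text between the matches.
        string[] pieces = Regex.Split(
            xhtml, Pattern,
            RegexOptions.Singleline | RegexOptions.ExplicitCapture);

        // Expected layout: "" (before the head match), body, "" (after the tail match).
        return pieces.Length == 3 ? pieces[1] : null;
    }

    public static void Main()
    {
        string xhtml =
            "<html>\n<head><title></title></head>\n" +
            "<body contenteditable=\"true\">\n<p>Example paragraph content</p>\n</body>\n</html>";
        Console.WriteLine(ExtractBody(xhtml));
    }
}
```

Note that the second alternative ends in .+, so at least one character must follow the closing body tag for it to match.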
Answer by Karl
XHTML would be more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser would be able to quickly navigate to the body node and give you back its content without any of the tag-matching problems that the regex is giving you.
EDIT: In response to a comment here claiming that an XML parser is too slow:
There are two kinds of XML parser. One, called DOM, is big and heavy but easy and friendly; it builds a tree out of the document before you can do anything. The other, called SAX, is fast and light but more work; it reads the file sequentially. You will want SAX to find the body tag.
The DOM method is good for multiple uses, such as pulling tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because they both simply walk across the file and pattern match, with the exception that the regex won't quit looking after it has found a body tag, because regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.
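For C#, the closest built-in analogue to the sequential parser described above is the forward-only XmlReader (.NET has no API named SAX). The following is a minimal sketch under that assumption; the class and method names are hypothetical, and DtdProcessing.Ignore is an assumption made so that a DOCTYPE like the one in the question is skipped rather than rejected.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlReaderBodyExample
{
    // Returns the markup inside the first <body> element, or null if none is found.
    public static string ReadBodyInnerXml(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Assumption: skip the DOCTYPE instead of processing the DTD.
            DtdProcessing = DtdProcessing.Ignore
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            // Read forward, node by node, until an element named "body" is reached.
            if (reader.ReadToFollowing("body"))
            {
                // Serialize the children of <body> back to a string.
                return reader.ReadInnerXml();
            }
        }
        return null;
    }

    public static void Main()
    {
        string xhtml =
            "<html><head><title></title></head>" +
            "<body contenteditable=\"true\"><p>Example paragraph content</p></body></html>";
        Console.WriteLine(ReadBodyInnerXml(xhtml));
    }
}
```

Unlike the regex approaches, this requires the input to be well-formed XML. With the DTD ignored, named entities that the DTD would define cannot be resolved.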
Answer by Kev
/<body[^>]*>(.*)</body>/s
replace with the first capture group (the content between the body tags)
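In C#, the same idea might look like the sketch below, which reads the first capture group from a match instead of doing a replacement. The class and method names are hypothetical, and RegexOptions.Singleline is assumed to be the .NET counterpart of the /s flag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchBodyExample
{
    // Returns the text captured between <body ...> and the last </body>,
    // or null when the pattern does not match.
    public static string ExtractBody(string html)
    {
        // Singleline plays the role of /s: '.' also matches line breaks.
        Match m = Regex.Match(html, @"<body[^>]*>(.*)</body>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html =
            "<html>\n<head><title></title></head>\n" +
            "<body contenteditable=\"true\">\n<p>Example paragraph content</p>\n</body>\n</html>";
        Console.WriteLine(ExtractBody(html));
    }
}
```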
Answer by bezmax
Why can't you just split it by
</{0,1}body[^>]*>
and take the second string? I believe it will be much faster than looking for a huge regexp.
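A minimal C# sketch of this split-and-take-the-second-piece approach; the class and method names are hypothetical. The pattern has no capture groups, so Regex.Split() returns only the text around the two tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOnTagsExample
{
    // Splits on the opening and closing body tags and returns the piece
    // between them, or null if no body tag was found.
    public static string ExtractBody(string html)
    {
        // Expected pieces: [0] everything before <body ...>,
        // [1] the body content, [2] everything after </body>.
        string[] parts = Regex.Split(html, @"</{0,1}body[^>]*>");
        return parts.Length >= 2 ? parts[1] : null;
    }

    public static void Main()
    {
        string html =
            "<html>\n<head><title></title></head>\n" +
            "<body contenteditable=\"true\">\n<p>Example paragraph content</p>\n</body>\n</html>";
        Console.WriteLine(ExtractBody(html));
    }
}
```

No Singleline option is needed here, because [^>]* already matches line breaks inside a tag.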
Answer by avinash
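import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class BodyExtract {
    public static void main(String[] args) {
        String toMatch = "aaaaaaaaaaabcxx sldjfkvnlkfd <body>i m avinash</body>";
        Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", Pattern.DOTALL); // Java: DOTALL lets '.' match line breaks
        Matcher matcher = pattern.matcher(toMatch);
        if (matcher.matches()) {
            System.out.println(matcher.group(1)); // text between the body tags
        }
    }
}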
Answer by CrazyTim
Match the first body tag: <\s*body.*?>
Match the last body tag: <\s*/\s*body.*?>
(note: we account for spaces in the middle of the tags, which is completely valid markup btw)
Combine them together like this and you will get everything in between, including the body tags: <\s*body.*?>.*?<\s*/\s*body.*?>. And make sure you are using Singleline mode, which makes . match line breaks as well.
This works in VB.NET, and hopefully others too!
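The same combined pattern can be tried from C# as well. This is a minimal sketch with hypothetical class and method names. It returns the whole match, so the result includes the body tags themselves.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyWithTagsExample
{
    // Returns the body element including its opening and closing tags,
    // or null when the pattern does not match.
    public static string ExtractBodyElement(string html)
    {
        // Singleline: '.' also matches line breaks, so the lazy '.*?'
        // can span a multi-line body.
        Match m = Regex.Match(
            html, @"<\s*body.*?>.*?<\s*/\s*body.*?>", RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string html =
            "<html>\n<head><title></title></head>\n" +
            "<body contenteditable=\"true\">\n<p>Example paragraph content</p>\n</body>\n</html>";
        Console.WriteLine(ExtractBodyElement(html));
    }
}
```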