JavaScript: how to extract body contents using a regular expression
Notice: this page reproduces a popular StackOverFlow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/3628374/
how to extract body contents using regexp
Asked by faressoft
I have this code in a variable:
<html>
<head>
.
.
anything
.
.
</head>
<body anything="">
content
</body>
</html>
or
<html>
<head>
.
.
anything
.
.
</head>
<body>
content
</body>
</html>
The result should be:
content
Answered by Jeffrey Blake
Note that the string-based answers supplied above should work in most cases. The one major advantage offered by a regex solution is that you can more easily provide for a case-insensitive match on the opening and closing body tags. If that is not a concern to you, then there's no major reason to use regex here.
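For comparison, a string-based version might look like the sketch below. It is not copied from any of the other answers: the function name extractBodyByIndex is hypothetical, and it assumes the tags are written in lowercase, which is exactly the limitation described above.

```javascript
// Hypothetical helper: extracts the body contents with plain string methods.
// Assumes lowercase tags; "<BODY>" would not be found by indexOf.
function extractBodyByIndex(html) {
  const open = html.indexOf('<body');
  if (open === -1) return null;
  // Skip past the end of the opening tag, attributes included.
  const start = html.indexOf('>', open);
  if (start === -1) return null;
  // Use the last closing tag in the string as the end of the body.
  const end = html.lastIndexOf('</body>');
  if (end === -1 || end < start) return null;
  return html.slice(start + 1, end);
}
```

Handling mixed-case tags this way would need extra work (for example, searching a lowercased copy of the string), which is where the regex with the i flag becomes more convenient.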
And for the people who see HTML and regex together and throw a fit... Since you are not actually trying to parse HTML here, this is something you can do with regular expressions. If, for some reason, content contained </body>, then it would fail, but aside from that, your scenario is specific enough that regular expressions can do what you want:
// This line can be omitted: either assign your string to strVal,
// or pass your own string variable to the pattern.exec call below.
const strVal = yourStringValue;
const pattern = /<body[^>]*>((.|[\n\r])*)<\/body>/im;
const array_matches = pattern.exec(strVal);
After the above executes, array_matches[1] will hold whatever came between the opening <body> tag and the closing </body> tag.
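One detail the snippet leaves out is that exec returns null when the pattern does not match, so indexing the result directly would throw. A minimal sketch of a guarded wrapper follows. The function name extractBody and the sample string are hypothetical; the pattern itself is the one from the answer.

```javascript
// Hypothetical wrapper around the pattern from the answer.
function extractBody(html) {
  const pattern = /<body[^>]*>((.|[\n\r])*)<\/body>/im;
  const match = pattern.exec(html);
  // exec returns null when the pattern does not match.
  return match ? match[1] : null;
}

// Mixed-case tags still match because of the i flag.
const sample = '<html>\n<head>\n</head>\n<BODY class="x">\ncontent\n</Body>\n</html>';
console.log(extractBody(sample));
```

The captured text includes the newlines that surround content in the sample, so callers may want to call trim() on the result.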
Answered by Catalin Enache
// xhr is assumed to be an XMLHttpRequest instance whose request has completed.
var matched = xhr.responseText.match(/<body[^>]*>([\w\W]*)<\/body>/im);
alert(matched[1]);
Answered by Doug
I believe you can load your HTML document into the .NET HTMLDocument object and then simply read HTMLDocument.body.innerHTML.
I am sure there is an even easier way with the newer XDocument as well.
And just to echo some of the comments above: regex is not the best tool to use here, because HTML is not a regular language and there are some edge cases that are difficult to handle.
https://en.wikipedia.org/wiki/Regular_language
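Since the question is about JavaScript, the closest counterpart to the parser-based approach is probably the browser's DOMParser. The sketch below is an assumption about how it could be applied here, not part of the original answer. The name extractBodyWithParser is hypothetical, and because DOMParser exists in browsers but not in plain Node.js, the parser constructor is taken as a parameter.

```javascript
// Hypothetical helper: lets an HTML parser locate the body instead of a regex.
// ParserCtor defaults to the browser's DOMParser; in other environments the
// caller has to supply a compatible constructor.
function extractBodyWithParser(html, ParserCtor = globalThis.DOMParser) {
  if (typeof ParserCtor !== 'function') {
    throw new Error('No DOMParser available in this environment');
  }
  const doc = new ParserCtor().parseFromString(html, 'text/html');
  return doc.body.innerHTML;
}
```

In a browser, extractBodyWithParser(htmlString) can be called with a single argument. The parser normalizes the markup, so the returned string may differ slightly from the raw source text.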
Enjoy!

