Notice: this page is a translation of a popular StackOverFlow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/181095/

Date: 2020-08-28 22:29:51  Source: igfitidea

regular expression to extract text from HTML

html regex html-content-extraction text-extraction

Asked by Ron Harlev

I would like to extract from a general HTML page, all the text (displayed or not).

I would like to remove

  • Any HTML tags
  • Any javascript
  • Any CSS styles

Is there a regular expression (one or more) that will achieve that?

Accepted answer by S.Lott

You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text> will work in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
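
As a rough illustration of the parser-based approach recommended in this answer, here is a minimal sketch. It uses Python's standard-library html.parser rather than Beautiful Soup (an assumption made so the example has no third-party dependencies), and the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes while ignoring <script> and <style> content."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the concatenated text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because the parser tracks when it is inside a script or style element, a stray < inside JavaScript does not confuse it, which is the kind of case a naive regular expression tends to get wrong.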

Answer by nickf

Remove javascript and CSS:

<(script|style).*?</\1>

Remove tags:

<.*?>
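
A minimal Python sketch of how these two patterns might be applied in sequence. It assumes the first pattern closes with a backreference to the opening tag name, and that "dot matches newline" and case-insensitive matching are wanted; the function name strip_markup is hypothetical.

```python
import re


def strip_markup(html):
    """Drop script/style blocks first, then every remaining tag."""
    flags = re.DOTALL | re.IGNORECASE
    # \1 refers back to whichever of "script" or "style" opened the block.
    without_code = re.sub(r"<(script|style).*?</\1>", "", html, flags=flags)
    return re.sub(r"<.*?>", "", without_code, flags=flags)
```

The order matters: if the tags were stripped first, the JavaScript and CSS source would be left behind as plain text.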

Answer by Joe Bergevin

I needed a regex solution (in PHP) that would return the plain text just as well as (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:

function plaintext($html)
{
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#i', ' </li>', $plaintext);

    // remove all script and style tags together with their content
    $plaintext = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $plaintext);

    // replace br tags with a space (strip_tags would remove them without leaving one)
    $plaintext = preg_replace('#<br[^>]*?>#i', ' ', $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

When I tested this on some complicated sites (forums seem to contain some of the tougher html to parse), this method returned the same result as PHPSimpleDOM plaintext, only much, much faster. It also handled the list items (li tags) properly, where PHPSimpleDOM did not.

As for the speed:

  • SimpleDom: 0.03248 sec.
  • RegEx: 0.00087 sec.

37 times faster!

Answer by Chris Noe

Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:

//body//text()[not(ancestor::script)][not(ancestor::style)]
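
To show what that expression selects, here is a rough Python sketch. The standard-library xml.etree.ElementTree does not support text() or the ancestor axis, so the sketch walks the tree by hand instead of evaluating the XPath directly (a swapped-in technique, not the XSLT approach itself). It assumes the input is well-formed XHTML, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag):
    """Strip an XML namespace such as {http://www.w3.org/1999/xhtml}."""
    return tag.rsplit("}", 1)[-1].lower()


def text_nodes(elem):
    """Yield text nodes under elem, skipping script/style subtrees."""
    if local_name(elem.tag) in SKIPPED:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from text_nodes(child)
        # Text after a child's closing tag belongs to the parent element.
        if child.tail:
            yield child.tail


def body_text_nodes(xhtml):
    """Return the list of text nodes found under <body>."""
    root = ET.fromstring(xhtml)
    nodes = []
    for elem in root.iter():
        if local_name(elem.tag) == "body":
            nodes.extend(text_nodes(elem))
    return nodes
```

Like the XPath expression, this only works when the document parses as XML; ordinary tag-soup HTML would raise a parse error.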

Answer by Matthew Scharley

Using perl syntax for defining the regexes, a start might be:

m!<body.*?>(.*)</body>!smi

Then apply the following replacements to the result of that group:

s!<script.*?</script>!!smig
s!<[^>]+/[ \t]*>!!smig
s!</?([a-z]+).*?>!!smig
s/<!--.*?-->//smig

This of course won't format things nicely as a text file, but it strips out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text from that.
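
For readers who do not use Perl, the same sequence of steps might look roughly like this in Python. This is a sketch of the regex pipeline described in this answer, not of the parser approach it recommends; the function name body_to_text is hypothetical, and falling back to the whole input when no body element is found is an added assumption.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# Applied in this order to the captured body content.
REPLACEMENTS = (
    r"<script.*?</script>",  # script blocks and their content
    r"<[^>]+/[ \t]*>",       # self-closing tags such as <br/>
    r"</?([a-z]+).*?>",      # ordinary opening and closing tags
    r"<!--.*?-->",           # comments
)


def body_to_text(html):
    """Capture the body content, then delete each pattern in turn."""
    match = re.search(r"<body.*?>(.*)</body>", html, FLAGS)
    content = match.group(1) if match else html
    for pattern in REPLACEMENTS:
        content = re.sub(pattern, "", content, flags=FLAGS)
    return content
```

As the answer notes, nothing here inserts whitespace where tags used to be, so words on either side of a removed tag end up joined.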

Answer by David Avsajanishvili

The simplest way for simple HTML (example in Python):

import re
text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
# Split the input into tags and text runs, then keep only the text runs.
" ".join(t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t)

Returns this:

'This is my> example HTML, containing tags'

Answer by Ayush

Here's a function to remove even the most complex HTML tags.

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',

            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern above, in the same order.
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n$0", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0",
            "\n$0", "\n$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}

Answer by Shiroy

Can't you just use the WebBrowser control available in C#?

        System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
        wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
        System.Windows.Forms.HtmlDocument h = wc.Document;
        Console.WriteLine(h.Body.InnerText);

Answer by mahesh

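// Requires: using System.Text.RegularExpressions;
// "html" is assumed to hold the contents of your_htmlfile.html as a string.
string decode = System.Web.HttpUtility.HtmlDecode(html);
Regex objRegExp = new Regex("<(.|\n)+?>");
string replace = objRegExp.Replace(decode, "");
// "k" is not defined in the original post; presumably any extra substring to strip.
replace = replace.Replace(k, string.Empty);
// Trim returns a new string, so the result has to be assigned back.
replace = replace.Trim("\t\r\n ".ToCharArray());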

Then take a label and set "label.Text = replace;" to see the output on the label.

Answer by Robert Elwell

If you're using PHP, try Simple HTML DOM, available at SourceForge.

Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as & (which is &amp;).

Also, watch out for comments and Javascript, as I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
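
The point about special characters being left in can be illustrated with a short Python sketch. It assumes a very simple tag-stripping pattern and uses the standard-library html.unescape to decode entities afterwards; the function name strip_tags_and_decode is hypothetical.

```python
import html
import re


def strip_tags_and_decode(markup):
    """Remove tags first, then turn entities like &amp; back into characters."""
    no_tags = re.sub(r"<[^>]+>", "", markup)
    return html.unescape(no_tags)
```

Decoding after stripping is deliberate: if entities were decoded first, an escaped &lt;cheap&gt; would turn into something that looks like a tag and be removed along with the real markup.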
