Regular expression to match all HTML tags except <p> and </p>

Question

Asked by Xetius

I need to match and remove all tags using a regular expression in Perl. I have the following:

<\??(?!p).+?>

But this still matches the closing </p> tag. Any hint on how to exclude the closing tag as well?

Note, this is being performed on xhtml.

Answer 1

Accepted answer by Xetius

I came up with this:

<(?!\/?p(?=>|\s.*>))\/?.*?>

x/
<           # Match open angle bracket
(?!         # Negative lookahead (Not matching and not consuming)
    \/?     # 0 or 1 /
    p           # p
    (?=     # Positive lookahead (Matching and not consuming)
    >       # > - No attributes
        |       # or
    \s      # whitespace
    .*      # anything up to 
    >       # close angle brackets - with attributes
    )           # close positive lookahead
)           # close negative lookahead
            # if we have got this far then we don't match
            # a p tag or closing p tag
            # with or without attributes
\/?         # optional close tag symbol (/)
.*?         # and anything up to
>           # first closing tag
/

This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
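
A minimal sketch of applying the pattern above in a substitution. The sample string and the variable names are assumptions made up for illustration; they are not taken from the real source data.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to exercise the pattern.
my $html = '<div><p class="x">Hello <b>world</b></p><pre>code</pre></div>';

# The pattern from above: any tag that is not <p ...> or </p>.
my $not_p = qr{<(?!\/?p(?=>|\s.*>))\/?.*?>};

# Copy the input, then delete every matching tag from the copy.
(my $stripped = $html) =~ s/$not_p//g;

print "$stripped\n";   # <p class="x">Hello world</p>code
```

The <p> tag keeps its attribute, while <pre> is removed because the character after the "p" is neither ">" nor whitespace.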

Answer 2

Answer by John Siracusa

If you insist on using a regex, something like this will work in most cases:

# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

s{
  <             # opening angled bracket
  (?>/?)        # ratchet past optional / 
  (?:
    [^pP]       # non-p tag
    |           # ...or...
    [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*         # everything until closing angled bracket
  >             # closing angled bracket
 }{}gx; # replace with nothing, globally

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

use strict;

use HTML::TokeParser;

my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";

while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}

HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
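
As a rough sketch of such a wrapper, the loop above can be moved into a function that takes the HTML as a string and returns the result instead of printing it. The function name strip_tags_except_p and the sample input are hypothetical; the only library behaviour assumed is that HTML::TokeParser->new treats a scalar reference as the literal document to parse.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns $html with every tag except <p> and </p> removed.
sub strip_tags_except_p {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my $out = '';
    while (my $t = $parser->get_token) {
        my $type = $t->[0];

        # Skip start or end tags that are not "p" tags
        next if ($type eq 'S' || $type eq 'E') && lc($t->[1]) ne 'p';

        # Text tokens keep their text in slot 1; every other token keeps
        # its original source text in the last slot.
        $out .= $type eq 'T' ? $t->[1] : $t->[-1];
    }
    return $out;
}

print strip_tags_except_p('<div><p class="x">Hi <b>there</b></p></div>'), "\n";
```

Returning a string leaves it to the caller to decide whether to print the result, write it to a file, or process it further.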

Answer 3

Answer by Jörg W Mittag

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).

For example, this:

<HTML /
  <HEAD /
    <TITLE / > /
    <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>

But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

Answer 4

Answer by dbr

Not sure why you want to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the like)... but here is a regex to match HTML tags that aren't <p>:

(<[^pP].*?>|</[^pP]>)

Verbose:

(
    <               # < opening tag
        [^pP].*?    # a non-p character, then non-greedy anything
    >               # > closing tag
|                   #   ....or....
    </              # </
        [^pP]       # a non-p tag
    >               # >
)

Answer 5

Answer by y_nk

I used Xetius's regex and it works fine, except for some Flex-generated tags which can be:
with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:

<(?!\/?p(?=>|\s?.*>))\/?.*?>

I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:

<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
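
The hard-coded alternation can also be built from a list of tag names. The sketch below is a variation, not the exact pattern above: it assumes a hypothetical @keep list, and it replaces the \s?.* lookahead with a check that the tag name is followed by whitespace, "/" or ">", so that names such as pre are not mistaken for p.

```perl
use strict;
use warnings;

# Hypothetical whitelist of tag names to keep.
my @keep  = qw(p a b i u br);
my $names = join '|', map { quotemeta($_) } @keep;

# Match any tag whose name is not in the whitelist (case-insensitive).
my $strip = qr{<(?!/?(?:$names)(?=[\s/>]))[^>]*>}i;

# Sample input made up for the example.
my $html = '<p>One<br/>two <a href="x">link</a> <span>s</span> <pre>code</pre></p>';
(my $clean = $html) =~ s/$strip//g;

print "$clean\n";   # <p>One<br/>two <a href="x">link</a> s code</p>
```

Adding or removing an allowed tag then only means editing the list, not the pattern.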

Answer 6

Answer by zx81

Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the disclaimers about using regex to parse html, here is a simple way to do it.

#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;  # keep p tags (group 1), drop every other tag
print $replaced . "\n";

See this live demo

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...

Answer 7

Answer by DrPizza

Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.

Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.

So as I say, I don't really think regexps are the right tool for the job.

Answer 8

Answer by Konrad Rudolph

Since HTML is not a regular language

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Answer 9

Answer by Brian Warshaw

Assuming that this will work in PERL as it does in languages that claim to use PERL-compatible syntax:

/<\/?[^p][^>]*>/

EDIT:

But that won't match a <pre> or <param> tag, unfortunately.

This, perhaps?

/<\/?(?!p>|p )[^>]+>/

That should cover tags that have attributes, too.

Answer 10

Answer by Kibbee

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
