How can I strip HTML in a string using Perl?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1067414/

Time: 2020-08-29 00:04:00  Source: igfitidea

How can I strip HTML in a string using Perl?

html regex perl strip

Asked by ParoX

Is there an easier way than this to strip HTML from a string using Perl?

$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;

I would appreciate a slimmed-down regular expression, e.g. something like this:

$Error_Msg =~ s|</?[b|h1|br]>||ig;

Is there an existing Perl function that strips any/all HTML from a string, even though I only need bolds, h1 headers and br stripped?

Answered by Abhinav Gupta

Assuming the code is valid HTML (no stray < or > characters):

$htmlCode =~ s|<.+?>||g;

If you need to remove only bold tags, h1 headers, and br tags:

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;
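
To make the difference between a character class and an alternation group concrete, here is a minimal sketch. The sample message is made up for illustration, and the pattern uses `[^>]*` and the `i` flag as assumptions on top of the answer's regex so that attributes and upper-case tags are also covered.

```perl
use strict;
use warnings;

# Hypothetical sample message; not taken from the original question.
my $Error_Msg = '<h1>Error</h1><B>Disk full</B><br>Try <big>again</big><br />';

# A character class such as [b|h1|br] matches ONE character from the set
# {b, |, h, 1, r}, so it cannot match multi-character names like "h1" or "br".
(my $with_class = $Error_Msg) =~ s{</?[b|h1|br]>}{}ig;

# A non-capturing group with alternation matches whole tag names. The \b keeps
# "b" from matching the start of "big", and [^>]* allows attributes or " /".
(my $with_group = $Error_Msg) =~ s{</?(?:b|h1|br)\b[^>]*>}{}ig;

print "$with_class\n";   # only <B> and </B> were removed
print "$with_group\n";   # b, h1 and br tags removed; <big> is left alone
```

Note that `\b` also sees a boundary before a hyphen, so a custom element such as `<b-x>` would still be matched by this pattern.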

You might also want to consider the HTML::Strip module.

Answered by brian d foy

From perlfaq9: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
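
The failure modes just described can be reproduced with a few lines of core Perl. This is only a sketch: the input reuses the IMG examples from this answer, and the `%entity` map is a hypothetical stand-in that handles four named entities, not a full entity decoder.

```perl
use strict;
use warnings;

# One tag with a quoted ">", an entity, and one tag that spans two lines.
my $html = qq{<IMG SRC = "foo.gif" ALT = "A > B">1 &lt; 2\n<IMG SRC = "foo.gif"\n ALT = "x">done};

# The match for the first tag stops at the quoted ">", leaving ' B">' behind.
# Without /s, "." cannot cross the newline, so the second tag is not removed.
(my $naive = $html) =~ s/<.*?>//g;
print "$naive\n";

# Entities are untouched by tag stripping and need their own pass.
# %entity is a hypothetical, deliberately tiny lookup table.
my %entity = (lt => '<', gt => '>', quot => '"', amp => '&');
(my $decoded = '1 &lt; 2 &amp;&amp; 3 &gt; 2') =~ s/&(lt|gt|quot|amp);/$entity{$1}/g;
print "$decoded\n";
```

Decoding in a single s///g pass avoids double decoding, since replaced text is not rescanned.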

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
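
As a rough illustration of how a quote-aware pattern of this kind behaves, the sketch below wraps it in a subroutine. The name `strip_tags` is hypothetical, and the pattern is written with a `\1` backreference so that a quoted string ends at the same kind of quote that opened it.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name used only for this sketch.
sub strip_tags {
    my ($html) = @_;
    # Inside a tag, consume either characters that are not ">" or a quote,
    # or a whole quoted string closed by the same quote that opened it (\1).
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# A quoted ">" and a tag split over two lines are both handled.
my $one_line = strip_tags('<IMG SRC = "foo.gif" ALT = "A > B">text');
my $two_line = strip_tags(qq{<IMG SRC = "foo.gif"\n ALT = "A > B">text});

# Script content is still mangled: "<b && a>" looks like a tag to the regex.
my $script = strip_tags('<script>if (a<b && a>c)</script>');

print "$one_line\n$two_line\n$script\n";
```

The script case shows why the pattern only "works for most files": it has no notion of elements whose content is not markup.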

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz.

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
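
One possible way to cope with this particular case is to remove comments in a separate pass before stripping tags. The sketch below assumes well-formed comments and is not a substitute for a real parser such as HTML::Parser; it only shows the difference between one pass and two on the commented-out section above.

```perl
use strict;
use warnings;

my $html = <<'END_HTML';
before
<!-- This section commented out.
    <B>You can't see me!</B>
-->
after
END_HTML

# Tags only: the match that starts at "<!--" ends at the ">" of <B>,
# so the hidden text and the closing "-->" leak into the output.
(my $tags_only = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Two passes: drop whole comments first (non-greedy, /s so "." spans lines),
# then strip whatever tags remain.
(my $two_pass = $html) =~ s/<!--.*?-->//gs;
$two_pass =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print $tags_only, "----\n", $two_pass;
```

The comment pass would itself misfire if the text "-->" appeared inside a script or an attribute value, which is one more reason the parser-based modules are the safer choice.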

Answered by Juan A. Navarro

You should definitely have a look at HTML::Restrict, which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'

I would recommend staying away from HTML::Strip because it breaks UTF-8 encoding.