PHP: remove script tags from HTML content

Question

Asked by I-M-JM

I am using HTML Purifier (http://htmlpurifier.org/)

I just want to remove <script> tags only. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Answer 1

Answered by Dejan Marjanovic

Because this question is tagged with regex, I'm going to answer with the poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
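
To make the warning concrete, here is a minimal sketch of one way the regex above can be defeated. The input string is a made-up example (an assumption, not part of the original answer): a decoy script block is placed inside the letters of a real script tag, so removing the decoy reassembles the real one.

```php
<?php
// Hypothetical attacker-controlled input: a decoy <script> block sits inside
// the letters of a real <script> tag.
$html = '<scr<script>x</script>ipt>alert(1)</script>';

// Same pattern as in the answer, applied once.
$once = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
echo $once, "\n"; // <script>alert(1)</script>

// Repeating the replacement until nothing changes closes this particular hole,
// but it is still no substitute for a real sanitizer.
$clean = $html;
do {
    $before = $clean;
    $clean = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $clean);
} while ($clean !== null && $clean !== $before);

var_dump($clean); // string(0) ""
```

A single pass leaves a working script tag behind, which is why this approach should be limited to markup you trust.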

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable, and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can break.
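
Since the question asks to leave everything else untouched, the following is a hedged sketch of how the same DOMDocument approach might be applied to an HTML fragment without having a doctype and html/body wrappers added to the result. The helper name stripScriptsFromFragment is hypothetical, the sample input is invented, and the code assumes PHP 7 or later with a libxml build that supports the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options.

```php
<?php
// Hypothetical helper, not part of the original answer.
function stripScriptsFromFragment(string $fragment): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment in a single <div> so the document has one root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = stripScriptsFromFragment('<p>Some <b>bold</b> text</p><script>alert(1)</script>');
echo $result, "\n";
```

The wrapper div is a design choice: without an implied html/body, the parser needs exactly one root element to attach the fragment to.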

Answer 2

Answered by Alex

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

// for each tag, remove it from the DOM; the node list is live, so walk it
// backwards to keep the remaining indexes valid while nodes are removed
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
  $tag = $script_tags->item($i);
  $tag->parentNode->removeChild($tag);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me with the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
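
One detail worth knowing when removing nodes in a loop: getElementsByTagName() returns a live list, so its length and indexes change as nodes are removed. A minimal sketch of an idiom that sidesteps index bookkeeping is to keep removing the first item until the list is empty. The sample markup is invented for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><script>a()</script><script>b()</script><p>text</p></body></html>');

$scripts = $dom->getElementsByTagName('script');
echo $scripts->length, "\n"; // 2

// The list shrinks after every removal, so always take the first item.
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}

echo $scripts->length, "\n"; // 0: the same list object reflects the removals
```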

Answer 3

Answered by prasanthnv

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
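    // getElementsByTagName() returns a live list, so copy it to an array
    // before removing nodes; otherwise elements can be skipped
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }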
}
$html = $dom->saveHTML();

Answer 4

Answered by José Carlos PHP

A simple way, by manipulating the string.

$str = stripStr($str, '<script', '</script>');

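function stripStr($str, $ini, $fin)
{
    while(($pos = mb_stripos($str, $ini)) !== false)
    {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $end = mb_stripos($aux, $fin);
        if($end === false)
        {
            // no closing delimiter: drop everything from the opening one onward
            return mb_substr($str, 0, $pos);
        }
        $str = mb_substr($str, 0, $pos).mb_substr($aux, $end + mb_strlen($fin));
    }

    return $str;
}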

Answer 5

Answered by tech-e

An example modifying ctf0's answer. This should run preg_replace only once, but it also checks for errors and blocks the character codes for a forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str);
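
For context on when the null check matters: preg_replace() returns null when the regex engine reports an error, and one case that triggers this with the u modifier is a subject that is not valid UTF-8. The sketch below uses an invented Latin-1 input to show that case, and also that falling back to the original string leaves the script in place.

```php
<?php
$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';

// "\xE9" is a Latin-1 byte, so this string is not valid UTF-8.
$bad = "caf\xE9 <script>alert(1)</script>";

$result = preg_replace($pattern, '', $bad);
var_dump($result === null);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Falling back to the original string keeps the script in the output.
$clean = $result ?? $bad;
var_dump(strpos($clean, 'alert(1)') !== false);      // bool(true)
```

If the input is untrusted, rejecting it or returning an empty string on error may be a safer fallback than returning it unchanged.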

Answer 6

Answered by ClandestineCoder

I had been struggling with this question and discovered that you only really need one function: explode('>', $html). The single common denominator of any tag is < and >; after that, it's usually quotation marks ("). Once you find the common denominator, you can extract information easily. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);
$counter = null; // index of the chunk that opens the current <script> tag

foreach ($h as $k => $v) {
    $v = trim($v); // clean it up a bit

    if (preg_match('/^<script/i', $v)) {
        // opening tag: remember where the script starts
        $counter = $k;
    } elseif ($counter !== null && preg_match('/<\/script$/i', $v)) {
        // closing tag: backtrace and clear everything in between
        for ($i = $counter; $i <= $k; $i++) {
            $h[$i] = null;
        }
        $counter = null;
    }
}

// drop the cleared chunks so the implode works right
$ht = array_filter($h, function ($chunk) { return $chunk !== null; });
$html = implode('>', $ht); // all scripts stripped

echo $html;

I see this really only working for script tags, because you will never have nested script tags. Of course, you can easily add more code that does the same check and gathers nested tags.

I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.

Answer 7

Answered by Binh WPO

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

When doing regex, things might go wrong, so it's safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;

preg_replace returns null on error, so when the "accident" happens we get the original $html back instead of null. (The ?? operator requires PHP 7. The ?: shorthand also works on older versions, but it falls back to the original $html whenever the result is empty as well.)

Answer 8

Answered by ctf0

This is a merge of both ClandestineCoder's and Binh WPO's answers.

The problem with the script tag's angle brackets (arrows) is that they can have more than one variant,

e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;)

so instead of creating a pattern array with a bazillion variants, IMHO a better solution would be

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This will remove anything that looks like script.../script regardless of the angle-bracket code/variant, and you can test it here: https://regex101.com/r/lK6vS8/1
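
As a rough illustration of what "anything that looks like script.../script" means in practice, the sketch below runs the same pattern over three invented inputs. It shows that the brackets around the match are left behind, and that ordinary text mentioning "script" before a "/script" path is removed too.

```php
<?php
$pattern = '/script.*?\/script/ius';

// Entity-encoded tag: the match is removed, the encoded brackets stay behind.
$encoded = preg_replace($pattern, '', '&lt;script&gt;alert(1)&lt;/script&gt;');
echo $encoded, "\n"; // &lt;&gt;

// Plain tag: an empty "<>" is left in the output.
$plain = preg_replace($pattern, '', '<b>hi</b><script>alert(1)</script>');
echo $plain, "\n"; // <b>hi</b><>

// Harmless prose that mentions "script" before a path containing "/script"
// is eaten as well.
$prose = preg_replace($pattern, '', 'See the script guide at /docs/scripting');
echo $prose, "\n"; // See the ing
```

This breadth is the trade-off of the approach: it catches encoded variants, at the cost of false positives in regular text.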

Answer 9

Answered by mae

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

Answer 10

Answered by Malvolio

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.
