PHP: remove script tags from HTML content

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/7130867/

Time: 2020-08-26 02:06:02  Source: igfitidea

remove script tag from HTML content

php regex htmlpurifier

Asked by I-M-JM

I am using HTML Purifier (http://htmlpurifier.org/)

I only want to remove <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Answered by Dejan Marjanovic

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it is not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember: anything a user inputs should be considered unsafe.
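
To make the fragility concrete, here is a minimal sketch showing the pattern above removing a plain script block but missing one whose closing tag contains whitespace, a form browsers still accept. The sample strings and variable names are made up for illustration:

```php
<?php
// Same pattern as in the answer above.
$pattern = '#<script(.*?)>(.*?)</script>#is';

// Hypothetical sample inputs, not taken from the original question.
$plain  = '<p>a</p><script>alert(1)</script><p>b</p>';
$spaced = '<p>a</p><script>alert(1)</script ><p>b</p>';

$cleanedPlain  = preg_replace($pattern, '', $plain);
$cleanedSpaced = preg_replace($pattern, '', $spaced);

// The plain block is removed.
var_dump($cleanedPlain === '<p>a</p><p>b</p>'); // bool(true)

// "</script >" never matches the literal "</script>", so nothing is removed.
var_dump($cleanedSpaced === $spaced);           // bool(true)
```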

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can bork.
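
Because the snippet above loads a whole document, saveHTML() returns a full page, including the doctype and the html and body wrappers. If only a fragment should be cleaned, one possible approach is sketched below. The helper name stripScriptsFromFragment and the wrapper id are hypothetical, libxml_use_internal_errors() is used only to keep parser warnings about sloppy markup quiet, and the sketch assumes ASCII input with balanced div tags (character encoding needs separate handling):

```php
<?php
// Hypothetical helper, not part of the original answer: remove <script>
// elements from an HTML fragment and return the fragment without the
// doctype/html/body wrappers that saveHTML() adds for full documents.
function stripScriptsFromFragment(string $fragment): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment so its nodes can be found again after parsing.
    $dom->loadHTML('<div id="fragment-root">' . $fragment . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // The wrapper is the first div in document order; serialize its children only.
    $root = $dom->getElementsByTagName('div')->item(0);
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

echo stripScriptsFromFragment('<p>Hello <b>world</b></p><script>alert(1)</script>');
```

Passing a node to saveHTML() serializes just that node, which is what keeps the wrappers out of the result.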

Answered by Alex

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

// for each tag, remove it from the DOM; iterate backwards because the
// node list is live and shrinks as nodes are removed
for ($i = $length - 1; $i >= 0; $i--) {
  $tag = $script_tags->item($i);
  $tag->parentNode->removeChild($tag);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.
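
One detail worth knowing when removing nodes in a loop: the DOMNodeList returned by getElementsByTagName() is live, so it shrinks as nodes are removed. The following minimal sketch (the sample markup is made up) shows the length changing, plus one safe way to remove every match by always taking the first remaining item:

```php
<?php
$doc = new DOMDocument();
// Hypothetical sample markup with three script elements.
$doc->loadHTML('<html><body><script>a()</script><p>text</p><script>b()</script><script>c()</script></body></html>');

$scripts = $doc->getElementsByTagName('script');
$before = $scripts->length;   // 3

// Remove the first match; the live list updates immediately.
$first = $scripts->item(0);
$first->parentNode->removeChild($first);
$after = $scripts->length;    // 2

// Keep removing the first remaining item until the list is empty.
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}

var_dump($before, $after, $scripts->length);
```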

Answered by prasanthnv

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
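    // copy the live node list to an array first, so that removing nodes
    // while looping does not skip elements
    $elements = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($elements as $item){
        $item->parentNode->removeChild($item);
    }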
}
$html = $dom->saveHTML();

Answered by José Carlos PHP

A simple way, using string manipulation.

$str = stripStr($str, '<script', '</script>');

function stripStr($str, $ini, $fin)
{
    while(($pos = mb_stripos($str, $ini)) !== false)
    {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos).mb_substr($aux, mb_stripos($aux, $fin) + mb_strlen($fin));
    }

    return $str;
}

Answered by tech-e

An example modifying ctf0's answer. This should run preg_replace only once, but it also checks for errors and blocks the character codes for the forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;  

If you are using PHP 7 or later, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str); 

Answered by ClandestineCoder

I had been struggling with this question and discovered that you really only need one function: explode('>', $html). The common denominators of any tag are < and >; after that, it is usually quotation marks ("). Once you find the common denominator, extracting information is easy. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);

foreach($h as $k => $v){

    $v = trim($v);//clean it up a bit

    if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable

        $counter = $k;//match opening tag and start counter for backtrace

        }elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done

            $script_length = $k - $counter;

            $counter = 0;

            for($i = $script_length; $i >= 0; $i--){
                $h[$k-$i] = '';//backtrace and clear everything in between
                }
            }           
        }
$ht = [];
for($i = 0; $i < count($h); $i++){//use < rather than <= to stay inside the array
    if($h[$i] != ''){
        $ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
    }
}
$html = implode('>', $ht);//all scripts stripped.


echo $html;

I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.

I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.

Answered by Binh WPO

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

When using regex, things can go wrong, so it is safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

This way, when the "accident" happens, we get the original $html back instead of an empty result (preg_replace returns null on error).
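
One caveat with the short ternary, shown as a sketch with a made-up input: ?: falls back on any falsy value, and that includes the empty string that preg_replace() correctly returns when the whole input was a script block. The null coalescing operator ?? (PHP 7 or later) falls back only on null:

```php
<?php
$pattern = "/<script.*?\/script>/s";
// Hypothetical input that consists of nothing but a script block.
$html = '<script>alert(1)</script>';

// '' is falsy, so the short ternary hands back the original, unstripped input.
$withElvis = preg_replace($pattern, "", $html) ?: $html;

// '' is not null, so the stripped (empty) result is kept.
$withCoalesce = preg_replace($pattern, "", $html) ?? $html;

var_dump($withElvis === $html); // bool(true)
var_dump($withCoalesce === ''); // bool(true)
```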

Answered by ctf0

  • This is a merge of the answers by ClandestineCoder & Binh WPO.

The problem with the script tag arrows is that they can have more than one variant,

e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;)

So instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This will remove anything that looks like script.../script regardless of the arrow code/variant, and you can test it here: https://regex101.com/r/lK6vS8/1
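
Note that a pattern this loose also matches ordinary text that merely mentions the word script twice with a slash before the second occurrence. A small sketch with a made-up sentence:

```php
<?php
$pattern = '/script.*?\/script/ius';
// Hypothetical plain-text input that contains no script tag at all.
$text = 'The script lives in the /script directory.';

// Everything from the first "script" up to and including "/script" is removed.
$result = preg_replace($pattern, '', $text);

echo $result; // prints "The  directory."
```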

Answered by mae

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

Answered by Malvolio

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.