javascript: Remove everything inside script and style tags

Question

Asked by jkushner

I have a variable named $articleText and it contains HTML code. There is script and style code within <script> and <style> HTML elements. I want to scan $articleText and remove these pieces of code. If I can also remove the actual HTML elements <script>, </script>, <style> and </style>, I would do that too.

I imagine I need to use regex; however, I am not skilled in it.

Can anyone assist?

I wish I could provide some code but, like I said, I am not skilled in regex, so I don't have anything to show.

I cannot use DOM. I specifically need to use regex against these specific tags.

Answer 1

Answered by Ωmega

Although regex is not a good tool for this kind of task, it may work for small, simple cases.

If you want to remove just the inner text of the tag(s), use:

$txt = preg_replace('/(<(script|style)\b[^>]*>).*?(<\/\2>)/is', "$1$3", $txt);

If you also want to remove the tags themselves, the replacement string in the above code should be empty, so just "".
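
The two variants described above can be compared side by side. This is only a minimal sketch: the sample input and the names $pattern, $innerRemoved and $allRemoved are made up for illustration, and the pattern assumes the closing tag is matched with a backreference to the opening tag name.

```php
<?php
// Hypothetical sample input; $txt mirrors the variable name used in the answer.
$txt = '<p>Hi</p><script type="text/javascript">alert(1);</script><STYLE>p { color: red; }</STYLE>';

// \2 refers back to whichever tag name (script or style) opened the element.
// The i flag ignores case; the s flag lets . match newlines.
$pattern = '/(<(script|style)\b[^>]*>).*?(<\/\2>)/is';

// Variant 1: keep the tags, drop only what is between them.
$innerRemoved = preg_replace($pattern, '$1$3', $txt);

// Variant 2: drop the tags together with their contents.
$allRemoved = preg_replace($pattern, '', $txt);

echo $innerRemoved, "\n";
echo $allRemoved, "\n";
```

Like any regex approach, this only handles well-formed, non-nested tags.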

Answer 2

Answered by Chris Baker

Do not use RegEx on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.

<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);

removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);

// output cleaned html
echo $doc->saveHtml();

function removeElementsByTagName($tagName, $document) {
  $nodeList = $document->getElementsByTagName($tagName);
  for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
    $node = $nodeList->item($nodeIdx);
    $node->parentNode->removeChild($node);
  }
}

You can try it here: https://eval.in/private/4f225fa0dcb4eb

Documentation

DomDocument - http://php.net/manual/en/class.domdocument.php
DomNodeList - http://php.net/manual/en/class.domnodelist.php
DomDocument::getElementsByTagName - http://us3.php.net/manual/en/domdocument.getelementsbytagname.php

Answer 3

Answered by zamnuts

Here's sample data:

$in = '
<html>
    <head>
        <script type="text/javascript">window.location="somehwere";</script>
        <style>
            .someCSS {border:1px solid black;}
        </style>
    </head>
    <body>
        <p>....</p>
        <div>
            <script type="text/javascript">document.write("bad stuff");</script>
        </div>
        <ul>
            <li><style type="text/css">#moreCSS {font-weight:900;}</style></li>
        </ul>
    </body>
</html>';

And now the spelled-out version:

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

removeByTag($dom,'style');
removeByTag($dom,'script');

var_dump($dom->saveHTML());

function removeByTag($dom,$tag) {
    $nodeList = $dom->getElementsByTagName($tag);
    removeAll($nodeList);
}

function removeAll($nodeList) {
    for ( $i = $nodeList->length; --$i >=0; ) {
        removeSelf($nodeList->item($i));
    }
}

function removeSelf($node) {
    $node->parentNode->removeChild($node);
}

And an alternate (does the same thing, just no function declarations):

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

for ( $list = $dom->getElementsByTagName('script'), $i = $list->length; --$i >=0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}

for ( $list = $dom->getElementsByTagName('style'), $i = $list->length; --$i >=0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}

var_dump($dom->saveHTML());

The trick is to iterate backwards when deleting nodes. And getElementsByTagName will traverse the entire DOM for you, so you don't have to (none of that hasChildNodes, nextSibling, nextChild stuff).

Perhaps the best solution is somewhere in between those two extreme examples.

Couldn't help myself; this is probably the best version of my suggestions. It doesn't need a counter ($i) to muck things up, and simply keeps removing the first remaining match until none are left:

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

removeElementsByTagName($dom,'script');
removeElementsByTagName($dom,'style');

function removeElementsByTagName($dom,$tagName) {
    $list = $dom->getElementsByTagName($tagName);
    while ( $node = $list->item(0) ) {
        $node->parentNode->removeChild($node);
    }
}

var_dump($dom->saveHTML());

The node list is live, so as you remove nodes the remaining ones move up in the list: 1 becomes 0, 2 becomes 1, etc. Keep doing this (while) until there aren't any more (->item returns null). I also wrapped this in a reusable function.
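
To see why the direction of iteration matters, here is a small sketch under stated assumptions: the four-paragraph document and the variable names ($leftAfterForward, $leftAfterBackward) are invented for illustration. It compares a naive forward loop with the backward loop used in the examples above, relying on the fact that the list returned by getElementsByTagName is live.

```php
<?php
// Hypothetical four-paragraph document, used only to show the effect.
$html = '<html><body><p>a</p><p>b</p><p>c</p><p>d</p></body></html>';

// Forward loop: the live list shrinks while $i grows, so every other node is skipped.
$dom = new DOMDocument();
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('p');
for ($i = 0; $i < $list->length; $i++) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterForward = $dom->getElementsByTagName('p')->length;

// Backward loop: removing from the end never shifts the items still to be visited.
$dom2 = new DOMDocument();
$dom2->loadHTML($html);
$list2 = $dom2->getElementsByTagName('p');
for ($i = $list2->length; --$i >= 0; ) {
    $node = $list2->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterBackward = $dom2->getElementsByTagName('p')->length;

echo "forward loop left $leftAfterForward, backward loop left $leftAfterBackward\n";
```

The forward loop leaves two of the four paragraphs behind, while the backward loop removes them all.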

Answer 4

Answered by Pappa

I think this should do what you need (assuming there are no nested script and style tags):

$articleText = preg_replace('/<script\b[^>]*>.*?<\/script>|<style\b[^>]*>.*?<\/style>/is', '', $articleText);
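
One detail worth checking with a pattern of this shape is an empty element such as <script src="..."></script>. The sketch below uses a made-up input and hypothetical variable names to compare the lazy quantifiers .+? (at least one character) and .*? (zero or more).

```php
<?php
// Hypothetical input: an empty external script, some content, then an inline script.
$articleText = '<script src="a.js"></script><p>keep</p><script>x()</script>';

// .+? needs at least one character between the tags, so on the empty element
// it runs on to the NEXT closing tag and swallows the paragraph in between.
$withPlus = preg_replace('/<script[^>]*>.+?<\/script>/s', '', $articleText);

// .*? also accepts an empty body, so each element is matched on its own.
$withStar = preg_replace('/<script[^>]*>.*?<\/script>/s', '', $articleText);

var_dump($withPlus);
var_dump($withStar);
```

With this input the .+? version returns an empty string, while the .*? version keeps the paragraph.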

Answer 5

Answered by Web and Flow

Assuming the goal is both to keep random styles from messing up your design and to secure your site against user scripting, removing these tags alone will not keep you safe.

Consider the case of event attributes (e.g. onmouseover, onclick):

<h1 onclick="console.log('user made this happen');">User Scripting Test</h1>

or, even worse:

<h1 onclick='function addCSSRule(a,b,c,d){"insertRule"in a?a.insertRule(b+"{"+c+"}",d):"addRule"in a&&a.addRule(b,c,d)}var style=document.createElement("style");style.appendChild(document.createTextNode("")),document.head.appendChild(style),sheet=style.sheet,addCSSRule(sheet,"*","color: #ff0!important");'>Messing with your styles!</h1>

With this, it's fairly trivial to start inserting all sorts of stuff into the document.
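
As a quick illustration of the point, the sketch below strips script and style elements with a simple regex and then checks whether an inline handler survived. The input markup and variable names are hypothetical, and the regex is just one possible tag-stripping pattern, not a recommended sanitizer.

```php
<?php
// Hypothetical user-supplied markup: one script element plus an inline event handler.
$html = '<h1 onclick="alert(1)">Title</h1><script>alert(2)</script>';

// Strip <script> and <style> elements; \1 matches the same tag name that opened.
$stripped = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', '', $html);

// The script element is gone, but the onclick attribute is untouched.
$stillHasHandler = strpos($stripped, 'onclick=') !== false;

echo $stripped, "\n";
var_dump($stillHasHandler);
```

The output still contains the onclick attribute, so the markup can still run script.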

The last example of stylesheet modification is taken from David Walsh - https://davidwalsh.name/add-rules-stylesheets

The only solution

... is to use a proven third-party library that specializes in this. I suggest HTML Purifier. It'll rid your user input of styles, scripts, and pesky event attributes.

Answer 6

Answered by Curt

A regex to do this would be incredibly obtuse because of the possibility of tags within tags and confounding constructs such as tag attributes.

I would suggest doing this with a DOM parser (either in PHP or JavaScript), which can identify and remove the undesired tags through actual parsing.
