Original question: http://stackoverflow.com/questions/6659351/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Removing all script tags from html with JS Regular Expression
Asked by Kennedy
I want to strip the script tags out of this HTML on pastebin.
I tried using the regular expression below:
html.replace(/<script.*>.*<\/script>/ims, " ")
But it does not remove all script tags in the HTML. It only removes inline scripts. Please, I need a regex that can remove all script tags (inline and multi-line). I would highly appreciate it if a test were carried out on my sample: http://pastebin.com/mdxygM0a
Thanks
Answer by ThiefMaster
jQuery uses a regex to remove script tags in some cases, and I'm pretty sure its devs had a damn good reason to do so. Probably some browser does execute scripts when inserting them using innerHTML.
Here's the regex:
/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi
And before people start crying "but regexes for HTML are evil": yes, they are, but for script tags they are safe because of the special behaviour: a <script> section may not contain </script> at all unless it should end at this position. So matching it with a regex is easily possible. However, from a quick look the regex above does not account for trailing whitespace inside the closing tag, so you'd have to test whether a closing tag with whitespace before the > (such as </script >) will still work.
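As a minimal sketch of how the regex above behaves on both inline and multi-line script blocks: the sample string and the names SCRIPT_RE, html and cleaned are made up for illustration and are not taken from the pastebin sample.

```javascript
// The regex quoted above (attributed to jQuery in this answer).
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical input: one inline script and one multi-line script.
const html = [
  '<p>before</p>',
  '<script>var a = 1;</script>',
  '<script type="text/javascript">',
  '  if (1 < 2) { run(); }',
  '</script>',
  '<p>after</p>'
].join('\n');

// [^<] also matches newlines, so multi-line scripts are covered,
// and the lookahead lets a bare "<" inside the script body through.
const cleaned = html.replace(SCRIPT_RE, '');
console.log(JSON.stringify(cleaned)); // "<p>before</p>\n\n\n<p>after</p>"
```

Only the script elements are removed; the newlines that surrounded them stay in the output.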
Answer by RobG
Attempting to remove HTML markup using a regular expression is problematic. You don't know what's in there as script or attribute values. One way is to insert it as the innerHTML of a div, remove any script elements and return the innerHTML, e.g.
function stripScripts(s) {
  // Let the browser parse the markup inside a detached div
  var div = document.createElement('div');
  div.innerHTML = s;
  // getElementsByTagName returns a live collection, so remove from the end
  var scripts = div.getElementsByTagName('script');
  var i = scripts.length;
  while (i--) {
    scripts[i].parentNode.removeChild(scripts[i]);
  }
  return div.innerHTML;
}
alert(
  stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
);
Note that at present, browsers will not execute the script if it is inserted using the innerHTML property, and likely never will, especially as the element is not added to the document.
Answer by Conrad Damon
Regexes are beatable, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to put it in a loop to handle something like:
<scr<script>Ha!</script>ipt> alert(document.cookie);</script>
Here's what I did, using the jQuery regex from above:
var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
// Repeat until nothing matches, since removing one match can assemble a new script tag
while (SCRIPT_REGEX.test(text)) {
  text = text.replace(SCRIPT_REGEX, "");
}
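To see why the loop matters, here is a small self-contained sketch run on the nested example above. The helper name stripScriptsRepeatedly is hypothetical, and the loop is written as "repeat until the text stops changing" rather than with test(), which is a variation on the code above and not the exact original.

```javascript
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep stripping until a pass changes nothing.
function stripScriptsRepeatedly(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_RE, '');
  } while (text !== previous);
  return text;
}

const nested = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';

// A single pass removes the inner tag and thereby assembles a new one.
console.log(nested.replace(SCRIPT_RE, '')); // <script> alert(document.cookie);</script>

// Repeating until stable removes the reassembled tag as well.
console.log(JSON.stringify(stripScriptsRepeatedly(nested))); // ""
```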
Answer by spaark
This regex should work too:
<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>
It even allows "problematic" strings like these inside the script:
<script type="text/javascript">
var test1 = "</script>";
var test2 = '\'</script>';
var test1 = "\"</script>";
var test1 = "<script>\"";
var test2 = '<scr\'ipt>';
/* </script> */
// </script>
/* ' */
// var foo=" '
</script>
It seems that jQuery and Prototype fail on these...
Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.
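A rough comparison of the two regexes on one of the "problematic" cases, as a sketch only. It assumes the string-literal parts of this answer's regex are written with the escapes \\. and [^"\\], and the g and i flags are added here for use with replace(); the names STRING_AWARE_RE, SIMPLE_RE and tricky are illustrative.

```javascript
// This answer's regex, assuming the escapes \\. and [^"\\] in the string parts.
const STRING_AWARE_RE = /<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>/gi;

// The jQuery-style regex quoted earlier on this page, for comparison.
const SIMPLE_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// A closing tag hidden inside a JavaScript string literal.
const tricky = '<script>var a = "</script>";</script>after';

// The string-aware regex skips over the quoted "</script>".
console.log(tricky.replace(STRING_AWARE_RE, '')); // after

// The simpler regex stops at the first </script> it sees.
console.log(tricky.replace(SIMPLE_RE, '')); // ";</script>after
```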
Answer by neongrau
Whenever you have to resort to regex-based script tag cleanup, at least allow for whitespace in the closing tag, in the form of
</script\s*>
Otherwise things like
<script>alert(666)</script >
would remain, since trailing spaces after tag names are valid.
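A short sketch of the difference this makes, applying the \s* suggestion to the jQuery-style regex quoted earlier on this page. Combining the two is an assumption made here for illustration, and the names STRICT_RE, LENIENT_RE and input are hypothetical.

```javascript
// Closing tag must be exactly "</script>".
const STRICT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Same pattern, but whitespace is allowed before the ">" of the closing tag.
const LENIENT_RE = /<script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\/script\s*>/gi;

const input = '<b>x</b><script>alert(666)</script >';

// The strict version finds no closing tag, so nothing is removed.
console.log(input.replace(STRICT_RE, '') === input); // true

// The lenient version removes the whole script element.
console.log(input.replace(LENIENT_RE, '')); // <b>x</b>
```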
Answer by shao
Why not use jQuery.parseHTML()? http://api.jquery.com/jquery.parsehtml/
Answer by Jason Sebring
In my case, I needed to parse out the page title and still have all the other goodness of jQuery, minus it firing scripts. Here is my solution, which seems to work.
$.get('/somepage.htm', function (data) {
  // excluded code to extract title for simplicity
  var bodySI = data.indexOf('<body>') + '<body>'.length,
      bodyEI = data.indexOf('</body>'),
      body = data.substr(bodySI, bodyEI - bodySI),
      $body;
  body = body.replace(/<script[^>]*>/gi, ' <!-- ');
  body = body.replace(/<\/script>/gi, ' --> ');
  //console.log(body);
  $body = $('<div>').html(body);
  console.log($body.html());
});
This kind of sidesteps the worries about scripts, because you are not trying to remove the script tags and their content; instead you are replacing the tags with comment delimiters, which renders the scripts useless, since comments now delimit your script declarations.
Let me know if that still presents a problem, as it will help me too.
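The two replace calls can be tried without jQuery or a network request. This is only a sketch: the helper name commentOutScripts and the sample string are made up for illustration.

```javascript
// Hypothetical helper: turn script tags into HTML comment delimiters.
function commentOutScripts(html) {
  return html
    .replace(/<script[^>]*>/gi, ' <!-- ')
    .replace(/<\/script>/gi, ' --> ');
}

const sample = '<p>a</p><script type="text/javascript">x()</script><p>b</p>';

// The script body survives as text, but only inside a comment.
console.log(commentOutScripts(sample)); // <p>a</p> <!-- x() --> <p>b</p>
```

One thing that may still be a problem: an HTML comment ends at the first -->, so a script body that itself contains --> would close the comment early.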
Answer by Shivanshu Goyal
If you want to remove all JavaScript code from some HTML text, then removing <script> tags isn't enough, because JavaScript can still live in "onclick", "onerror", "href" and other attributes.
Try out this npm module, which handles all of this: https://www.npmjs.com/package/strip-js
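To illustrate the point about attributes, here is a small sketch using the jQuery-style regex quoted earlier on this page. The sample markup and variable names are invented for illustration, and this does not show how strip-js itself works.

```javascript
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical markup with JavaScript in three different places.
const html =
  '<img src="x" onerror="alert(1)">' +
  '<a href="javascript:alert(2)">link</a>' +
  '<script>alert(3)</script>';

const stripped = html.replace(SCRIPT_RE, '');

// The script element is gone, but the handler attribute and the javascript: URL remain.
console.log(stripped);
// <img src="x" onerror="alert(1)"><a href="javascript:alert(2)">link</a>
```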
Answer by Pooja Roy
You can try
$("#your_div_id").remove();
or
$("#your_div_id").html("");
Answer by surinder singh
Try this:
var text = text.replace(/<script[^>]*>(?:(?!<\/script>)[^])*<\/script>/g, "")