Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, link to the original, and attribute it to the original authors (not this site): Stack Overflow
Original question: http://stackoverflow.com/questions/19414829/
How to remove <script> tags from an HTML page using C#?
Asked by StackOverflowVeryHelpful
<html>
<head>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript">
if (window.self === window.top) { $.getScript("Wing.js"); }
</script>
</head>
</html>
Is there a way in C# to modify the above HTML file and convert it into this format:
<html>
<head>
</head>
</html>
Basically, my goal is to remove all the JavaScript from the HTML page. I don't know what the best way to modify the HTML files is. I want to do it programmatically, as there are hundreds of files that need to be modified.
Accepted answer by Jerry
It can be done using regex:
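// Requires: using System.Text.RegularExpressions; input holds the HTML text.
// Lazily match each <script ...>...</script> block; [\s\S] also matches newlines.
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
string output = rRemScript.Replace(input, "");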
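Since the question mentions hundreds of files, the same replacement could be applied to a whole folder. The following is a minimal sketch and not part of the original answer: the names ScriptStripper, StripScripts and StripFolder, as well as the *.html file mask, are assumptions made for illustration. Files are read and written with the default encoding handling of File.ReadAllText and File.WriteAllText, so it is probably wise to back up the originals first.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Same pattern as in the answer, plus IgnoreCase so <SCRIPT> is matched too.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>[\s\S]*?</script>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the HTML with every <script>...</script> block removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }

    // Rewrites every *.html file under rootFolder in place
    // (hypothetical folder layout and file mask).
    public static void StripFolder(string rootFolder)
    {
        foreach (string path in Directory.EnumerateFiles(
                     rootFolder, "*.html", SearchOption.AllDirectories))
        {
            string original = File.ReadAllText(path);
            string cleaned = StripScripts(original);

            // Only touch files that actually contained a script block.
            if (cleaned != original)
            {
                File.WriteAllText(path, cleaned);
            }
        }
    }
}
```

A call such as ScriptStripper.StripFolder(@"C:\site") (hypothetical path) would then process the whole tree, while StripScripts can be used on a single string.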
Answer by mckeejm
May be worth a look: HTML Agility Pack
Edit: specific working code
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string sampleHtml =
"<html>" +
"<head>" +
"<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
"<script type=\"text/javascript\">" +
"if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
"</script>" +
"</head>" +
"</html>";
// Parse the string directly; no MemoryStream round-trip through ASCII is needed.
doc.LoadHtml(sampleHtml);
// Remove only the <script> elements, wherever they occur, instead of every child of <head>.
// ToList() (using System.Linq) snapshots the matches so nodes can be removed while iterating.
List<HtmlNode> scripts = doc.DocumentNode.Descendants("script").ToList();
foreach (HtmlNode script in scripts)
    script.Remove();
Console.WriteLine(doc.DocumentNode.OuterHtml);
Answer by jim tollan
As others have said, I think HTML Agility Pack is the best route. I've used it to scrape pages and to remove loads of hard-to-handle corner cases. However, if a simple regex is your goal, then maybe you could try <script(.+?)*</script>. Note that, as written, . does not match newlines in .NET, so the pattern only matches when <script and </script> sit on the same line. This will remove nasty nested JavaScript as well as normal stuff, i.e. the type referred to in the link (Regular Expression for Extracting Script Tags):
<html>
<head>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript">
if (window.self === window.top) { $.getScript("Wing.js"); }
</script>
<script> // nested horror
var s = "<script></script>";
</script>
</head>
</html>
Usage:
Regex regxScriptRemoval = new Regex(@"<script(.+?)*</script>");
var newHtml = regxScriptRemoval.Replace(oldHtml, "");
return newHtml; // etc etc
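To see how the pattern treats the nested case, it can be compared with a plain lazy pattern on a one-line input. This is only an illustrative sketch: NestedScriptDemo, AnswerPattern, LazyPattern and Strip are made-up names, and LazyPattern is an assumption of mine rather than something proposed in the answer. Because of the nested quantifier, the answer's pattern ends up matching through the last </script> it can reach, which is why it swallows the nested tag, but it can also swallow markup that sits between two script blocks and may backtrack heavily on long inputs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedScriptDemo
{
    // Pattern from the answer. No Singleline option, so '.' stops at newlines.
    public static readonly Regex AnswerPattern =
        new Regex(@"<script(.+?)*</script>");

    // Assumed alternative: lazy match that stops at the first </script>.
    public static readonly Regex LazyPattern = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes every match of the given pattern from the HTML text.
    public static string Strip(Regex pattern, string html)
    {
        return pattern.Replace(html, string.Empty);
    }
}
```

For the input <head><script>var s = "<script></script>";</script></head>, Strip with AnswerPattern gives <head></head>, while LazyPattern stops at the inner closing tag and leaves <head>";</script></head>. For <script>a()</script><p>keep</p><script>b()</script>, AnswerPattern removes everything, including the paragraph, while LazyPattern leaves <p>keep</p>. Neither behaviour is right for every document, which is one reason a real parser such as HTML Agility Pack tends to be the safer choice.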
Answer by ashuai
Using regex:
string result = Regex.Replace(
input,
@"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n|\s)*?>",
string.Empty,
RegexOptions.Singleline | RegexOptions.IgnoreCase
);
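One thing to be aware of is that the pattern above matches individual opening and closing tags, so the text between <script> and </script> (the JavaScript itself) stays in the output. A possible way around this, sketched below under my own assumptions, is to remove whole script and style elements first and only then strip the remaining tags. TagCleaner, StripTagsOnly, StripBlocksThenTags and BlockPattern are hypothetical names that do not appear in the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagCleaner
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Pattern from the answer: matches single opening or closing tags only.
    private const string TagPattern =
        @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n|\s)*?>";

    // Assumed extra pass: whole <script>/<style> elements including their bodies.
    private const string BlockPattern = @"<(script|style)\b[^>]*>.*?</\1\s*>";

    // Applies the answer's pattern as-is; element bodies are left behind.
    public static string StripTagsOnly(string input)
    {
        return Regex.Replace(input, TagPattern, string.Empty, Options);
    }

    // Removes script/style elements with their content, then the remaining tags.
    public static string StripBlocksThenTags(string input)
    {
        string withoutBlocks =
            Regex.Replace(input, BlockPattern, string.Empty, Options);
        return StripTagsOnly(withoutBlocks);
    }
}
```

With the input <p>a</p><script>alert(1)</script>, StripTagsOnly returns <p>a</p>alert(1), whereas StripBlocksThenTags returns <p>a</p>.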
Answer by Jenny O'Reilly
This may seem like a strange solution.
If you don't want to use a third-party library and don't need to actually remove the script code, only disable it, you could do this:
html = Regex.Replace(html, @"<script[^>]*>", "<!--");
html = Regex.Replace(html, @"</script>", "-->");
This creates an HTML comment out of script tags.
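A caveat with this trick is that an HTML comment ends at the first -->, so a script body that already contains that character sequence (for example while (n --> 0)) would close the generated comment early and expose the rest of the code as page text. The sketch below wraps the two replacements and adds a check for that situation. ScriptCommenter, HasCommentTerminator and CommentOutScripts are hypothetical names, and the check is my own addition rather than part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptCommenter
{
    // Captures the body of each script element in group 1.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>([\s\S]*?)</script>", RegexOptions.IgnoreCase);

    // True if any script body contains "-->", which would end the
    // generated comment before the closing script tag is reached.
    public static bool HasCommentTerminator(string html)
    {
        foreach (Match m in ScriptBlock.Matches(html))
        {
            if (m.Groups[1].Value.Contains("-->"))
            {
                return true;
            }
        }
        return false;
    }

    // Same two replacements as in the answer, with IgnoreCase added.
    public static string CommentOutScripts(string html)
    {
        html = Regex.Replace(html, @"<script[^>]*>", "<!--", RegexOptions.IgnoreCase);
        return Regex.Replace(html, @"</script>", "-->", RegexOptions.IgnoreCase);
    }
}
```

A caller could run HasCommentTerminator first and fall back to real removal for the files where it returns true.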