How do I remove <script> tags from an HTML page using C#?

Question

Asked by StackOverflowVeryHelpful

<html>
    <head>
        <script type="text/javascript" src="jquery.js"></script>
        <script type="text/javascript">
            if (window.self === window.top) { $.getScript("Wing.js"); }
        </script>
   </head>
</html>

Is there a way in C# to modify the HTML file above and convert it to this format:

<html>
    <head>
    </head>
</html>

Basically, my goal is to remove all the JavaScript from the HTML page. I don't know what the best way to modify the HTML files is. I want to do it programmatically, as there are hundreds of files that need to be modified.

Answer 1

Accepted answer by Jerry

It can be done using regex:

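// input holds the HTML text of one file.
// IgnoreCase also catches <SCRIPT>; [\s\S]*? lazily spans newlines up to the nearest </script>.
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
string output = rRemScript.Replace(input, "");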
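
Since the question mentions hundreds of files, the same replacement could be applied in a loop over a folder. The following is only a minimal sketch under stated assumptions: the files are UTF-8 text with an .html extension, and `ScriptStripper`, `StripScripts`, and `StripFolder` are hypothetical names made up for illustration, not part of any library.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Same idea as the pattern above. IgnoreCase also catches <SCRIPT>,
    // \b avoids matching tags that merely start with "script",
    // and \s* tolerates a closing tag written as "</script >".
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>[\s\S]*?</script\s*>",
        RegexOptions.IgnoreCase);

    // Returns the HTML with every <script>...</script> block removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }

    // Rewrites every *.html file under rootFolder in place and returns
    // the number of files that actually changed.
    // Assumption: the files are UTF-8 text and a backup already exists.
    public static int StripFolder(string rootFolder)
    {
        int changed = 0;
        foreach (string path in Directory.EnumerateFiles(
                     rootFolder, "*.html", SearchOption.AllDirectories))
        {
            string original = File.ReadAllText(path);
            string stripped = StripScripts(original);
            if (stripped != original)
            {
                File.WriteAllText(path, stripped);
                changed++;
            }
        }
        return changed;
    }
}
```

Calling `ScriptStripper.StripFolder(folderPath)` on a copy of the site would be the cautious way to try it, because the files are overwritten in place.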

Answer 2

Answered by mckeejm

May be worth a look: HTML Agility Pack

Edit: specific working code

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string sampleHtml = 
    "<html>" +
        "<head>" + 
                "<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
                "<script type=\"text/javascript\">" + 
                    "if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
                "</script>" +
        "</head>" +
    "</html>";
// LoadHtml parses the string directly, so no MemoryStream or encoding round-trip is needed.
doc.LoadHtml(sampleHtml);

// Copy the <script> nodes into a list first: removing nodes while
// enumerating Descendants() would change the tree mid-iteration.
List<HtmlNode> scripts = new List<HtmlNode>(doc.DocumentNode.Descendants("script"));
foreach (HtmlNode script in scripts)
    script.Remove();
Console.WriteLine(doc.DocumentNode.OuterHtml);

Answer 3

Answered by jim tollan

As others have said, I think the HTML Agility Pack is the best route. I've used it to scrape pages and strip out loads of hard-to-handle corner cases. However, if a simple regex is your goal, you could try <script(.+?)*</script>. It is meant to remove nasty nested JavaScript as well as the normal stuff, i.e. the type referred to in the link (Regular Expression for Extracting Script Tags):

<html>
<head>
    <script type="text/javascript" src="jquery.js"></script>
    <script type="text/javascript">
        if (window.self === window.top) { $.getScript("Wing.js"); }
    </script>
    <script> // nested horror
    var s = "<script></script>";
    </script>
</head>
</html>

Usage:

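// .* stands in for the nested (.+?)*, which matches the same text but can backtrack heavily on long inputs.
// Singleline lets . match newlines, so the pattern can span multi-line script blocks.
// The match is greedy: it runs from the first <script to the last </script>,
// so any markup sitting between two script blocks is removed as well.
Regex regxScriptRemoval = new Regex(@"<script.*</script>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
var newHtml = regxScriptRemoval.Replace(oldHtml, "");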

return newHtml; // etc etc
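
How a lazy and a greedy pattern differ on the nested case can be checked with a small sketch. It is illustrative only: `ScriptRegexComparison`, `StripLazy`, and `StripGreedy` are hypothetical names, and the greedy pattern uses `[\s\S]*` as a simplified stand-in that is assumed to cover the same span as `<script(.+?)*</script>`.

```csharp
using System.Text.RegularExpressions;

public static class ScriptRegexComparison
{
    // Lazy: each match stops at the first </script> after the opening tag,
    // even when that </script> sits inside a JavaScript string literal.
    public static string StripLazy(string html)
    {
        return Regex.Replace(
            html, @"<script[^>]*>[\s\S]*?</script>", "", RegexOptions.IgnoreCase);
    }

    // Greedy: a single match runs from the first <script to the last
    // </script> in the input, taking everything in between with it.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(
            html, @"<script[\s\S]*</script>", "", RegexOptions.IgnoreCase);
    }
}
```

For the input `<head><script>var s = "<script></script>";</script><title>t</title><script>f();</script></head>`, `StripLazy` returns `<head>";</script><title>t</title></head>` (the tail of the nested script is left behind), while `StripGreedy` returns `<head></head>` (the nested script is gone, but so is the `<title>` between the two script blocks). Neither form is safe on arbitrary HTML, which is the argument for a real parser.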

Answer 4

Answered by ashuai

Using regex:

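// Strips the listed tags themselves (opening and closing); the text between
// <script> and </script> is left in place.
// \b keeps longer names such as <metadata> from matching, and [^>]* already spans newlines.
string result = Regex.Replace(
    input, 
    @"</?(?:script|embed|object|frameset|frame|iframe|meta|link|style)\b[^>]*>", 
    string.Empty, 
    RegexOptions.IgnoreCase
);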

Answer 5

Answered by Jenny O'Reilly

This may seem like a strange solution.

If you don't want to use a third-party library and don't need to actually remove the script code, only disable it, you could do this:

html = Regex.Replace(html, @"<script[^>]*>", "<!--", RegexOptions.IgnoreCase);
html = Regex.Replace(html, @"</script>", "-->", RegexOptions.IgnoreCase);

This turns each script element into an HTML comment.
