Java: Best way to remove Javascript from HTML

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4156723/

Date: 2020-10-25 10:32:07  Source: igfitidea

java javascript xss

Asked by mtyson

What's the best library/approach for removing Javascript from HTML that will be displayed?

For example, take:

<html><body><span onmousemove='doBadXss()'>test</span></body></html>

and leave:

<html><body><span>test</span></body></html>

I see the DeXSS project. But is that the best way to go?

Answered by beetstra

JSoup has a simple method for sanitizing HTML based on a whitelist. Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So only a whitelist protects against unknown, possibly unsafe constructions.
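
To make the difference concrete, here is a minimal sketch in plain Java (no JSoup or DeXSS involved) that filters one element's attributes both ways. The class name, the method names, and the contents of both lists are hypothetical and chosen only for illustration; they are not taken from either library.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class AttributeFilterDemo {
    // Hypothetical lists, for illustration only.
    static final Set<String> BLACKLIST = Set.of("onclick", "onmouseover");
    static final Set<String> WHITELIST = Set.of("title", "lang");

    // Blacklist: drop only the attribute names we already know to be unsafe.
    static Map<String, String> blacklistFilter(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>(attrs);
        kept.keySet().removeIf(name -> BLACKLIST.contains(name.toLowerCase(Locale.ROOT)));
        return kept;
    }

    // Whitelist: keep only the attribute names we already know to be safe.
    static Map<String, String> whitelistFilter(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>(attrs);
        kept.keySet().removeIf(name -> !WHITELIST.contains(name.toLowerCase(Locale.ROOT)));
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("title", "greeting");
        attrs.put("onmousemove", "doBadXss()"); // not on the blacklist above
        System.out.println(blacklistFilter(attrs)); // onmousemove survives
        System.out.println(whitelistFilter(attrs)); // only title survives
    }
}
```

The `onmousemove` handler from the question slips through the blacklist simply because nobody listed it, while the whitelist drops it without needing to know it exists.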

Answered by haylem

The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
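
One way to get "very simple tags, no attributes" is to escape everything and then restore only the exact, attribute-free forms of a few tags. The sketch below assumes that approach; the class name `SimpleTagFilter` and the tag list are hypothetical. Note that it shows disallowed markup as visible text instead of removing it, which is a design choice and not something the answer prescribes.

```java
import java.util.List;

final class SimpleTagFilter {
    // Hypothetical allow-list, for illustration only.
    private static final List<String> ALLOWED = List.of("b", "i", "em", "strong", "p");

    static String sanitize(String input) {
        // Escape every markup character first, so nothing the user typed is live HTML.
        String escaped = input
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
        // Then restore only the exact, attribute-free forms of the allowed tags.
        for (String tag : ALLOWED) {
            escaped = escaped
                    .replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                    .replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b>hi</b> <span onmousemove='doBadXss()'>test</span>"));
    }
}
```

Because a tag with any attribute never matches the exact escaped form, it stays escaped. The sketch does not balance tags, so a stray `</b>` can remain in the output; a real deployment would likely want a proper parser for that.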

Similarly, an even easier approach would be to provide a text-based syntax, like Markdown, for editing. (There are not that many ways to exploit the SO edit area, for instance: Markdown syntax plus a limited tag list without attributes.)

Answered by Richard H

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). This is a DOM parser (as opposed to SAX) and allows you to easily traverse and manipulate the DOM, removing node attributes like onmouseover, for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your HTML is, you may need to clean it up first; jtidy (http://jtidy.sourceforge.net/) is good.

But obviously doing all this involves some overhead if you're doing this at page render time.
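
The traverse-and-remove idea can be sketched as follows. To keep it self-contained, this uses the JDK's built-in `org.w3c.dom` parser instead of dom4j, so it is not dom4j's API. It assumes the input is already well-formed XML without a DOCTYPE (for example after a tidy pass), and the class name `DomScriptStripper` is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomScriptStripper {

    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            clean(doc.getDocumentElement());

            // Serialize the cleaned tree back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static void clean(Element element) {
        // Collect names first: the attribute map is live, so removing while iterating skips entries.
        NamedNodeMap attrs = element.getAttributes();
        List<String> toRemove = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.toLowerCase(Locale.ROOT).startsWith("on")) {
                toRemove.add(name);
            }
        }
        toRemove.forEach(element::removeAttribute);

        // Walk children backwards so removals do not shift the indexes still to be visited.
        NodeList children = element.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (!(child instanceof Element)) {
                continue;
            }
            if ("script".equalsIgnoreCase(child.getNodeName())) {
                element.removeChild(child);
            } else {
                clean((Element) child);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```

This sketch removes only `on*` attributes and `<script>` elements, which makes it a blacklist: it does nothing about, say, `javascript:` URLs in `href`, so the whitelist caveat from the earlier answer applies to it as well.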
