Best regex to catch XSS (Cross-site Scripting) attack (in Java)?

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24723/

Time: 2020-08-11 07:16:15  Source: igfitidea

Tags: java, html, regex, xss

Asked by Thierry-Dimitri Roy

Jeff actually posted about this in Sanitize HTML. But his example is in C# and I'm actually more interested in a Java version. Does anyone have a better version for Java? Is his example good enough to just convert directly from C# to Java?

[Update] I have put a bounty on this question because SO wasn't as popular when I asked the question as it is today (*). As for anything related to security, the more people look into it, the better it is!

(*) In fact, I think it was still in closed beta.

Accepted answer by Chase Seibert

Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.

For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real-world attack on Yahoo and Hotmail on IE6/7/8?

<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS&lt;SCRIPT DEFER&gt;alert(&quot;XSS&quot;)&lt;/SCRIPT&gt;">
</BODY></HTML>

How about this attack that works on IE6?

<TABLE BACKGROUND="javascript:alert('XSS')">
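
To make the point concrete, here is a minimal sketch of a blacklist-style filter. The `NaiveScriptFilter` class and its pattern are hypothetical (they are not Jeff's code); they only illustrate how a payload that contains no `<script>` element passes through such a filter unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical blacklist filter, for illustration only: do not use it for sanitizing.
final class NaiveScriptFilter {
    // Matches <script ...>...</script>, case-insensitively, across line breaks.
    private static final Pattern SCRIPT_ELEMENT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes every match of the pattern and returns whatever is left.
    static String strip(String html) {
        return SCRIPT_ELEMENT.matcher(html).replaceAll("");
    }
}
```

Calling `NaiveScriptFilter.strip` on the `TABLE BACKGROUND` line above returns it untouched, because nothing in it matches the pattern.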

How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page adeptly notes:

The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.

I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, and then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML.

The best part is that it actually has unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:

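import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;
import org.owasp.validator.html.PolicyException;
import org.owasp.validator.html.ScanException;

// POLICY_FILE is the path to an AntiSamy policy XML file, defined elsewhere.
public String toSafeHtml(String html) throws ScanException, PolicyException {
    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}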

Answer by svrist

The biggest problem with using Jeff's code is the @ (the C# verbatim string prefix), which Java currently doesn't have.

I would probably just take the "raw" regexp from Jeff's code if I needed it and paste it into

http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

and see the characters that need escaping get escaped, and then use it.
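
For comparison, the escaping can also be done by hand. A C# verbatim string (`@"..."`) keeps backslashes exactly as written, while a Java string literal needs every regex backslash doubled. The sketch below uses a made-up pattern and a hypothetical `EscapedRegexExample` class, not Jeff's actual regex:

```java
import java.util.regex.Pattern;

final class EscapedRegexExample {
    // In C# this hypothetical pattern could be written as the verbatim string
    //   @"^<(b|h)r\s?/?>$"
    // In a Java string literal, every backslash of the regex must be doubled.
    static final Pattern LINE_OR_RULE_TAG = Pattern.compile("^<(b|h)r\\s?/?>$");

    // True only for <br> or <hr>, optionally self-closed.
    static boolean isLineOrRuleTag(String tag) {
        return LINE_OR_RULE_TAG.matcher(tag).matches();
    }
}
```

Double quotes inside the pattern would need escaping as `\"` in the same way.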

Keeping in mind how this regex will be used, I would personally make sure I understood exactly what I was doing, why, and what the consequences would be if I didn't succeed, before copying and pasting anything, as the other answers try to help you do.

(That's probably pretty sound advice for any copy/paste.)

Answer by potyl

I'm not too convinced that using a regular expression is the best way to find all suspect code. Regular expressions are quite easy to trick, especially when dealing with broken HTML. For example, the regular expression listed in the Sanitize HTML link will fail to remove all 'a' elements that have an attribute between the element name and the attribute 'href':

< a alt="xss injection" href="http://www.malicous.com/bad.php" >
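
A minimal sketch of that failure mode, assuming a hypothetical pattern that expects `href` to be the first attribute (it is not the exact regex from the Sanitize HTML post):

```java
import java.util.regex.Pattern;

final class AnchorPatternExample {
    // Hypothetical pattern: only recognizes <a> tags whose first attribute is href.
    private static final Pattern ANCHOR_WITH_HREF_FIRST =
            Pattern.compile("<a\\s+href=\"[^\"]*\"[^>]*>", Pattern.CASE_INSENSITIVE);

    // True if the pattern finds an anchor tag anywhere in the given text.
    static boolean recognizes(String tag) {
        return ANCHOR_WITH_HREF_FIRST.matcher(tag).find();
    }
}
```

With this pattern, an anchor that starts with `href` is recognized, while the same anchor with an `alt` attribute in front of `href` slips past unnoticed.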

A more robust way of removing malicious code is to rely on an XML parser that can handle all kinds of HTML documents (Tidy, TagSoup, etc.) and to select the elements to remove with an XPath expression. Once the HTML document is parsed into a DOM document, the elements to remove can be found easily and safely. This is even easy to do with XSLT.
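
The parse-then-select idea can be sketched with the XML APIs that ship with the JDK. This is only an illustration under several assumptions: the input has already been turned into well-formed, lowercase, namespace-free XML (for example by Tidy or TagSoup), the `XPathElementRemover` class and its short element list are hypothetical, and attributes such as event handlers are not handled at all.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathElementRemover {
    // Elements to drop; a hypothetical, deliberately short list for illustration.
    private static final String UNWANTED = "//script | //iframe | //object | //embed";

    // Assumes `xhtml` is already well-formed XML, e.g. the output of Tidy or TagSoup.
    static String removeUnwanted(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so the parser cannot be abused for XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select the unwanted elements with XPath, then detach each one.
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(UNWANTED, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                if (node.getParentNode() != null) {
                    node.getParentNode().removeChild(node);
                }
            }

            // Serialize the cleaned DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Because the selection runs on the parsed tree rather than on raw text, odd spacing or attribute order in the original markup no longer matters. Note that XPath names are case-sensitive, which is why the sketch assumes lowercase element names.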

Answer by Brian

[\s\w\.]*. If the whole input doesn't match, you've got XSS. Maybe. Note that this expression only allows whitespace, letters, digits, underscores, and periods. It avoids all other symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries, and merely replacing all instances of & with &amp; is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (you can just replace all non-matching characters with a ! or something), but I think it will kill XSS.
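
A minimal Java sketch of this strict character whitelist, using a hypothetical `StrictCharacterWhitelist` class. It assumes the check is meant to cover the whole input, which is what `Matcher.matches()` does; by default Java's `\w` is ASCII-only, so accented letters are rejected as well.

```java
import java.util.regex.Pattern;

final class StrictCharacterWhitelist {
    // Whitespace, word characters (letters, digits, underscore) and periods only.
    private static final Pattern ALLOWED = Pattern.compile("[\\s\\w.]*");
    private static final Pattern DISALLOWED_CHAR = Pattern.compile("[^\\s\\w.]");

    // matches() requires the whole input to match, not just a prefix.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    // Replaces every character outside the whitelist with '!'.
    static String neutralize(String input) {
        return DISALLOWED_CHAR.matcher(input).replaceAll("!");
    }
}
```

For example, `neutralize("<b>hi</b>")` turns each of `<`, `>`, and `/` into `!`, which shows how much legitimate markup this approach gives up.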

The idea to just parse it as html and generate new html is probably better.

Answer by Brian

^(\s|\w|\d|<br>)*?$ 

This will validate word characters, digits, whitespace, and the <br> tag. If you are willing to accept more risk, you can add more tags, like

^(\s|\w|\d|<br>|<ul>|</ul>)*?$
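
A hedged Java sketch of this kind of tag whitelist, with `</ul>` as the closing tag. The `TagWhitelistValidator` class is hypothetical, and the pattern is a rewritten variant rather than the expression above: since `\w` already covers digits, the overlapping `\w|\d` alternatives can make a backtracking engine do a lot of redundant work on long inputs that do not match, so the sketch folds them into one character class and uses a possessive quantifier. It is intended to accept the same inputs.

```java
import java.util.regex.Pattern;

final class TagWhitelistValidator {
    // Plain text plus a fixed set of tags. [\s\w] replaces \s|\w|\d, and the
    // possessive *+ never backtracks into iterations that already matched.
    private static final Pattern SAFE_TEXT =
            Pattern.compile("^(?:[\\s\\w]|<br>|<ul>|</ul>)*+$");

    // True only if the whole input consists of allowed characters and tags.
    static boolean isSafe(String input) {
        return SAFE_TEXT.matcher(input).matches();
    }
}
```

Every additional tag added to the alternation widens what is accepted, so the list is best kept as short as the application allows.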

Answer by user3709489

I extracted this regex from NoScript, the best anti-XSS add-on. It works flawlessly:

<[^\w<>]*(?:[^<>"'\s]*:)?[^\w<>]*(?:\W*s\W*c\W*r\W*i\W*p\W*t|\W*f\W*o\W*r\W*m|\W*s\W*t\W*y\W*l\W*e|\W*s\W*v\W*g|\W*m\W*a\W*r\W*q\W*u\W*e\W*e|(?:\W*l\W*i\W*n\W*k|\W*o\W*b\W*j\W*e\W*c\W*t|\W*e\W*m\W*b\W*e\W*d|\W*a\W*p\W*p\W*l\W*e\W*t|\W*p\W*a\W*r\W*a\W*m|\W*i?\W*f\W*r\W*a\W*m\W*e|\W*b\W*a\W*s\W*e|\W*b\W*o\W*d\W*y|\W*m\W*e\W*t\W*a|\W*i\W*m\W*a?\W*g\W*e?|\W*v\W*i\W*d\W*e\W*o|\W*a\W*u\W*d\W*i\W*o|\W*b\W*i\W*n\W*d\W*i\W*n\W*g\W*s|\W*s\W*e\W*t|\W*i\W*s\W*i\W*n\W*d\W*e\W*x|\W*a\W*n\W*i\W*m\W*a\W*t\W*e)[^>\w])|(?:<\w[\s\S]*[\s\0\/]|['"])(?:formaction|style|background|src|lowsrc|ping|on(?:d(?:e(?:vice(?:(?:orienta|mo)tion|proximity|found|light)|livery(?:success|error)|activate)|r(?:ag(?:e(?:n(?:ter|d)|xit)|(?:gestur|leav)e|start|drop|over)?|op)|i(?:s(?:c(?:hargingtimechange|onnect(?:ing|ed))|abled)|aling)|ata(?:setc(?:omplete|hanged)|(?:availabl|chang)e|error)|urationchange|ownloading|blclick)|Moz(?:M(?:agnifyGesture(?:Update|Start)?|ouse(?:PixelScroll|Hittest))|S(?:wipeGesture(?:Update|Start|End)?|crolledAreaChanged)|(?:(?:Press)?TapGestur|BeforeResiz)e|EdgeUI(?:C(?:omplet|ancel)|Start)ed|RotateGesture(?:Update|Start)?|A(?:udioAvailable|fterPaint))|c(?:o(?:m(?:p(?:osition(?:update|start|end)|lete)|mand(?:update)?)|n(?:t(?:rolselect|extmenu)|nect(?:ing|ed))|py)|a(?:(?:llschang|ch)ed|nplay(?:through)?|rdstatechange)|h(?:(?:arging(?:time)?ch)?ange|ecking)|(?:fstate|ell)change|u(?:echange|t)|l(?:ick|ose))|m(?:o(?:z(?:pointerlock(?:change|error)|(?:orientation|time)change|fullscreen(?:change|error)|network(?:down|up)load)|use(?:(?:lea|mo)ve|o(?:ver|ut)|enter|wheel|down|up)|ve(?:start|end)?)|essage|ark)|s(?:t(?:a(?:t(?:uschanged|echange)|lled|rt)|k(?:sessione|comma)nd|op)|e(?:ek(?:complete|ing|ed)|(?:lec(?:tstar)?)?t|n(?:ding|t))|u(?:ccess|spend|bmit)|peech(?:start|end)|ound(?:start|end)|croll|how)|b(?:e(?:for(?:e(?:(?:scriptexecu|activa)te|u(?:nload|pdate)|p(?:aste|rint)|c(?:opy|ut)|editfocus)|deactivate)|gin(?:Event)?)|oun(?:dary|ce)|l(?:ocked|ur)|roadcast|usy)|a(?:n(?:imation(?:iteration|start|end)|tennastatechange)|fter(?:(?:scriptexecu|upda)te|print)|udio(?:process|start|end)|d(?:apteradded|dtrack)|ctivate|lerting|bort)|DOM(?:Node(?:Inserted(?:IntoDocument)?|Removed(?:FromDocument)?)|(?:CharacterData|Subtree)Modified|A(?:ttrModified|ctivate)|Focus(?:Out|In)|MouseScroll)|r(?:e(?:s(?:u(?:m(?:ing|e)|lt)|ize|et)|adystatechange|pea(?:tEven)?t|movetrack|trieving|ceived)|ow(?:s(?:inserted|delete)|e(?:nter|xit))|atechange)|p(?:op(?:up(?:hid(?:den|ing)|show(?:ing|n))|state)|a(?:ge(?:hide|show)|(?:st|us)e|int)|ro(?:pertychange|gress)|lay(?:ing)?)|t(?:ouch(?:(?:lea|mo)ve|en(?:ter|d)|cancel|start)|ime(?:update|out)|ransitionend|ext)|u(?:s(?:erproximity|sdreceived)|p(?:gradeneeded|dateready)|n(?:derflow|load))|f(?:o(?:rm(?:change|input)|cus(?:out|in)?)|i(?:lterchange|nish)|ailed)|l(?:o(?:ad(?:e(?:d(?:meta)?data|nd)|start)?|secapture)|evelchange|y)|g(?:amepad(?:(?:dis)?connected|button(?:down|up)|axismove)|et)|e(?:n(?:d(?:Event|ed)?|abled|ter)|rror(?:update)?|mptied|xit)|i(?:cc(?:cardlockerror|infochange)|n(?:coming|valid|put))|o(?:(?:(?:ff|n)lin|bsolet)e|verflow(?:changed)?|pen)|SVG(?:(?:Unl|L)oad|Resize|Scroll|Abort|Error|Zoom)|h(?:e(?:adphoneschange|l[dp])|ashchange|olding)|v(?:o(?:lum|ic)e|ersion)change|w(?:a(?:it|rn)ing|heel)|key(?:press|down|up)|(?:AppComman|Loa)d|no(?:update|match)|Request|zoom))[\s\0]*=

Test: http://regex101.com/r/rV7zK8

I think it blocks 99% of XSS because it is part of NoScript, an add-on that gets updated regularly.

Answer by KIC

An old thread, but maybe this will be useful for other users. There is a maintained security layer tool for PHP: https://github.com/PHPIDS/ It is based on a set of regexes, which you can find here:

https://github.com/PHPIDS/PHPIDS/blob/master/lib/IDS/default_filter.xml

Answer by Philip DiSarro

This question perfectly illustrates a great application of the study of Theory of Computation. Theory of Computation is a field that focuses on producing and studying mathematical representations for computation.

Some of the most profound research in computation theory includes the proofs that establish the relationships between various classes of languages.

Some of the language relationships that computation theorists have proven include:

[Image not preserved in this copy.]

This shows that context-free languages are strictly more powerful than regular languages. Thus, if a language is strictly context-free (context-free and not regular), then it is impossible for any regular expression to recognize it.
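
As a concrete instance of a language that is context-free but not regular, take strings of properly nested angle brackets. Recognizing them requires counting to an unbounded depth, which a finite automaton (and therefore a regular expression in the formal sense) cannot do. The sketch below uses a hypothetical `NestingChecker` class to show how little code the counting approach needs:

```java
final class NestingChecker {
    // Returns true if every '<' is closed by a later '>' and nothing closes early.
    static boolean isBalanced(String input) {
        int depth = 0; // unbounded counter: the part a finite automaton lacks
        for (char c : input.toCharArray()) {
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                if (--depth < 0) {
                    return false; // a '>' appeared with nothing open
                }
            }
        }
        return depth == 0;
    }
}
```

Practical regex engines add features beyond formal regular expressions (backreferences, for example), so the theoretical result applies to the formal notion rather than to every engine feature.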

JavaScript is at the very least context-free, thus we know with one-hundred percent certainty that designing a regular expression (regex) capable of catching all XSS is a mathematically impossible task.
