Java: Regular expression for excluding special characters
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/756567/
Question
I am having trouble coming up with a regular expression that would essentially blacklist certain special characters.
I need to use this to validate data in input fields (in a Java web app). We want to allow users to enter any digit, any letter (including accented characters, e.g. French or German) and some special characters such as ', - and .
How do I blacklist characters such as <, >, % and $?
Answer by Jason Coyne
It's usually better to whitelist the characters you allow than to blacklist the characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned: the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accented characters, perhaps using Unicode ranges would help? Check out this link.
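A whitelist built from Unicode code-point ranges might look like the following sketch in Java. The specific ranges (Latin-1 Supplement and Latin Extended-A/B letters), the extra punctuation, and the class and method names are assumptions chosen for illustration, not something the answer specifies.

```java
import java.util.regex.Pattern;

final class UnicodeRangeWhitelist {
    // Assumed whitelist: ASCII letters and digits, accented Latin letters from
    // U+00C0 to U+024F (skipping U+00D7 and U+00F7, the multiplication and
    // division signs), plus apostrophe, hyphen, period and space.
    private static final Pattern ALLOWED = Pattern.compile(
        "^[0-9A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u024F'\\-. ]*$");

    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Non-ASCII test data is written with Unicode escapes to avoid source-encoding issues.
        for (String s : new String[] {"M\u00fcller", "O'Brien Jr.", "<script>", "100%"}) {
            System.out.println(s + " -> " + isAllowed(s));
        }
    }
}
```

Code-point ranges like these only cover the scripts you list explicitly, so they suit applications that target a known set of languages.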
Answer by Lucero
Do you really want to blacklist specific characters, or rather whitelist the allowed characters?
I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):
^(?:\p{L}\p{M}*|[\-])*$
Edit: Optimized the pattern with the input from the comments.
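In Java, the pattern can be used roughly as follows. This is a sketch: the class name, the helper method, and the EXTENDED variant (digits, apostrophe, period and space added to the bracket group) are hypothetical additions for illustration.

```java
import java.util.regex.Pattern;

final class LetterWhitelist {
    // The pattern from the answer: any letter followed by optional combining marks, or a hyphen.
    static final Pattern LETTERS_AND_HYPHEN = Pattern.compile("^(?:\\p{L}\\p{M}*|[\\-])*$");

    // Hypothetical extension: decimal digits, apostrophe, period and space added to the bracket group.
    static final Pattern EXTENDED = Pattern.compile("^(?:\\p{L}\\p{M}*|[\\p{Nd}'. \\-])*$");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        // "Zoe\u0308" is "Zoe" followed by a combining diaeresis, which \p{M}* accepts.
        System.out.println(matches(LETTERS_AND_HYPHEN, "Zoe\u0308"));
        System.out.println(matches(LETTERS_AND_HYPHEN, "abc123"));
        System.out.println(matches(EXTENDED, "abc123"));
    }
}
```

Note that the pattern as given does not accept digits, so the digits requested in the question have to be added to the bracket group, as the EXTENDED variant does.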
Answer by Daniel Brückner
I would just whitelist the characters.
^[a-zA-Z0-9äöüÄÖÜ]*$
Building a blacklist is equally simple with regex, but you might need to add many more characters; there are a lot of Chinese symbols in Unicode ... ;)
^[^<>%$]*$
The expression [^(many characters here)] just matches any character that is not listed.
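A minimal Java sketch of the negated-class blacklist follows; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class NegatedClassCheck {
    // matches() already requires the whole input to match, so the ^ and $
    // anchors are redundant here but harmless.
    private static final Pattern NO_BLACKLISTED = Pattern.compile("^[^<>%$]*$");

    static boolean isClean(String input) {
        return NO_BLACKLISTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"plain text", "a<b", "5$"}) {
            System.out.println(s + " -> " + isClean(s));
        }
    }
}
```

Inside the brackets the dollar sign is an ordinary character, so it needs no escaping in this position.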
Answer by KarstenF
I guess it depends on what language you are targeting. In general, something like this should work:
[^<>%$]
The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, i.e. it matches any character OTHER than those listed.
You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.
Answer by BlairHippo
I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed than of the ones that aren't, and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".
Answer by David Grayson
To exclude certain characters (<, >, %, and $), you can make a regular expression like this:
[<>%\$]
This regular expression will match all inputs that have a blacklisted character in them. The brackets define a character class, and the \ before the dollar sign is there because the dollar sign has a special meaning in regular expressions (inside a character class the escape is not strictly required, but it is harmless).
To add more characters to the blacklist, just insert them between the brackets; order does not matter.
According to some Java documentation for regular expressions, you could use the expression like this:
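import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\\$" is the Java string form of the regex \$ (a literal dollar sign).
Pattern p = Pattern.compile("[<>%\\$]");
Matcher m = p.matcher(unsafeInputString);
// find() looks for a blacklisted character anywhere in the input;
// matches() would only succeed if the whole input were one such character.
if (m.find())
{
// Invalid input: reject it, or remove/change the offending characters.
}
else
{
// Valid input.
}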
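The comment in the snippet mentions removing the offending characters instead of rejecting the input. A sketch of both options, with hypothetical class and method names, could look like this:

```java
import java.util.regex.Pattern;

final class BlacklistSanitizer {
    // One blacklisted character; compiled once and reused.
    private static final Pattern BLACKLISTED = Pattern.compile("[<>%$]");

    // True if at least one blacklisted character occurs anywhere in the input.
    static boolean hasBlacklisted(String input) {
        return BLACKLISTED.matcher(input).find();
    }

    // Returns the input with every blacklisted character removed.
    static String stripBlacklisted(String input) {
        return BLACKLISTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String unsafe = "<b>50%</b> $5";
        System.out.println(hasBlacklisted(unsafe));
        System.out.println(stripBlacklisted(unsafe));
    }
}
```

Whether silently stripping characters is acceptable depends on the field; for most form validation, rejecting the input and telling the user why is the safer choice.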
Answer by DJClayworth
Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
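The loop-based check could be sketched in Java like this. The forbidden set <>%$ comes from the question; the class and method names are hypothetical.

```java
final class CharLoopCheck {
    // Characters that must not appear in the input.
    private static final String FORBIDDEN = "<>%$";

    // Scans the input once and stops at the first forbidden character.
    static boolean containsForbidden(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (FORBIDDEN.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("price: 5$"));
        System.out.println(containsForbidden("price: 5 EUR"));
    }
}
```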
Answer by Armstrongest
Here are all the French accented characters: àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ'ñ
I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.
For URLs, I replace accented characters with regular letters like so:
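// C#: each character in beforeConversion maps to the character at the same index in afterConversion.
string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ'ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
// "input" is a placeholder for the text to clean; the original answer does not show where it comes from.
string cleaned = input;
for (int i = 0; i < beforeConversion.Length; i++) {
cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}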
There's probably a more efficient way, mind you.
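In Java, one alternative to a hand-written mapping table is Unicode normalization. This is a different technique from the loop above: the sketch below decomposes each accented letter into a base letter plus combining marks with java.text.Normalizer, then drops the marks. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes accented letters (NFD) and removes the combining marks that result.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Non-ASCII test data is written with Unicode escapes to avoid source-encoding issues.
        System.out.println(stripAccents("Cr\u00e8me br\u00fbl\u00e9e"));
    }
}
```

Letters that have no canonical decomposition, such as the German sharp s or the Scandinavian o with stroke, pass through unchanged, so an explicit mapping is still needed for those.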
Answer by Patanjali
Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid.
The characters to blacklist then need to be chosen according to what is illegal for the purpose for which the data is required.
However, sometimes it pays to break down the requirements and handle each separately. Here look-ahead is your friend. Look-aheads are sections bounded by (?=) for positive and (?!) for negative, and they effectively become AND blocks: when a block has been processed without failing, the regex processor begins the next block at the start of the text again. Effectively, each look-ahead block behaves as if it were preceded by ^ and, if its pattern is greedy, can include everything up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.
So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.
For example, to limit the total number of characters, say to between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needs its own ^ and $ to ensure that it covers all the text.
Now, while you might want to allow _ and -, you may not want the text to start or end with them, so add the two negative look-ahead blocks, (?!^[_-].+) for starts and (?!.+[_-]$) for ends.
If you don't want multiple _ and - in a row, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.
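Because each look-ahead block is an independent condition, the blocks can also be tested one at a time, for example to report which requirement failed. The Java sketch below does that with the four blocks described so far; the class name, method name and rule labels are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class LookaheadRules {
    // Each entry is one look-ahead block from the text, kept separate so that
    // a failure can be reported by name. Insertion order is preserved.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("length 3-15", Pattern.compile("(?=^.{3,15}$)"));
        RULES.put("no leading _ or -", Pattern.compile("(?!^[_-].+)"));
        RULES.put("no trailing _ or -", Pattern.compile("(?!.+[_-]$)"));
        RULES.put("no run of _ or -", Pattern.compile("(?!.*[_-]{2,})"));
    }

    // Returns the label of the first rule the input violates, or null if all rules pass.
    static String firstFailedRule(String input) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            // lookingAt() tries the block at index 0 only, which is where the
            // combined regex would evaluate it.
            if (!rule.getValue().matcher(input).lookingAt()) {
                return rule.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab", "_abc", "abc-", "a__b", "a_b-c"}) {
            System.out.println(s + " -> " + firstFailedRule(s));
        }
    }
}
```

Concatenating the same blocks into one pattern gives the AND behavior described above, at the cost of no longer knowing which block rejected the input.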
If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>[\]{\}|\\\/^~%# :;,$%?\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.
Within the Unicode domain, there may well be other code points or blocks that need to be excluded as well, but certainly far fewer than all the blocks that would have to be included in a whitelist.
The whole regex of all of the above would then be
(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>[\]{}|\\/^~%# :;,$%?\0-\cZ]+$
which you can check out live on https://regex101.com/ for the PCRE (PHP), JavaScript and Python regex engines. I don't know where the Java regex engine fits among those, but you may need to modify the regex to cater for its idiosyncrasies.
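For Java specifically, the combined expression seems to need two adjustments, shown in the sketch below. Both are my adaptation rather than part of the answer: the square brackets inside the character class are escaped, because java.util.regex treats a bare [ inside a class as the start of a nested class, and \x00-\x1A stands in for \0-\cZ, because Java expects octal digits after \0. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UsernameRule {
    // Java adaptation of the combined regex (an assumption, not the author's exact text).
    private static final Pattern RULE = Pattern.compile(
        "(?=^.{3,15}$)"        // 3 to 15 characters in total
        + "(?!^[_-].+)"        // must not start with _ or -
        + "(?!.+[_-]$)"        // must not end with _ or -
        + "(?!.*[_-]{2,})"     // no two of _ or - in a row
        // Blacklist block: listed punctuation, space, and control characters U+0000 to U+001A.
        + "[^<>\\[\\]{}|\\\\/^~%# :;,$?\\x00-\\x1A]+$");

    static boolean isValid(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Non-ASCII test data is written with Unicode escapes to avoid source-encoding issues.
        for (String s : new String[] {"user_name", "_user", "ab", "user<name", "Jos\u00e9-Mar\u00eda"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Since nothing is whitelisted, accented and non-Latin letters are accepted without having to be enumerated, which is the point the answer is making.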
If you want to include spaces, but not _, just swap them everywhere in the regex.
The most useful application for this technique is the pattern attribute of HTML input fields, where a single expression is required that returns false on failure, making the field invalid, allowing input:invalid CSS to highlight it, and stopping the form from being submitted.
Answer by Dharmender Tuli
Use this one:
^(?=[a-zA-Z0-9~@#$^*()_+=[\]{}|\,.?: -]*$)(?!.*[<>'"/;`%])