Original question: http://stackoverflow.com/questions/3103344/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How do I match Unicode characters in Java?
Asked by ankimal
I'm trying to match Unicode characters in Java.
Input string: informa
String to match: informátion
So far I've tried this:
Pattern p = Pattern.compile("informa[\u0000-\uffff].*",
        Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}
It prints "No match". Any ideas?
Answered by BalusC
The term "Unicode characters" is not specific enough. It would match every character in the Unicode range, and thus also "normal" characters. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".
In regex terms that would be [^\x20-\x7E].
// Backslashes must be doubled inside a Java string literal.
boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
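A self-contained sketch of the same check may make it easier to try out. The class and method names here (NonAsciiCheck, containsNonPrintableAscii) are made up for illustration and are not part of the answer. It uses Matcher.find() instead of wrapping the class in .*, which is an assumption about what is wanted rather than the answer's exact approach.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class NonAsciiCheck {
    // One character outside the printable ASCII range (space through tilde).
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    // Returns true if the string contains at least one such character.
    static boolean containsNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("informa"));          // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
    }
}
```

Using find() avoids the surrounding .*, which by default does not match line terminators, so the check also behaves as expected on multi-line input.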
Depending on what you'd like to do with this information, here are some useful follow-up answers:
Answered by Austin Fitzpatrick
Is it because "informa" isn't a substring of "informátion" at all?
How would your code work if you removed the last "a" from "informa" in your regex?
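The point can be checked with a small sketch. It assumes the input contains the precomposed character á (U+00E1) and uses no regex flags; the class and method names (PrefixMatch, matchesPrefix) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class PrefixMatch {
    // True if the candidate starts with the literal prefix, followed by anything.
    static boolean matchesPrefix(String prefix, String candidate) {
        Pattern p = Pattern.compile(Pattern.quote(prefix) + ".*", Pattern.DOTALL);
        return p.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String s = "inform\u00e1tion"; // the seventh character is the precomposed a-acute

        // The plain 'a' in the pattern does not equal the accented letter.
        System.out.println(matchesPrefix("informa", s)); // false

        // Without the trailing 'a', the rest of the string is consumed by .*
        System.out.println(matchesPrefix("inform", s));  // true
    }
}
```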
Answered by james.garriss
It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD, strip out the diacritical marks, and then do your search.
String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
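Put together with a search step, the approach above might look like the following sketch. The class and method names are hypothetical, and two choices are assumptions rather than part of the answer: the search is a simple prefix test, and case is folded with toLowerCase(Locale.ROOT) because the question asked for case-insensitive matching.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
final class DiacriticInsensitiveSearch {
    // Decompose to NFD, then drop the combining marks (U+0300 to U+036F).
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Assumption: "search" here means a case-insensitive prefix test.
    static boolean startsWithIgnoringDiacritics(String text, String prefix) {
        String haystack = stripDiacritics(text).toLowerCase(Locale.ROOT);
        String needle = stripDiacritics(prefix).toLowerCase(Locale.ROOT);
        return haystack.startsWith(needle);
    }

    public static void main(String[] args) {
        String s = "inform\u00e1tion"; // contains the precomposed a-acute
        System.out.println(stripDiacritics(s));                         // information
        System.out.println(startsWithIgnoringDiacritics(s, "informa")); // true
    }
}
```

Note that this only removes marks in the Combining Diacritical Marks block; letters that do not decompose under NFD, such as ø, are left unchanged.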
To learn more about NFD:

