Java: how to check if character belongs to a specific unicode block?
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/404733/
Asked by IddoG
I need to identify what natural language my input belongs to.
The goal is to distinguish between Arabic and English words in a mixed input, where the input is Unicode and is extracted from XML text nodes.
I have noticed the class Character.UnicodeBlock. Is it related to my problem? How can I get it to work?
Edit: The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters.
So now I am using the matches() method of the String object with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.
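One way to keep the block-based approach for English is to combine it with Character.isLetter, which filters out the digits, symbols and control characters that BASIC_LATIN also contains. The following is only a sketch; the class and method names are made up for illustration and are not from the original post:

```java
class BasicLatinLetters {
    // True if c is a letter located in the BASIC_LATIN block (in practice A-Z or a-z).
    static boolean isBasicLatinLetter(char c) {
        return Character.isLetter(c)
                && Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    // True if the word is non-empty and consists only of such letters.
    static boolean isEnglishWord(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); i++) {
            if (!isBasicLatinLetter(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isEnglishWord("hello"));
        System.out.println(isEnglishWord("hello!"));
    }
}
```

This accepts the same strings as "[A-Za-z]+", so it is mainly of interest if the block check is already in place for the Arabic side.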
Answered by Dennis C
Yes, you can simply use Character.UnicodeBlock.of(char). It returns the block that contains the character, which you can compare against constants such as Character.UnicodeBlock.ARABIC.
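As a rough illustration of that call, the sketch below tests whether a single char lies in the main Arabic block. The class name ArabicBlockCheck and the method isArabic are hypothetical, and the check deliberately ignores the separate Arabic presentation-form blocks:

```java
class ArabicBlockCheck {
    // True if c lies in the main ARABIC block (U+0600 to U+06FF).
    // Assumption: presentation forms (ARABIC_PRESENTATION_FORMS_A/B) are not needed.
    static boolean isArabic(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isArabic('\u0645')); // ARABIC LETTER MEEM
        System.out.println(isArabic('a'));
    }
}
```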
Answered by Alan Moore
If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin1 block (including accented letters and ligatures), you can use this:
Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");
That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
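The same intersection can also be written with an explicit code point range in place of the \p{L1} shorthand, which may be preferable since the range syntax is part of the documented character-class grammar. A minimal sketch, assuming that "Latin1" here means U+0000 to U+00FF; the class and method names are invented for the example:

```java
import java.util.regex.Pattern;

class Latin1Letters {
    // Unicode letters (\p{L}) intersected with the range U+0000..U+00FF.
    static final Pattern LATIN1_WORD = Pattern.compile("[\\p{L}&&[\\x00-\\xFF]]+");

    // True if the whole string consists of one or more Latin-1 letters.
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Word("caf\u00e9")); // "café"
        System.out.println(isLatin1Word("abc1"));
    }
}
```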
Answered by nwellnhof
The Unicode Script property is probably more useful. In Java, it can be looked up using the java.lang.Character.UnicodeScript class:
Character.UnicodeScript script = Character.UnicodeScript.of(c);
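Applied to the original question, the script lookup could be used to label each word as Arabic or Latin. The sketch below is one possible way to do that, not the answer's own code: ScriptClassifier and scriptOf are hypothetical names, and non-letters are skipped because digits and punctuation usually report the COMMON script.

```java
import java.lang.Character.UnicodeScript;

class ScriptClassifier {
    // Returns the script shared by all letters in the word,
    // or null if the word has no letters or mixes scripts.
    static UnicodeScript scriptOf(String word) {
        UnicodeScript found = null;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, whitespace
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (found == null) {
                found = script;
            } else if (found != script) {
                return null; // mixed scripts
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("hello"));
        System.out.println(scriptOf("\u0645\u0631\u062d\u0628\u0627"));
    }
}
```

A caller could then compare the result against UnicodeScript.ARABIC or UnicodeScript.LATIN. As with blocks, LATIN identifies the writing system, not the English language.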
Answered by james.garriss
English characters tend to be in these 4 Unicode blocks:
ArrayList<Character.UnicodeBlock> english = new ArrayList<>();
english.add(Character.UnicodeBlock.BASIC_LATIN);
english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);
So if you have a String, you can loop over all the characters and see what Unicode block each character is in:
for (char currentChar : myString.toCharArray())
{
    Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
    if (english.contains(unicodeBlock))
    {
        // This character is English
    }
}
If every character passes this test, then you know the string contains only characters used in English. You could repeat this for any language; you'll just have to figure out which Unicode blocks each language uses.
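The "all characters pass" step can be wrapped in a small helper that takes the set of allowed blocks as a parameter. This is a sketch under the same assumptions as the answer's four-block list; BlockFilter, ENGLISH_BLOCKS and allInBlocks are illustrative names only:

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BlockFilter {
    // The four blocks listed in the answer above.
    static final Set<UnicodeBlock> ENGLISH_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.LATIN_EXTENDED_A,
            UnicodeBlock.GENERAL_PUNCTUATION));

    // True if every code point of text falls in one of the given blocks.
    // Note: an empty string trivially returns true.
    static boolean allInBlocks(String text, Set<UnicodeBlock> blocks) {
        return text.codePoints().allMatch(cp -> blocks.contains(UnicodeBlock.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(allInBlocks("Hello, world!", ENGLISH_BLOCKS));
        System.out.println(allInBlocks("\u0645\u0631\u062d\u0628\u0627", ENGLISH_BLOCKS));
    }
}
```

Passing a different set, for example one containing UnicodeBlock.ARABIC, reuses the same method for another language.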
Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or another language whose characters overlap heavily with English.
There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you.

