Notice: This page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/404733/

Date: 2020-10-29 12:15:01  Source: igfitidea

Java: how to check if character belongs to a specific unicode block?

Tags: java, regex, unicode, char

Asked by IddoG

I need to identify what natural language my input belongs to. The goal is to distinguish between Arabic and English words in a mixed input, where the input is Unicode and is extracted from XML text nodes. I have noticed the class Character.UnicodeBlock. Is it related to my problem? How can I get it to work?

Edit: The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters. So now I am using the matches() method of the String object with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.

Answered by Dennis C

Yes, you can simply use Character.UnicodeBlock.of(char)
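
As a minimal sketch of that call applied to the Arabic half of the question: the class name `ArabicBlockCheck` and the method `isArabic` are hypothetical, introduced only for illustration. It checks only the basic `ARABIC` block (U+0600..U+06FF); Arabic text can also use other blocks such as `ARABIC_PRESENTATION_FORMS_A`, which this sketch deliberately ignores.

```java
final class ArabicBlockCheck {
    /** Returns true if the character lies in the basic Arabic block (U+0600..U+06FF). */
    static boolean isArabic(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isArabic('\u0633')); // ARABIC LETTER SEEN, prints true
        System.out.println(isArabic('s'));      // Latin letter, prints false
    }
}
```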

Answered by Alan Moore

If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin1 block (including accented letters and ligatures), you can use this:

Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");

That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
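
A small sketch of how that intersection behaves on whole words. The class name `Latin1Letters` and the method `isLatin1Word` are made up for this example, and the backslashes are doubled because the pattern sits inside a Java string literal. It assumes, as the answer does, that `\p{L1}` selects the Latin-1 range U+0000..U+00FF.

```java
import java.util.regex.Pattern;

final class Latin1Letters {
    // Letters (\pL) intersected with the Latin-1 range (\p{L1}).
    private static final Pattern LATIN1_WORD = Pattern.compile("[\\pL&&\\p{L1}]+");

    /** Returns true if the whole string consists of one or more Latin-1 letters. */
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Word("hello"));
        System.out.println(isLatin1Word("caf\u00e9"));                 // e with acute accent
        System.out.println(isLatin1Word("abc123"));                    // digits are not letters
        System.out.println(isLatin1Word("\u0633\u0644\u0627\u0645"));  // Arabic letters
    }
}
```

Under that assumption, the first two calls should report a match and the last two should not: digits fail the letter test, and Arabic letters fall outside the Latin-1 range.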

Answered by nwellnhof

The Unicode Script property is probably more useful. In Java, it can be looked up using the java.lang.Character.UnicodeScript class:

Character.UnicodeScript script = Character.UnicodeScript.of(c);
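
One way this lookup might be applied to a word from the mixed input is sketched below. The class `ScriptClassifier` and the method `scriptOf` are hypothetical names, and the policy of skipping non-letters and returning `null` for mixed-script words is an assumption of this sketch, not something stated in the answer.

```java
final class ScriptClassifier {
    /**
     * Returns the script shared by all letters in the word,
     * or null if the word mixes scripts or contains no letters.
     */
    static Character.UnicodeScript scriptOf(String word) {
        Character.UnicodeScript found = null;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // digits and punctuation are skipped
            }
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (found == null) {
                found = script;
            } else if (found != script) {
                return null; // letters from more than one script
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("hello"));                    // prints LATIN
        System.out.println(scriptOf("\u0633\u0644\u0627\u0645")); // prints ARABIC
    }
}
```

Unlike the block-based check, the script property ignores where a letter sits in the code space, so accented Latin letters from any Latin block report `LATIN`.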

Answered by james.garriss

English characters tend to be in these 4 Unicode blocks:

ArrayList<Character.UnicodeBlock> english = new ArrayList<>();
english.add(Character.UnicodeBlock.BASIC_LATIN);
english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);

So if you have a String, you can loop over all the characters and see what Unicode block each character is in:

for (char currentChar : myString.toCharArray())  
{
    Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
    if (english.contains(unicodeBlock))
    {
        // This character is English
    }
}

If every character passes this check, you know the string contains only characters used in English. You could repeat this for any language; you just have to figure out which Unicode blocks each language uses.
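
The list and the loop above could be combined into a single self-contained check, roughly as follows. The class name `EnglishBlockCheck` and the method `usesOnlyEnglishBlocks` are invented for this sketch; it uses a `HashSet` instead of the `ArrayList` purely for lookup convenience, and it treats an empty string as passing, which may or may not suit a given application.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class EnglishBlockCheck {
    // The four blocks named in the answer.
    private static final Set<Character.UnicodeBlock> ENGLISH_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.BASIC_LATIN,
            Character.UnicodeBlock.LATIN_1_SUPPLEMENT,
            Character.UnicodeBlock.LATIN_EXTENDED_A,
            Character.UnicodeBlock.GENERAL_PUNCTUATION));

    /** Returns true if every char of the text falls in one of the blocks above. */
    static boolean usesOnlyEnglishBlocks(String text) {
        for (char c : text.toCharArray()) {
            if (!ENGLISH_BLOCKS.contains(Character.UnicodeBlock.of(c))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(usesOnlyEnglishBlocks("Hello, world!"));       // prints true
        System.out.println(usesOnlyEnglishBlocks("Hello \u0633\u0644"));  // prints false
    }
}
```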

Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or other languages whose characters have a lot of overlap with English.

There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you:

https://code.google.com/p/language-detection/

Answered by Fernando Miguélez

You have the opposite problem to this one, but ironically, what doesn't work for him should work great for you: just look for English words (ASCII-compatible characters only) with the regular expression "\w".
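
A rough sketch of that idea, pulling the ASCII words out of a mixed string. The names `AsciiWordFinder` and `asciiWords` are hypothetical. It relies on Java's default behavior, where `\w` is equivalent to `[a-zA-Z_0-9]` unless the `UNICODE_CHARACTER_CLASS` flag is set, so digits and underscores also count as word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsciiWordFinder {
    // Default \w is ASCII-only, so Arabic letters never match.
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    /** Returns every maximal run of ASCII word characters, in order of appearance. */
    static List<String> asciiWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = ASCII_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Arabic word between two English words
        System.out.println(asciiWords("hello \u0633\u0644\u0627\u0645 world")); // prints [hello, world]
    }
}
```

If digits and underscores should not count as English, the question's own `[A-Za-z]+` is the stricter choice.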
