Java: How to determine whether a string is English or Arabic?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/15107313/
Asked by Victor S
Is there a way to determine whether a string is English or Arabic?
Answer by Eyal Schneider
Here is a simple piece of logic that I just tried:
public static boolean isProbablyArabic(String s) {
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        if (c >= 0x0600 && c <= 0x06E0)
            return true;
        i += Character.charCount(c);
    }
    return false;
}
It declares the text as Arabic if and only if an Arabic Unicode code point is found in the text. You can enhance this logic to better suit your needs.
The range 0600 - 06E0 covers most Arabic characters and symbols (see the Unicode tables); the Arabic block as a whole extends to 06FF.
Answer by Gaurav Tyagi
Java itself supports checking characters against Unicode blocks, including the Arabic block. A simpler and shorter way to do the same check is with UnicodeBlock:
public static boolean textContainsArabic(String text) {
    for (char charac : text.toCharArray()) {
        if (Character.UnicodeBlock.of(charac) == Character.UnicodeBlock.ARABIC) {
            return true;
        }
    }
    return false;
}
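As a variation on the block check above (not part of the original answer), one could test the Unicode script of each code point instead of its block. This is a minimal sketch; the class and method names are hypothetical. It relies on `Character.UnicodeScript`, available since Java 7, and on `String.codePoints()`, available since Java 8. Note that some characters used in Arabic text, such as the Arabic comma, are assigned to the Common script, so this check is aimed at letters rather than punctuation.

```java
class ArabicScriptCheck {
    /** Returns true if any code point in the text belongs to the Arabic script. */
    static boolean containsArabicScript(String text) {
        // codePoints() iterates whole code points, so surrogate pairs are handled correctly
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }
}
```

Because the script property is not tied to a single block, this approach does not require listing each Arabic-related range by hand.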
Answer by RamDroid
A minor change to cover the full range of Arabic characters and symbols:
private boolean isArabic(String text) {
    // Remove spaces so that whitespace between words is ignored
    String textWithoutSpace = text.trim().replaceAll(" ", "");
    for (int i = 0; i < textWithoutSpace.length();) {
        int c = textWithoutSpace.codePointAt(i);
        // The Arabic block spans 0x0600 to 0x06FF.
        // Arabic Presentation Forms-B (which includes ligatures such as lam-alef) spans 0xFE70 to 0xFEFF.
        if ((c >= 0x0600 && c <= 0x06FF) || (c >= 0xFE70 && c <= 0xFEFF))
            i += Character.charCount(c);
        else
            return false;
    }
    return true;
}
Answer by paxdiablo
You can usually tell by the code points within the string itself. Arabic occupies certain blocks in the Unicode code space.
It's a fairly safe bet that, if a substantial proportion of the characters fall in those blocks, it's Arabic text.
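The "substantial proportion" idea can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 0.5 threshold is an arbitrary choice rather than anything the answer specifies, and the script property (`Character.UnicodeScript`) is used in place of explicit block ranges.

```java
class ArabicRatio {
    /** Fraction of letters in the text that belong to the Arabic script (0.0 if there are no letters). */
    static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Count only letters, so digits, spaces and punctuation do not skew the ratio
            if (Character.isLetter(cp)) {
                letters++;
                if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                    arabic++;
                }
            }
            i += Character.charCount(cp);
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /** Assumed threshold: treat the text as Arabic when more than half of its letters are Arabic. */
    static boolean isMostlyArabic(String text) {
        return arabicLetterRatio(text) > 0.5;
    }
}
```

A ratio tolerates the occasional foreign word or brand name inside Arabic text, which the "any character" and "all characters" checks in the other answers do not. The threshold would need tuning for real data.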
Answer by Pranav V R
English characters tend to be in these 4 Unicode blocks:
- BASIC_LATIN
- LATIN_1_SUPPLEMENT
- LATIN_EXTENDED_A
- GENERAL_PUNCTUATION
public static boolean isEnglish(String text) {
    for (char character : text.toCharArray()) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(character);
        boolean english = block == Character.UnicodeBlock.BASIC_LATIN
                || block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                || block == Character.UnicodeBlock.LATIN_EXTENDED_A
                || block == Character.UnicodeBlock.GENERAL_PUNCTUATION;
        if (!english) {
            // A single character outside the four blocks means the text is not English-only
            return false;
        }
    }
    // Empty text is not treated as English
    return !text.isEmpty();
}
Answer by Iman Marashi
This answer is somewhat correct, but when Farsi and English letters are combined it returns TRUE, which is wrong. Here I modified the same method so that it works correctly:
public static boolean isProbablyArabic(String s) {
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        if (!(c >= 0x0600 && c <= 0x06E0))
            return false;
        i += Character.charCount(c);
    }
    return true;
}
Answer by Basile Starynkevitch
You could use N-gram-based text categorization (search for that phrase), but it is not a foolproof technique, and it may require a reasonably long string.
You might also decide that a string with only ASCII letters is not Arabic.
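The ASCII shortcut could look like the sketch below. The class and method names are hypothetical, and the check is slightly broader than the answer's wording: it accepts any ASCII character (letters, digits, punctuation, whitespace), not only letters.

```java
class AsciiCheck {
    /** Returns true if every char in the text is in the ASCII range (0x00 to 0x7F). */
    static boolean isAsciiOnly(String text) {
        // chars() yields UTF-16 code units; surrogates are above 0x7F, so they fail the test as intended
        return text.chars().allMatch(c -> c <= 0x7F);
    }
}
```

A string that passes this check cannot contain Arabic characters, so it can be ruled out before any more expensive categorization runs. Note that an empty string passes as well, so callers may want to handle that case separately.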
Answer by Saeid
Try this (note that this snippet is C#, not Java):
internal static bool ContainsArabicLetters(string text)
{
    foreach (char character in text.ToCharArray())
    {
        if (character >= 0x600 && character <= 0x6ff)
            return true;
        if (character >= 0x750 && character <= 0x77f)
            return true;
        if (character >= 0xfb50 && character <= 0xfc3f)
            return true;
        if (character >= 0xfe70 && character <= 0xfefc)
            return true;
    }
    return false;
}