Java: how to determine whether a string is English or Arabic?

Question

Asked by Victor S

Is there a way to determine whether a string is English or Arabic?

Answer 1

Answered by Eyal Schneider

Here is some simple logic that I just tried:

public static boolean isProbablyArabic(String s) {
    // Walk the string one code point at a time.
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        if (c >= 0x0600 && c <= 0x06E0)
            return true;
        i += Character.charCount(c);
    }
    return false;
}

It declares the text to be Arabic if and only if an Arabic Unicode code point is found in it. You can enhance this logic to better suit your needs.

The range 0600 - 06E0 covers most of the Unicode Arabic block, which spans U+0600 to U+06FF (see the Unicode tables).

Answer 2

Answered by Gaurav Tyagi

Java itself supports checking characters by Unicode block, and Arabic is among the supported blocks. A simpler and shorter way to do the same is with Character.UnicodeBlock:

public static boolean textContainsArabic(String text) {
    for (char charac : text.toCharArray()) {
        if (Character.UnicodeBlock.of(charac) == Character.UnicodeBlock.ARABIC) {
            return true;
        }
    }
    return false;
}
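
As a possible variation on the block check above: Character.UnicodeBlock.ARABIC covers only U+0600 to U+06FF, so characters in the Arabic Supplement and presentation-form blocks are not matched. The sketch below compares by script instead, using Character.UnicodeScript (Java 7+) and String.codePoints() (Java 8+). The class and method names are hypothetical, chosen only for illustration.

```java
final class ArabicScriptCheck {
    private ArabicScriptCheck() {}

    /** Returns true if at least one code point in the text belongs to the Arabic script. */
    static boolean containsArabicScript(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }
}
```

Like the original, this only reports whether any Arabic-script character is present; it says nothing about the language of the rest of the string.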

Answer 3

Answered by RamDroid

A minor change to cover the full range of Arabic characters and symbols:

private boolean isArabic(String text) {
    // Strip spaces so that whitespace does not count as non-Arabic.
    String textWithoutSpace = text.trim().replaceAll(" ", "");
    for (int i = 0; i < textWithoutSpace.length();) {
        int c = textWithoutSpace.codePointAt(i);
        // The Arabic block runs from 0x0600 to 0x06FF.
        // Arabic Presentation Forms-B (which include the lam-alef ligatures)
        // run from 0xFE70 to 0xFEFF.
        if ((c >= 0x0600 && c <= 0x06FF) || (c >= 0xFE70 && c <= 0xFEFF))
            i += Character.charCount(c);
        else
            return false;
    }
    return true;
}

Answer 4

Answered by paxdiablo

You can usually tell by the code points within the string itself. Arabic occupies certain blocks in the Unicode code space.

It's a fairly safe bet that, if a substantial proportion of the characters fall in those blocks, it's Arabic text.
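
The proportion idea can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: it counts only letters, treats a letter as Arabic when Character.UnicodeScript reports the Arabic script, and leaves the threshold to the caller. The class name, method names, and the threshold parameter are all assumptions.

```java
final class ArabicRatio {
    private ArabicRatio() {}

    /** Fraction of the letters in the text that are Arabic-script letters; 0.0 if there are no letters. */
    static double arabicLetterRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC)
                .count();
        return (double) arabic / letters;
    }

    /** True if at least the given fraction of the letters are Arabic-script letters. */
    static boolean isMostlyArabic(String text, double threshold) {
        return arabicLetterRatio(text) >= threshold;
    }
}
```

What counts as a "substantial proportion" depends on the data; a call such as isMostlyArabic(text, 0.5) is only a starting point to tune.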

Answer 5

Answered by Pranav V R

English text tends to use characters from these four Unicode blocks:

BASIC_LATIN
LATIN_1_SUPPLEMENT
LATIN_EXTENDED_A
GENERAL_PUNCTUATION

public static boolean isEnglish(String text) {
    // An empty string contains no English characters.
    if (text.isEmpty()) {
        return false;
    }
    for (char character : text.toCharArray()) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(character);
        if (block != Character.UnicodeBlock.BASIC_LATIN
                && block != Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                && block != Character.UnicodeBlock.LATIN_EXTENDED_A
                && block != Character.UnicodeBlock.GENERAL_PUNCTUATION) {
            // Stop at the first character outside the four blocks;
            // otherwise only the last character would decide the result.
            return false;
        }
    }
    return true;
}

Answer 6

Answered by Iman Marashi

This answer is somewhat correct. But when Farsi and English letters are combined, it returns true, which is wrong. Here I modified the same method so that it works well:

public static boolean isProbablyArabic(String s) {
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        // Reject the string as soon as a code point outside the range is found.
        if (!(c >= 0x0600 && c <= 0x06E0))
            return false;
        i += Character.charCount(c);
    }
    return true;
}

Answer 7

Answered by Basile Starynkevitch

You could use N-gram-based text categorization (search for that phrase), but it is not a fail-proof technique, and it may need a string that is not too short.

You might also decide that a string with only ASCII letters is not Arabic.
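
The ASCII shortcut could look something like the sketch below. It checks for ASCII characters in general rather than letters only, which is a simplifying assumption, and the class and method names are hypothetical.

```java
final class AsciiCheck {
    private AsciiCheck() {}

    /** Returns true if every char in the text is in the ASCII range (0x00 to 0x7F). */
    static boolean isAsciiOnly(String text) {
        return text.chars().allMatch(c -> c < 0x80);
    }
}
```

A string for which isAsciiOnly returns true cannot contain any Arabic letters, so it can be ruled out before running a more expensive classifier. Note that the empty string also counts as ASCII-only here.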

Answer 8

Answered by Saeid

Try this (note that the snippet is C#, not Java):

internal static bool ContainsArabicLetters(string text)
{
    foreach (char character in text.ToCharArray())
    {
        if (character >= 0x600 && character <= 0x6ff)
            return true;
        if (character >= 0x750 && character <= 0x77f)
            return true;
        if (character >= 0xfb50 && character <= 0xfc3f)
            return true;
        if (character >= 0xfe70 && character <= 0xfefc)
            return true;
    }
    return false;
}
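
Since the question is about Java, here is one way the same range checks might be ported. It is a direct translation of the ranges above, with the block names added as comments; the class name is hypothetical.

```java
final class ArabicRanges {
    private ArabicRanges() {}

    /** Returns true if any char falls in one of the four Arabic ranges used above. */
    static boolean containsArabicLetters(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if ((c >= 0x0600 && c <= 0x06FF)      // Arabic
                    || (c >= 0x0750 && c <= 0x077F)   // Arabic Supplement
                    || (c >= 0xFB50 && c <= 0xFC3F)   // first part of Arabic Presentation Forms-A
                    || (c >= 0xFE70 && c <= 0xFEFC)) { // Arabic Presentation Forms-B
                return true;
            }
        }
        return false;
    }
}
```

All four ranges lie in the Basic Multilingual Plane, so iterating over char values rather than code points is sufficient here.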
