Java: How to classify Japanese characters as either kanji or kana?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3826918/

Time: 2020-10-30 03:33:49  Source: igfitidea

How to classify Japanese characters as either kanji or kana?

Tags: java, unicode, cjk

Asked by alex2k8

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji 
上 - kanji 
記 - kanji 
こ - kana 
れ - kana
ら - kana
の - kana
フ - kana

(Sorry if I did it incorrectly.)

Answered by Josh Lee

This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ﾌ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION

But, as always, the devil is in the details:

Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS

where Ａ is the full-width character. So it is in the same block as the half-width katakana ﾌ above. Note that the full-width Ａ is different from the normal (half-width) A:

Character.UnicodeBlock.of('A') == BASIC_LATIN
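
As a minimal sketch of how the block lookup could be turned into the kanji/kana listing from the question: the class name BlockClassifier, the labels, and the choice to count U+FF66..U+FF9F (the half-width katakana part of the half-width/full-width block) as kana are assumptions made for illustration, not part of the answer. The sample string is written with Unicode escapes so the file compiles regardless of the source encoding.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not from the original answer.
final class BlockClassifier {

    /** Returns "kanji", "kana", or "other" for a single UTF-16 char. */
    static String classify(char c) {
        UnicodeBlock block = UnicodeBlock.of(c);
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return "kanji";
        }
        if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
            return "kana";
        }
        // The half-width/full-width block mixes half-width katakana with
        // full-width Latin letters and punctuation, so narrow it down to
        // U+FF66..U+FF9F (assumption: treat only this sub-range as kana).
        if (block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && c >= 0xFF66 && c <= 0xFF9F) {
            return "kana";
        }
        return "other";
    }

    public static void main(String[] args) {
        // The sample text from the question, as Unicode escapes.
        String text = "\u8AB0\u304B\u78BA\u8A8D\u4E0A\u8A18\u3053\u308C\u3089\u306E\u30D5";
        for (char c : text.toCharArray()) {
            System.out.println(c + " - " + classify(c));
        }
    }
}
```

Because this works on individual char values, kanji outside the Basic Multilingual Plane (stored as surrogate pairs) would come out as "other".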

Answered by Hyman

Use a table like this one to determine which Unicode values are used for katakana and kanji; then you can simply cast the character to an int and check which range it falls in, something like:

int val = (int) 'て';
// 0x3040-0x309f is the Hiragana block; Katakana is 0x30a0-0x30ff
if (val >= 0x3040 && val <= 0x309f)
  return HIRAGANA;
..
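
A self-contained version of this range check might look like the sketch below. The class name RangeClassifier and the returned labels are hypothetical; the ranges are the Unicode block boundaries for Hiragana (U+3040..U+309F), Katakana (U+30A0..U+30FF), and CJK Unified Ideographs (U+4E00..U+9FFF), which is an assumption about what the linked table contains.

```java
// Hypothetical helper illustrating a plain range check on code points.
final class RangeClassifier {

    static String classify(int codePoint) {
        if (codePoint >= 0x3040 && codePoint <= 0x309F) {
            return "hiragana";
        }
        if (codePoint >= 0x30A0 && codePoint <= 0x30FF) {
            return "katakana";
        }
        // Main CJK Unified Ideographs block only; extension blocks are ignored here.
        if (codePoint >= 0x4E00 && codePoint <= 0x9FFF) {
            return "kanji";
        }
        return "other";
    }

    public static void main(String[] args) {
        // U+3066 is the hiragana letter "te" used in the snippet above.
        System.out.println(classify(0x3066));
        // A char widens to int implicitly, so no cast is needed.
        System.out.println(classify('A'));
    }
}
```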

Answered by ColinD

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Hyman's answer, I created this:

import com.google.common.base.CharMatcher;

public class JapaneseCharMatchers {
  public static final CharMatcher HIRAGANA = 
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  public static final CharMatcher KATAKANA = 
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  public static final CharMatcher KANJI = 
      CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected output:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

This gives you a lot of power for working with Japanese text: the rules for deciding whether a character belongs to one of these groups live in an object that can not only do a lot of useful things itself, but can also be used with other APIs such as Guava's Splitter class.
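
For readers without Guava on the classpath, roughly the same composable style can be sketched with java.util.function.IntPredicate from the standard library. This is an illustration, not a replacement for CharMatcher: the class JapanesePredicates and the helpers inRange and retainFrom are hypothetical names chosen to mirror the code above, and the ranges are copied from it.

```java
import java.util.function.IntPredicate;

// Hypothetical stdlib-only analogue of the matchers above.
final class JapanesePredicates {

    /** Matches code points in the inclusive range [lo, hi]. */
    static IntPredicate inRange(int lo, int hi) {
        return cp -> cp >= lo && cp <= hi;
    }

    static final IntPredicate HIRAGANA = inRange(0x3040, 0x309F);
    static final IntPredicate KATAKANA = inRange(0x30A0, 0x30FF);
    static final IntPredicate KANA = HIRAGANA.or(KATAKANA);
    static final IntPredicate KANJI = inRange(0x4E00, 0x9FAF);

    /** Keeps only the code points of text accepted by matcher. */
    static String retainFrom(IntPredicate matcher, String text) {
        return text.codePoints()
                .filter(matcher)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The sample text from the question, as Unicode escapes.
        String text = "\u8AB0\u304B\u78BA\u8A8D\u4E0A\u8A18\u3053\u308C\u3089\u306E\u30D5";
        System.out.println("Hiragana: " + retainFrom(HIRAGANA, text));
        System.out.println("Katakana: " + retainFrom(KATAKANA, text));
        System.out.println("Kana: " + retainFrom(KANA, text));
        System.out.println("Kanji: " + retainFrom(KANJI, text));
    }
}
```

Predicates compose with or, and, and negate, which covers the combination step shown above; the splitting and trimming helpers that CharMatcher integrates with are not reproduced here.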

Edit:

Based on jleedev's answer, you could also write a method like:

public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    public boolean matches(char c) {
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:

CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.

Answered by mP.

You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.
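
One possible shortcut, offered as an assumption rather than something this answer proposes: since Java 7 the JDK ships Character.UnicodeScript, which maps a code point to its script, so the ranges do not have to be copied by hand. The sketch below (the class name ScriptClassifier and the labels are hypothetical) treats the HAN script as kanji and the HIRAGANA and KATAKANA scripts as kana. It works on code points, so ideographs outside the Basic Multilingual Plane are handled too.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper based on Unicode scripts instead of hard-coded ranges.
final class ScriptClassifier {

    static String classify(int codePoint) {
        switch (UnicodeScript.of(codePoint)) {
            case HAN:
                return "kanji";
            case HIRAGANA:
            case KATAKANA:
                return "kana";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // The sample text from the question, as Unicode escapes.
        String text = "\u8AB0\u304B\u78BA\u8A8D\u4E0A\u8A18\u3053\u308C\u3089\u306E\u30D5";
        // Iterate over code points rather than chars so that surrogate
        // pairs are classified as one character.
        text.codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```

Some marks shared between scripts are reported as COMMON and would fall into "other" here, so punctuation may need separate handling.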

Answered by aevanko

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:

Here's a function that will do it. It will break down the sentence like you have above into a single cell. You might need to add some error checking for how you want to deal with line breaks or English characters, etc. but this should be a good start.

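Function KanjiKanaBreakdown(ByVal text As String) As String

Application.ScreenUpdating = False
Dim kanjiCode As Long
Dim ch As String
Dim result As String
Dim i As Long

For i = 1 To Len(text)
    ch = Mid$(text, i, 1)
    ' Asc returns the code in the system ANSI code page as a signed value.
    ' The bounds below appear to match the Shift-JIS kanji range, so this
    ' assumes a Japanese system locale.
    kanjiCode = Asc(ch)
    If kanjiCode > -30562 And kanjiCode < -950 Then
        result = result & ch & " - kanji" & vbLf
    Else
        ' Everything else is labelled kana, including Latin letters and punctuation.
        result = result & ch & " - kana" & vbLf
    End If
Next

KanjiKanaBreakdown = result
Application.ScreenUpdating = True

End Function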