Java: How to classify Japanese characters as kanji or kana?

Question

Asked by alex2k8

Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji 
上 - kanji 
記 - kanji 
こ - kana 
れ - kana
ら - kana
の - kana
フ - kana

(Sorry if I did it incorrectly.)

Answer 1

Answered by Josh Lee

This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA
Character.UnicodeBlock.of('フ') == KATAKANA
Character.UnicodeBlock.of('ｶ') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('！') == HALFWIDTH_AND_FULLWIDTH_FORMS
Character.UnicodeBlock.of('。') == CJK_SYMBOLS_AND_PUNCTUATION

But, as always, the devil is in the details:

Character.UnicodeBlock.of('Ａ') == HALFWIDTH_AND_FULLWIDTH_FORMS

where Ａ is the full-width character. So it falls in the same block as the half-width katakana ｶ above. Note that the full-width Ａ is different from the normal (half-width) A:

Character.UnicodeBlock.of('A') == BASIC_LATIN
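
As a minimal sketch of how this block lookup could produce the per-character listing asked for in the question: the class name KanaKanjiClassifier, the method classify, and the "other" label are illustrative assumptions, not part of the original answer.

```java
import java.lang.Character.UnicodeBlock;

public class KanaKanjiClassifier {

    // Returns "kanji", "kana", or "other" for a single UTF-16 char,
    // based only on the Unicode block the char belongs to.
    static String classify(char c) {
        UnicodeBlock block = UnicodeBlock.of(c);
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return "kanji";
        }
        if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
            return "kana";
        }
        // Anything else (Latin, punctuation, half-width forms, ...) is not classified here.
        return "other";
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        for (char c : text.toCharArray()) {
            System.out.println(c + " - " + classify(c));
        }
    }
}
```

Because half-width katakana live in HALFWIDTH_AND_FULLWIDTH_FORMS, this sketch reports them as "other"; whether that is acceptable depends on the input you expect.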

Answer 2

Answered by Hyman

Use a table like this one to determine which Unicode values are used for katakana and kanji; then you can simply cast the character to an int and check which range it falls in, something like:

int val = (int) 'て';
// 0x3040-0x309f is the hiragana range; katakana is 0x30a0-0x30ff
if (val >= 0x3040 && val <= 0x309f)
  return HIRAGANA;
..
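
A fuller sketch of the range-check idea might look like the following. The class name KanaRangeCheck, the Kind enum, and the method kindOf are hypothetical names chosen for illustration. The kana ranges are the ones quoted elsewhere in this thread; the kanji range assumes the bounds of the CJK Unified Ideographs block (0x4e00 to 0x9fff), so ideographs in extension blocks are not covered.

```java
public class KanaRangeCheck {

    // Hypothetical result type for this sketch.
    enum Kind { HIRAGANA, KATAKANA, KANJI, OTHER }

    // Classifies a char by comparing its numeric value against fixed ranges.
    static Kind kindOf(char c) {
        int val = c; // a char widens to its UTF-16 code unit value
        if (val >= 0x3040 && val <= 0x309f) {
            return Kind.HIRAGANA;
        }
        if (val >= 0x30a0 && val <= 0x30ff) {
            return Kind.KATAKANA;
        }
        // Assumption: bounds of the CJK Unified Ideographs block only.
        if (val >= 0x4e00 && val <= 0x9fff) {
            return Kind.KANJI;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        for (char c : "誰か確認上記これらのフ".toCharArray()) {
            System.out.println(c + " - " + kindOf(c));
        }
    }
}
```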

Answer 3

Answered by ColinD

This seems like it'd be an interesting use for Guava's CharMatcher class. Using the tables linked in Hyman's answer, I created this:

import com.google.common.base.CharMatcher;

public class JapaneseCharMatchers {
  public static final CharMatcher HIRAGANA = 
      CharMatcher.inRange((char) 0x3040, (char) 0x309f);

  public static final CharMatcher KATAKANA = 
      CharMatcher.inRange((char) 0x30a0, (char) 0x30ff);

  public static final CharMatcher KANA = HIRAGANA.or(KATAKANA);

  public static final CharMatcher KANJI = 
      CharMatcher.inRange((char) 0x4e00, (char) 0x9faf);

  public static void main(String[] args) {
    test("誰か確認上記これらのフ");
  }

  private static void test(String string) {
    System.out.println(string);
    System.out.println("Hiragana: " + HIRAGANA.retainFrom(string));
    System.out.println("Katakana: " + KATAKANA.retainFrom(string));
    System.out.println("Kana: " + KANA.retainFrom(string));
    System.out.println("Kanji: " + KANJI.retainFrom(string));
  }
}

Running this prints the expected output:

誰か確認上記これらのフ
Hiragana: かこれらの
Katakana: フ
Kana: かこれらのフ
Kanji: 誰確認上記

This gives you a lot of power for working with Japanese text: the rules for deciding whether a character belongs to one of these groups live in an object that can do many useful things itself and can also be used with other APIs, such as Guava's Splitter class.

Edit:

Based on jleedev's answer, you could also write a method like:

public static CharMatcher inUnicodeBlock(final Character.UnicodeBlock block) {
  return new CharMatcher() {
    @Override
    public boolean matches(char c) {
      return Character.UnicodeBlock.of(c) == block;
    }
  };
}

and use it like:

CharMatcher HIRAGANA = inUnicodeBlock(Character.UnicodeBlock.HIRAGANA);

I think this might be a bit slower than the other version though.

Answer 4

Answered by mP.

You need to get a reference that gives the separate ranges for kana and kanji characters. From what I've seen, alphabets and equivalents typically get a block of characters.
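
One way to avoid maintaining such ranges by hand, assuming Java 7 or later (Java 8 for the codePoints() call in main), is to ask the JDK for the Unicode script of each code point. The sketch below is illustrative only: the class name ScriptClassifier and the labels are assumptions, and script membership is not identical to block membership, so results for punctuation and marks shared between scripts may differ from the block-based approaches.

```java
import java.lang.Character.UnicodeScript;

public class ScriptClassifier {

    // Classifies a code point by its Unicode script rather than by hard-coded ranges.
    static String classify(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        switch (script) {
            case HAN:
                return "kanji";
            case HIRAGANA:
            case KATAKANA:
                return "kana";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        String text = "誰か確認上記これらのフ";
        // Iterate by code point so supplementary characters are not split into surrogates.
        text.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " - " + classify(cp)));
    }
}
```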

Answer 5

Answered by aevanko

I know you didn't ask for VBA, but here is the VBA flavor for those who want to know:

Here's a function that will do it. It breaks down a sentence like the one above into a single cell. You might need to add some error checking for how you want to deal with line breaks, English characters, and so on, but this should be a good start.

Function KanjiKanaBreakdown(ByVal text As String) As String

Application.ScreenUpdating = False
Dim kanjiCode As Long
Dim result As String
Dim i As Long

For i = 1 To Len(text)
    If Asc(Mid$(text, i, 1)) > -30562 And Asc(Mid$(text, i, 1)) < -950 Then
        result = (result & (Mid$(text, i, 1)) & (" - kanji") & vbLf)
    Else
        result = (result & (Mid$(text, i, 1)) & (" - kana") & vbLf)
    End If
Next

KanjiKanaBreakdown = result
Application.ScreenUpdating = True

End Function
