Java: Converting Symbols and Accented Letters to the English Alphabet

Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1008802/

Time: 2020-08-11 22:13:30  Source: igfitidea

Converting Symbols, Accent Letters to English Alphabet

Tags: java, unicode, special-characters, diacritics

Asked by AhmetB - Google

The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the similar characters to the letters which are in the English alphabet.

For instance here are a few conversions:

ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...

I saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They look like needles in a haystack.

The complete list of Unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html. Just try scrolling down and see the variations of letters.

How can I convert all these with Java? Please help me :(

Accepted answer by hashable

Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in Java (purely for the purpose of removing diacritical marks, aka accents).

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

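import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    // NFD splits each accented character into its base letter plus combining marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    // The backslash must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    // Remove the combining marks, leaving only the base letters.
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}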

Answer by Dour High Arch

The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, "ß" to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to "B".

Add to that the fact that Unicode has multiple code points for the same glyphs.

The upshot is that the only way to do this is to create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".
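
To illustrate the normalization form KD shortcut and its limits, here is a minimal sketch. The class name NfkdShortcut and the method fold are made up for this example, and stripping everything in the Unicode "Mark" category after decomposition is my assumption about how one might use the shortcut, not something the answer spells out.

```java
import java.text.Normalizer;

// Illustrative only: NfkdShortcut and fold are hypothetical names.
public class NfkdShortcut {

    // Decompose with NFKD, then drop all combining marks (Unicode category M).
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+00C0 (A with grave) decomposes to 'A' plus a combining grave accent.
        System.out.println(fold("\u00C0 la carte"));
        // U+FB01 (fi ligature) has a compatibility decomposition to "fi".
        System.out.println(fold("\uFB01anc\u00E9"));
        // U+00DF (sharp s) has no decomposition, so it passes through unchanged.
        System.out.println(fold("Stra\u00DFe"));
    }
}
```

The last call shows why a table is still needed: characters such as U+00DF or U+00F8 (o with stroke) have no decomposition, so normalization alone leaves them as they are.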

Here is a tiny excerpt from an app that does this:

switch (c)
{
    case 'A':
    case '\u00C0':  //  À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1':  //  Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2':  //  Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";  // no break needed: in Java a statement after return is unreachable

    case '\u00C6':  //  Æ LATIN CAPITAL LIGATURE AE
        return "AE";

    // And so on for pages...
}

Answer by Daniel Vandersluis

You could try using unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point relates to an ASCII character or string.

Answer by Ian

Attempting to "convert them all" is the wrong approach to the problem.

Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language, with their own meaning, sound, etc. Removing those marks is just the same as replacing random letters in an English word. This is before you even go on to consider the Cyrillic languages and other script-based texts such as Arabic, which simply cannot be "converted" to English.

If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input: if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.

By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary-based lookup/replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find-and-replace operations tend to be blindingly quick.
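
A dictionary-based lookup of this kind might look roughly like the sketch below. The class name FoldingTable, the method fold, and the five table entries are all hypothetical and only stand in for whatever reduced character set your application actually needs; mapping U+00DF to "ss" follows the German convention mentioned in another answer.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: FoldingTable, fold and the entries below are hypothetical.
public class FoldingTable {

    // A deliberately tiny table; a real one would list every character
    // the application expects to receive.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00C0', "A");   // A with grave
        TABLE.put('\u00E9', "e");   // e with acute
        TABLE.put('\u00C6', "AE");  // AE ligature
        TABLE.put('\u00DF', "ss");  // sharp s, German convention
        TABLE.put('\u00F8', "o");   // o with stroke
    }

    // Replace each character found in the table; copy everything else as-is.
    public static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = TABLE.get(c);
            sb.append(replacement != null ? replacement : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Stra\u00DFe"));  // sharp s becomes "ss"
        System.out.println(fold("\u00C6sop"));    // ligature becomes "AE"
    }
}
```

Unlike the normalization-based approaches, a table lets one character map to several (or to a culture-specific choice), at the cost of having to enumerate the characters yourself.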

This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps 1 man day to produce, to cover all diacritic marks for all Western European languages.

Answer by RealHowTo

If the need is to convert "òéışöç->oeisoc", you can use this as a starting point:

public class AsciiUtils {
    private static final String PLAIN_ASCII =
      "AaEeIiOoUu"    // grave
    + "AaEeIiOoUuYy"  // acute
    + "AaEeIiOoUuYy"  // circumflex
    + "AaOoNn"        // tilde
    + "AaEeIiOoUuYy"  // umlaut
    + "Aa"            // ring
    + "Cc"            // cedilla
    + "OoUu"          // double acute
    ;

    private static final String UNICODE =
     "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"             
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD" 
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177" 
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF" 
    + "\u00C5\u00E5"                                                             
    + "\u00C7\u00E7" 
    + "\u0150\u0151\u0170\u0171" 
    ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // remove accented characters from a string and replace them with their ASCII equivalents
    public static String convertNonAscii(String s) {
       if (s == null) return null;
       StringBuilder sb = new StringBuilder();
       int n = s.length();
       for (int i = 0; i < n; i++) {
          char c = s.charAt(i);
          int pos = UNICODE.indexOf(c);
          if (pos > -1){
              sb.append(PLAIN_ASCII.charAt(pos));
          }
          else {
              sb.append(c);
          }
       }
       return sb.toString();
    }

    public static void main(String args[]) {
       String s =
         "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
       System.out.println(AsciiUtils.convertNonAscii(s));
       // output :
       // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}

JDK 1.6 provides the java.text.Normalizer class, which can be used for this task.

See an example here

Answer by JacquesB

There is no easy or general way to do what you want, because it is just your subjective opinion that these letters look like the Latin letters you want to convert them to. They are actually separate letters with their own distinct names and sounds, which just happen to superficially look like a Latin letter.

If you want that conversion, you have to create your own translation table based on which Latin letters you think the non-Latin letters should be converted to.

(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)

Answer by Joachim Sauer

Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and does not follow any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.

You will need to build the mapping of Unicode characters into the Latin characters which they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints. But I think the effort for this would be greater than manually building that mapping, especially if you have a good number of examples from which you can build your mapping.

To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the Latin characters which they resemble.

Examples:

  • "ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
  • "Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
  • "ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any Latin character at all and in your example is used to represent "a".
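
A hand-built look-alike mapping could be sketched as follows. Only the three code points listed above are included; the class name LookAlikeMap and the method toLatin are hypothetical, and keying the map by code point rather than by char is just one possible design choice.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: LookAlikeMap and toLatin are hypothetical names.
public class LookAlikeMap {

    // Visual look-alikes, keyed by Unicode code point. These pairs reflect
    // how the characters are (mis)used in the question, not any Unicode data.
    private static final Map<Integer, String> LOOK_ALIKES = new HashMap<>();
    static {
        LOOK_ALIKES.put(0x0452, "h");  // CYRILLIC SMALL LETTER DJE
        LOOK_ALIKES.put(0x0166, "F");  // LATIN CAPITAL LETTER T WITH STROKE
        LOOK_ALIKES.put(0x0E04, "a");  // THAI CHARACTER KHO KHWAI
    }

    // Substitute mapped code points; leave everything else untouched.
    public static String toLatin(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String replacement = LOOK_ALIKES.get(cp);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Uses only the three mapped characters plus plain ASCII.
        System.out.println(toLatin("t\u0452e \u0166\u0E04mily"));
    }
}
```

Iterating over code points (Java 8 and later) keeps the sketch correct for characters outside the Basic Multilingual Plane, should the table ever grow to include them.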

Answer by Ondra Žižka

It's a part of Apache Commons Lang as of ver. 3.0.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/

Answer by Dayanand Gowda

The original request has been answered already.

However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.

Naive meaning of transliteration: the translated string in its final form/target charset sounds like the string in its original form. If we want to transliterate any charset to Latin (the English alphabet), then ICU (the ICU4J library in Java) will do the job.

Here is the code snippet in java:

    import com.ibm.icu.text.Transliterator; //ICU4J library import

    // Transliterate any script to Latin, normalizing before and after.
    public static final String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    // Decompose, remove non-spacing marks (accents), then recompose.
    public static final String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
    * Returns the transliterated string to convert any charset to latin.
    */
    public static String transliterate(String input) {
        Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        String result = transliterator.transliterate(input);
        return result;
    }

Answer by Francisco Junior

I'm late to the party, but after facing this issue today, I found this answer to be very good:

String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");

Reference: https://stackoverflow.com/a/16283863
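
One caveat worth checking before relying on this: anything that is still non-ASCII after NFD is deleted outright, not replaced. A small sketch of that behaviour, with AsciiStripDemo and toAscii as made-up names wrapping the same two calls:

```java
import java.text.Normalizer;

// Illustrative only: AsciiStripDemo and toAscii are hypothetical names.
public class AsciiStripDemo {

    // Decompose with NFD, then delete every character outside ASCII.
    public static String toAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Accented letters decompose, so only the combining marks are removed.
        System.out.println(toAscii("Cr\u00E8me Br\u00FBl\u00E9e"));
        // U+00DF (sharp s) and U+00F8 (o with stroke) do not decompose,
        // so the whole letter disappears from the output.
        System.out.println(toAscii("Stra\u00DFe"));
        System.out.println(toAscii("S\u00F8ren"));
    }
}
```

If losing such letters is not acceptable, a lookup table for the characters that do not decompose would have to run before the final strip.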
