Java: Is there a way to get rid of accents and convert a whole string to regular letters?

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/3322152/

Time: 2020-08-13 22:07:07. Source: igfitidea.

Tags: java, string, diacritics

Asked by Martin

Is there a better way to get rid of accents and turn those letters into regular ones than calling String.replaceAll() and replacing the letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to include all letters with accents, like the Russian alphabet or the Chinese one.

Accepted answer by Erick Robertson

Use java.text.Normalizer to handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" decomposition

This separates all of the accent marks from their base characters. Then you just need to check each character and throw out the ones that aren't plain letters.

string = string.replaceAll("[^\\p{ASCII}]", "");

If your text is in Unicode, you should use this instead:

string = string.replaceAll("\\p{M}", "");

For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.

Thanks to GarretWilson for the pointer and to regular-expressions.info for the great Unicode guide.
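The two steps of this answer (decompose, then delete the marks) can be put together as a minimal sketch. The class name AccentStripper and the method name strip are illustrative assumptions, not part of the original answer; the sample string is the input from the question, written with Unicode escapes so that the source-file encoding does not matter.

```java
import java.text.Normalizer;

final class AccentStripper {
    /** Decomposes to NFD, then deletes every combining mark (\p{M}). */
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // The sample input from the question, spelled with Unicode escapes
        String input = "or\u010dp\u017es\u00ed\u00e1\u00fdd";
        System.out.println(strip(input)); // prints "orcpzsiayd"
    }
}
```

Because only combining marks are removed, characters from other scripts pass through unchanged, which is the difference from the [^\\p{ASCII}] variant.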

Answer by NinjaCat

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks.

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might inherently change the meaning of the word, or change the letters into completely different ones.

Answer by Nico

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The snippet above prints "aee", which is what I wanted, but

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.

Answer by virgo47

EDIT: If you're not stuck with Java <6, speed is not critical, and/or a translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.

While this is not a "perfect" solution, it works well when you know the range (in our case Latin1,2), it worked before Java 6 (not a real issue though), and it is much faster than the most suggested version (which may or may not be an issue):

/**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

Tests on my HW with a 32-bit JDK show that this converts àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms, while the Normalizer way takes 3.7s (37x slower). If performance matters to you and you know the input range, this may be for you.

Enjoy :-)

Answer by David Conrad

The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out, 0, j); // copy only the j chars actually written, not the unused tail
}

This variation combines the correctness of the Normalizer version with some of the speed of the table version. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).

Answer by DavidS

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):

    String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
    System.out.println(input);
    // Prints "This is a funky String"

Note:

The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
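The reason Normalizer leaves such letters alone is that they have no canonical decomposition: there is no combining mark to strip. A minimal sketch of one way to cover them is an explicit replacement table applied alongside the mark removal. The class name, the table contents, and the decision to map these letters at all are assumptions made for illustration (Map.of needs Java 9 or later); whether Ø should become "O" is a language question, as noted above.

```java
import java.text.Normalizer;
import java.util.Map;

final class StrokeLetterDemo {
    // Hypothetical, hand-picked replacements; extend or drop entries to suit your data.
    private static final Map<Character, String> EXTRA = Map.of(
            '\u00d8', "O",   // capital O with stroke
            '\u00f8', "o",   // small o with stroke
            '\u0141', "L",   // capital L with stroke
            '\u0142', "l");  // small l with stroke

    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the accent that NFD split off
            }
            String mapped = EXTRA.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // NFD alone leaves the stroke letter untouched
        System.out.println(Normalizer.normalize("\u0141", Normalizer.Form.NFD).equals("\u0141")); // prints "true"
        System.out.println(strip("\u0141\u00f3d\u017a")); // prints "Lodz"
    }
}
```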

Answer by Ricardo Freitas

@David Conrad's solution is the fastest I tried that uses the Normalizer, but it does have a bug. It strips characters that are not accents: for example, Chinese characters and other letters like æ are all stripped. The characters we want to strip are non-spacing marks, characters that don't take up extra width in the final string. These zero-width characters basically end up combined with some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Size the buffer from the normalized string: NFD can yield more chars than the input.
    char[] out = new char[norm.length()];

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
    return new String(out, 0, j); // copy only the j chars actually written
}
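To make the difference between the two filters concrete, here is a small side-by-side sketch. The class and method names are assumptions for illustration; asciiOnly follows the ASCII check from David Conrad's answer and withoutMarks follows the non-spacing-mark check used above.

```java
import java.text.Normalizer;

final class FilterComparison {
    /** Keeps only 7-bit ASCII chars after NFD decomposition. */
    static String asciiOnly(String s) {
        String norm = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        for (int i = 0; i < norm.length(); i++) {
            char c = norm.charAt(i);
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }

    /** Drops only non-spacing marks after NFD decomposition. */
    static String withoutMarks(String s) {
        String norm = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        for (int i = 0; i < norm.length(); i++) {
            char c = norm.charAt(i);
            if (Character.getType(c) != Character.NON_SPACING_MARK) sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent, two CJK ideographs, and the ae ligature
        String input = "caf\u00e9 \u4e2d\u6587 \u00e6";
        System.out.println(asciiOnly(input));
        System.out.println(withoutMarks(input));
    }
}
```

With this input, asciiOnly returns "cafe" followed by two spaces, because the ideographs and the ligature are discarded, while withoutMarks removes only the acute accent and keeps everything else.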

Answer by Yash

I faced the same issue in a string equality check: one of the strings being compared contained characters with codes in the range 128-255, outside 7-bit ASCII.

For example, a non-breaking space [hex A0] versus a regular space [hex 20]. To show non-breaking spaces in HTML I used the following spacing entities. The characters and their bytes are: &emsp; is a very wide space {-30, -128, -125}, &ensp; is a somewhat wide space {-30, -128, -126}, &thinsp; is a narrow space {32}, non-HTML space {}.

String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data"; // s2 uses em spaces (U+2003)
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes(java.nio.charset.StandardCharsets.UTF_8)));

Output in bytes:

S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

Use the code below to inspect different spaces and their byte codes (see the Wikipedia List_of_Unicode_characters):

// requires java.nio.charset.StandardCharsets and java.util.Arrays
String spacing_entities = "very\u2003wide\u2003space,narrow\u2002space,regular space,invisible\u00A0separator";
System.out.println("Space String :" + spacing_entities);
byte[] byteArray =
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( s2 ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:" + Arrays.toString( byteArray ) );
// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
System.out.format("Bytes to String[%S] \n ", new String(byteArray, StandardCharsets.UTF_8));
  • ASCII transliteration of Unicode strings for Java, using unidecode:

    String initials = Unidecode.decode( s2 );

  • Using Guava (Google Core Libraries for Java):

    // newer Guava releases expose this matcher as CharMatcher.whitespace()
    String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );

    To URL-encode the space, use the Guava library:

    String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);

  • To overcome this problem, use String.replaceAll() with a regular expression:

    // \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    s2 = s2.replaceAll("\\p{Zs}", " ");

    s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
    s2 = s2.replaceAll("\u2003", " "); // the em space itself

  • Using java.text.Normalizer.Form. This enum provides constants for the four Unicode normalization forms described in Unicode Standard Annex #15 (Unicode Normalization Forms) and two methods to access them.

    s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
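The regular-expression option from the list above can be tried end to end with a short sketch. The class name SpaceNormalizer and the method name normalizeSpaces are assumptions for illustration; the strings mirror s1 and s2 from this answer, with the em space written as a Unicode escape.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SpaceNormalizer {
    /** Replaces every Unicode space separator (category Zs) with a plain U+0020 space. */
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\p{Zs}", " ");
    }

    public static void main(String[] args) {
        String s1 = "My Sample Space Data";
        String s2 = "My\u2003Sample\u2003Space\u2003Data"; // em spaces instead of plain spaces

        // UTF-8 bytes of a single em space
        System.out.println(Arrays.toString("\u2003".getBytes(StandardCharsets.UTF_8))); // prints "[-30, -128, -125]"
        System.out.println(s1.equals(s2));                  // prints "false"
        System.out.println(s1.equals(normalizeSpaces(s2))); // prints "true"
    }
}
```

The Zs category also covers the non-breaking space and the en space, so one call handles all the spacing entities mentioned in this answer without listing them individually.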
    


Testing a string with the different approaches (Unidecode, Normalizer, StringUtils):

String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// NFD splits each accented letter into a base letter plus combining marks;
// stripping the marks leaves Æ,Ø,Ð,ß untouched
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

Using Unidecode is the best choice; my final code is shown below.

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data"; // s2 uses em spaces (U+2003)
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { // non-breaking space %A0, comma %2C, space %20: http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }
}

Answer by OlgaMaciaszek

I suggest Junidecode. It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.

Answer by Zhar

If you have no library available, one of the best ways, using a regex and Normalizer, is:

public String flattenToAscii(String s) {
    if (s == null || s.trim().length() == 0)
        return "";
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}

This is more efficient than replaceAll("[^\\p{ASCII}]", "") if all you need is to remove diacritics (just like your example).

Otherwise, you have to use the \\p{ASCII} pattern.

Regards.
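If this method is called often, one possible refinement is to compile the character class once instead of on every call, since String.replaceAll compiles its regex each time it runs. The sketch below is a variation on the method above, not the answer's own code; the class name CompiledStripper and the constant name are assumptions.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class CompiledStripper {
    // Compiled once and reused; covers the Combining Diacritical Marks block U+0300 to U+036F.
    private static final Pattern COMBINING_MARKS = Pattern.compile("[\\u0300-\\u036F]");

    static String flattenToAscii(String s) {
        if (s == null || s.trim().isEmpty()) {
            return ""; // same null and blank handling as the method above
        }
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(flattenToAscii("\u00e0\u00e8\u00e9")); // prints "aee"
    }
}
```

As with the original method, characters outside the combining-marks block (for example Ø or Chinese characters) are left in place, so the result is not guaranteed to be pure ASCII despite the method name.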