Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/3211974/

Time: 2020-10-30 00:49:55  Source: igfitidea

Ignoring diacritic characters when comparing words with special characters (é, è, ...)

Tags: java, android, string, replace, diacritics

Asked by Waza_Be

I have a list of Belgian cities whose names contain diacritic characters (Liège, Quiévrain, Franière, etc.), and I would like to transform these special characters so that I can compare the names with a second list that contains the same names in upper case but without the diacritical marks (LIEGE, QUIEVRAIN, FRANIERE).

What I first tried was to use the upper case:

"LIEGE".contentEquals("Liège".toUpperCase())

but that doesn't work, because the upper case of Liège is LIÈGE, not LIEGE.
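To make the problem concrete, here is a minimal sketch of why upper-casing alone is not enough. The class name UpperCaseDemo and its helper are made up for illustration, and the accented letters are written as Unicode escapes so that the encoding of the source file does not matter.

```java
import java.util.Locale;

class UpperCaseDemo {
    // Upper-cases with a fixed locale so the result does not depend on the default locale.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String city = "Li\u00E8ge";                        // "Liège"
        String upper = upper(city);                        // the accent survives: "LIÈGE"
        System.out.println(upper.equals("LI\u00C8GE"));    // true
        System.out.println(upper.contentEquals("LIEGE"));  // false
    }
}
```

Upper-casing maps è to È, which is still a different character from E, so the comparison with the plain list fails.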

I have some complicated ideas, like replacing each character one by one, but that sounds clumsy and slow.

Any ideas on how to do this in a smart way?

Accepted answer by Pentium10

Check out this method in Java:

    private static final String PLAIN_ASCII = "AaEeIiOoUu" // grave
            + "AaEeIiOoUuYy" // acute
            + "AaEeIiOoUuYy" // circumflex
            + "AaOoNn" // tilde
            + "AaEeIiOoUuYy" // umlaut
            + "Aa" // ring
            + "Cc" // cedilla
            + "OoUu" // double acute
    ;

    private static final String UNICODE = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
            + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
            + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
            + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
            + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
            + "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";

    /**
     * Removes accents from a string, replacing each accented character
     * listed in UNICODE with its ASCII equivalent from PLAIN_ASCII.
     */
    public static String removeAccents(String s) {
        if (s == null)
            return null;
        StringBuilder sb = new StringBuilder(s.length());
        int n = s.length();
        int pos = -1;
        char c;
        boolean found = false;
        for (int i = 0; i < n; i++) {
            pos = -1;
            c = s.charAt(i);
            pos = (c <= 126) ? -1 : UNICODE.indexOf(c);
            if (pos > -1) {
                found = true;
                sb.append(PLAIN_ASCII.charAt(pos));
            } else {
                sb.append(c);
            }
        }
        if (!found) {
            return s;
        } else {
            return sb.toString();
        }
    }

Answer by Stijn Van Bael

As of Java 6, you can use java.text.Normalizer:

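import java.text.Normalizer;

public String unaccent(String s) {
    // NFD splits each accented letter into a base letter plus combining marks
    String normalized = Normalizer.normalize(s, Normalizer.Form.NFD);
    // drop everything that is not ASCII, i.e. the combining marks
    return normalized.replaceAll("[^\\p{ASCII}]", "");
}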

Note that Java 5 also has a sun.text.Normalizer, but its use is strongly discouraged because it is part of Sun's proprietary API and was removed in Java 6.
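Applied to the question's two lists, the Normalizer approach could be used roughly as follows. This is only a sketch: the class AccentInsensitiveCompare and the helper sameCity are hypothetical names introduced here, and accented letters are written as Unicode escapes.

```java
import java.text.Normalizer;
import java.util.Locale;

class AccentInsensitiveCompare {
    // NFD splits "è" into "e" + U+0300 (combining grave accent);
    // the regex then drops every character outside the ASCII range.
    static String unaccent(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFD);
        return normalized.replaceAll("[^\\p{ASCII}]", "");
    }

    // Hypothetical helper: true when the accented name matches the plain
    // upper-case name once accents and case are ignored.
    static boolean sameCity(String accented, String plainUpper) {
        return unaccent(accented).toUpperCase(Locale.ROOT).equals(plainUpper);
    }

    public static void main(String[] args) {
        System.out.println(sameCity("Li\u00E8ge", "LIEGE"));          // true
        System.out.println(sameCity("Qui\u00E9vrain", "QUIEVRAIN"));  // true
        System.out.println(sameCity("Frani\u00E8re", "LIEGE"));       // false
    }
}
```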

Answer by janb

This is the simplest solution I've found so far, and it works perfectly in our applications.

Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

But I don't know whether Normalizer is available on the Android platform.
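The two regular expressions used with Normalizer in the answers above are not interchangeable. The following sketch (class and method names are illustrative only) shows the difference on "Łódź": the letter Ł has no canonical decomposition, so stripping all non-ASCII characters deletes it, while stripping only combining marks leaves it in place.

```java
import java.text.Normalizer;

class StripStrategies {
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Removes every non-ASCII character left after decomposition.
    static String stripNonAscii(String s) {
        return decompose(s).replaceAll("[^\\p{ASCII}]", "");
    }

    // Removes only characters from the Combining Diacritical Marks block (U+0300..U+036F).
    static String stripCombiningMarks(String s) {
        return decompose(s).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String lodz = "\u0141\u00F3d\u017A";  // "Łódź"
        System.out.println(stripNonAscii(lodz).equals("odz"));              // true: Ł is lost
        System.out.println(stripCombiningMarks(lodz).equals("\u0141odz"));  // true: Ł is kept
    }
}
```

Which behaviour is preferable depends on the data: for the French city names in the question, both variants give the same result.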

Answer by tutejszy

If you still need this on Android API 8 or lower (Android 2.2, Java 1.5), where the Normalizer class is not available, here is my code, which I think is easier to modify than Pentium10's answer:

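import java.util.HashMap;

public class StringAccentRemover {

    @SuppressWarnings("serial")
    private static final HashMap<Character, Character> accents  = new HashMap<Character, Character>(){
        {
            // Polish upper-case letters
            put('\u0104', 'A'); // Ą
            put('\u0118', 'E'); // Ę
            put('\u0106', 'C'); // Ć
            put('\u0141', 'L'); // Ł
            put('\u0143', 'N'); // Ń
            put('\u00D3', 'O'); // Ó
            put('\u015A', 'S'); // Ś
            put('\u0179', 'Z'); // Ź
            put('\u017B', 'Z'); // Ż

            // Polish lower-case letters
            put('\u0105', 'a'); // ą
            put('\u0119', 'e'); // ę
            put('\u0107', 'c'); // ć
            put('\u0142', 'l'); // ł
            put('\u0144', 'n'); // ń
            put('\u00F3', 'o'); // ó
            put('\u015B', 's'); // ś
            put('\u017A', 'z'); // ź
            put('\u017C', 'z'); // ż
        }
    };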
    /**
     * Removes accents from a string, replacing each character found in
     * the accents map with its ASCII equivalent.
     */
    public static String removeAccents(String s) {
        char[] result = s.toCharArray();
        for(int i=0; i<result.length; i++) {
            Character replacement = accents.get(result[i]);
            if (replacement!=null) result[i] = replacement;
        }
        return new String(result);
    }

}

Answer by Jean-Philippe Caruana

The Collator class is a good way to do it (see the corresponding javadoc). Here is a unit test that shows how to use it:

import static org.junit.Assert.assertEquals;

import java.text.Collator;
import java.util.Locale;

import org.junit.Test;

public class CollatorTest {
    @Test public void liege() throws Exception {
        Collator compareOperator = Collator.getInstance(Locale.FRENCH);
        compareOperator.setStrength(Collator.PRIMARY);

        assertEquals(0, compareOperator.compare("Liege", "Liege")); // no accent
        assertEquals(0, compareOperator.compare("Liège", "Liege")); // with accent
        assertEquals(0, compareOperator.compare("LIEGE", "Liege")); // case insensitive
        assertEquals(0, compareOperator.compare("LIEGE", "Liège")); // case insensitive with accent

        assertEquals(1, compareOperator.compare("Liege", "Bruxelles"));
        assertEquals(-1, compareOperator.compare("Bruxelles", "Liege"));
    }
}


EDIT: sorry to see my answer did not meet your needs; maybe it's because I presented it as a unit test? Is this OK for you? I personally find it better because it's short and it uses the SDK (no need for String replacement):

Collator compareOperator = Collator.getInstance(Locale.FRENCH);
compareOperator.setStrength(Collator.PRIMARY);
if (compareOperator.compare("Liège", "Liege") == 0) {
    // if we are here, then it's the "same" String
}

Hope this helps.
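Because Collator implements Comparator, it can also back a sorted collection, which fits the question's "look a name up in another list" use case. The following is a sketch under that assumption; the class CollatorLookup and the method accentInsensitiveSet are hypothetical names, and the accented letter is written as a Unicode escape.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class CollatorLookup {
    // Builds a set whose notion of equality ignores accents and case
    // (PRIMARY strength compares base letters only).
    static Set<String> accentInsensitiveSet(String... names) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY);
        Set<String> set = new TreeSet<String>(collator);
        set.addAll(Arrays.asList(names));
        return set;
    }

    public static void main(String[] args) {
        Set<String> plain = accentInsensitiveSet("LIEGE", "QUIEVRAIN", "FRANIERE");
        System.out.println(plain.contains("Li\u00E8ge"));  // true
        System.out.println(plain.contains("Bruxelles"));   // false
    }
}
```

A design note: such a set treats "Liège", "Liege" and "LIEGE" as the same element, so adding a second spelling of a name that is already present has no effect.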

Answer by numéro6

I don't know if it is available on Android, but on the JVM you should not reimplement this in your project; reuse existing code instead: just use org.apache.commons.lang3.StringUtils#stripAccents.

Answer by Laurens

For those looking for a clean Java solution, use Apache Commons:

StringUtils.stripAccents("Liège").toUpperCase();

This will return:

LIEGE

Answer by Giorgio Barchiesi

Since the Normalizer class is not supported in Froyo or earlier Android versions, I have combined this and this (both of which I voted up) and optimized the result, obtaining a couple of helper methods. The unaccentify method simply converts diacritic characters to plain characters, while the slugify method generates a slug for the input string. Hope it can be useful to someone. Here is the source code:

import java.util.Arrays;
import java.util.Locale;  
import java.util.regex.Pattern;  

public class SlugFroyo {
    private static final Pattern STRANGE = Pattern.compile("[^a-zA-Z0-9-]");
    private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

    private static final String DIACRITIC_CHARS = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
            + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
            + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
            + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
            + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
            + "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";

    private static final String PLAIN_CHARS = "AaEeIiOoUu" // grave
            + "AaEeIiOoUuYy" // acute
            + "AaEeIiOoUuYy" // circumflex
            + "AaOoNn" // tilde
            + "AaEeIiOoUuYy" // umlaut
            + "Aa" // ring
            + "Cc" // cedilla
            + "OoUu"; // double acute

    private static char[] lookup = new char[0x180];

    static {
        Arrays.fill(lookup, (char) 0);
        for (int i = 0; i < DIACRITIC_CHARS.length(); i++)
            lookup[DIACRITIC_CHARS.charAt(i)] = PLAIN_CHARS.charAt(i);
    }

    public static String slugify(String s) {
        String nowhitespace = WHITESPACE.matcher(s).replaceAll("-");
        String unaccented = unaccentify(nowhitespace);
        String slug = STRANGE.matcher(unaccented).replaceAll("");
        return slug.toLowerCase(Locale.ENGLISH);
    }

    public static String unaccentify(String s) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = 0; i < sb.length(); i++) {
            char c = sb.charAt(i);
            if (c > 126 && c < lookup.length) {
                char replacement = lookup[c];
                if (replacement > 0)
                    sb.setCharAt(i, replacement);
            }
        }
        return sb.toString();
    }
}