Java: Remove diacritical marks (ń, ň, ...) from Unicode characters

Question

Asked by flybywire

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character.

For example:

ń  ň  ...  --> n
á --> a
à --> a
ä --> a
ö --> o

Etc.

I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.
Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone enters Bjorn and not Björn.

Answer 1

Accepted answer by Andreas Petersson

I have done this recently in Java:

public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    // NFD splits each accented letter into a base letter plus combining marks
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Remove the combining marks and the modifier letters/symbols
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

This will do as you specified:

stripDiacritics("Björn")  = Bjorn

but it will fail on, for example, Białystok, because the ł character is not a diacritic.
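To make the two cases concrete, here is a minimal self-contained sketch that wraps the same normalize-then-strip approach in a runnable class. The class name `StripDemo` is made up for illustration, and the input strings use `\u` escapes only to avoid source-encoding problems.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class StripDemo {
    // Same character classes as the pattern above; note the doubled backslashes.
    private static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // Decompose, then delete the marks that were split off.
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00F6 decomposes into 'o' + U+0308, so the mark can be removed.
        System.out.println(stripDiacritics("Bj\u00F6rn"));
        // U+0142 (l with stroke) has no decomposition, so it passes through unchanged.
        System.out.println(stripDiacritics("Bia\u0142ystok"));
    }
}
```

The second call returns its input untouched, which is why a separate replacement table for such letters is still needed.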

If you want a full-blown string simplifier, you will need a second cleanup round for some more special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it will give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections.

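import com.google.common.collect.ImmutableMap;

import java.text.Normalizer;
import java.util.regex.Pattern;

public class StringSimplifier {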
    public static final char DEFAULT_REPLACE_CHAR = '-';
    public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
    private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()

        //Remove junk characters with no semantics
        .put(".", "")
        .put("\"", "")
        .put("'", "")

        //Keep relevant characters as separators
        .put(" ", DEFAULT_REPLACE)
        .put("]", DEFAULT_REPLACE)
        .put("[", DEFAULT_REPLACE)
        .put(")", DEFAULT_REPLACE)
        .put("(", DEFAULT_REPLACE)
        .put("=", DEFAULT_REPLACE)
        .put("!", DEFAULT_REPLACE)
        .put("/", DEFAULT_REPLACE)
        .put("\\", DEFAULT_REPLACE)
        .put("&", DEFAULT_REPLACE)
        .put(",", DEFAULT_REPLACE)
        .put("?", DEFAULT_REPLACE)
        .put("°", DEFAULT_REPLACE) //Remove ?? is diacritic?
        .put("|", DEFAULT_REPLACE)
        .put("<", DEFAULT_REPLACE)
        .put(">", DEFAULT_REPLACE)
        .put(";", DEFAULT_REPLACE)
        .put(":", DEFAULT_REPLACE)
        .put("_", DEFAULT_REPLACE)
        .put("#", DEFAULT_REPLACE)
        .put("~", DEFAULT_REPLACE)
        .put("+", DEFAULT_REPLACE)
        .put("*", DEFAULT_REPLACE)

        //Replace non-diacritics as their equivalent characters
        .put("\u0141", "l") // BiaLystock
        .put("\u0142", "l") // Bialystock
        .put("\u00DF", "ss") // ß
        .put("\u00E6", "ae") // æ
        .put("\u00F8", "o") // ø
        .put("\u00A9", "c") // © (assumed; the original character was lost)
        .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
        .put("\u00F0", "d")
        .put("\u0110", "d")
        .put("\u0111", "d")
        .put("\u0189", "d")
        .put("\u0256", "d")
        .put("\u00DE", "th") // thorn T
        .put("\u00FE", "th") // thorn t
        .build();


    public static String simplifiedString(String orig) {
        String str = orig;
        if (str == null) {
            return null;
        }
        str = stripDiacritics(str);
        str = stripNonDiacritics(str);
        if (str.length() == 0) {
            // Ugly special case to work around non-existing empty strings
            // in Oracle. Store original crapstring as simplified.
            // It would return an empty string if Oracle could store it.
            return orig;
        }
        return str.toLowerCase();
    }

    private static String stripNonDiacritics(String orig) {
        StringBuilder ret = new StringBuilder();
        String lastchar = null;
        for (int i = 0; i < orig.length(); i++) {
            String source = orig.substring(i, i + 1);
            String replace = NONDIACRITICS.get(source);
            String toReplace = replace == null ? String.valueOf(source) : replace;
            if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                toReplace = "";
            } else {
                lastchar = toReplace;
            }
            ret.append(toReplace);
        }
        if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
            ret.deleteCharAt(ret.length() - 1);
        }
        return ret.toString();
    }

    /*
    Special regular expression character ranges relevant for simplification -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
    InCombiningDiacriticalMarks: special marks that are part of "normal" accented letters such as ä, ö etc.
        IsSk: Symbol, Modifier see http://www.fileformat.info/info/unicode/category/Sk/list.htm
        IsLm: Letter, Modifier see http://www.fileformat.info/info/unicode/category/Lm/list.htm
     */
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");


    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}

Answer 2

Answered by Lucero

Unicode has specific diacritic characters (which are combining characters), and a string can be converted so that the base character and the diacritics are separated. Then you can just remove the diacritics from the string and you're basically done.
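As a rough illustration of that separation in Java, the sketch below prints the code points of a precomposed letter before and after NFD normalization and then drops the non-spacing marks. The class and method names are hypothetical, and using the `\p{Mn}` category to select the marks is one possible choice rather than something this answer prescribes.

```java
import java.text.Normalizer;

final class DecomposeDemo {
    /** Canonical decomposition: base letter followed by its combining marks. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Formats the code points of s as space-separated U+XXXX values. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Decomposes and then removes every non-spacing mark (category Mn). */
    static String removeMarks(String s) {
        return decompose(s).replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String composed = "\u00E9"; // e with acute, one precomposed code point
        System.out.println(codePoints(composed));            // U+00E9
        System.out.println(codePoints(decompose(composed))); // U+0065 U+0301
        System.out.println(removeMarks(composed));           // e
    }
}
```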

For more information on normalization, decompositions and equivalence, see The Unicode Standard at the Unicode home page.

However, how you can actually achieve this depends on the framework/OS/... you're working on. If you're using .NET, you can use the String.Normalize method, which accepts a System.Text.NormalizationForm enumeration value.

Answer 3

Answered by ire_and_curses

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:

Answer 4

Answered by nils

You could use the Normalizer class from java.text:

System.out.println(new String(Normalizer.normalize("ń ň", Normalizer.Form.NFKD).getBytes("ascii"), "ascii"));

But there is still some work to do, since Java does strange things with unconvertible Unicode characters (it does not ignore them, and it does not throw an exception). But I think you could use that as a starting point.
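One way to see what happens, and one possible way around it, is sketched below. `String.getBytes(Charset)` is documented to substitute the charset's replacement byte for unmappable characters, which for US-ASCII shows up as `?`. The regex-based alternative and the names `AsciiFoldDemo`, `viaBytes` and `viaRegex` are assumptions for illustration, not part of this answer.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class AsciiFoldDemo {
    /** Round-trips through US-ASCII; each unmappable character comes back as '?'. */
    static String viaBytes(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        byte[] ascii = decomposed.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    /** Deletes everything outside ASCII after decomposition instead of substituting. */
    static String viaRegex(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String input = "\u0144 \u0148"; // n with acute, space, n with caron
        System.out.println(viaBytes(input)); // n? n?
        System.out.println(viaRegex(input)); // n n
    }
}
```

Note that the regex variant silently deletes letters that have no decomposition at all (for example U+0142), so it is only a starting point as well.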

Answer 5

Answered by paxdiablo

The easiest way (to me) would be to maintain a sparse mapping array that converts your Unicode code points into displayable strings.

Such as:

start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}
start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}
start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}
start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}
: : :

Using a sparse array lets you represent replacements efficiently even when they are in widely spaced sections of the Unicode table. String replacements allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).

This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).
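Since the question is about Java, here is one way the run-based table above might look there. It is only a sketch: the `TreeMap` keyed by each run's first code point, and the names `SparseFoldDemo`, `RUNS` and `fold`, are choices made for this example, and the table contains only the four runs listed above.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseFoldDemo {
    // Key: first code point of a run. Value: replacements for consecutive code points.
    private static final TreeMap<Integer, String[]> RUNS = new TreeMap<>();
    static {
        RUNS.put(0x00C0, new String[] {
            "A", "A", "A", "A", "A", "A", "AE", "C",
            "E", "E", "E", "E", "I", "I", "I", "I",
            "D", "N", "O", "O", "O", "O", "O"});
        RUNS.put(0x00D8, new String[] {"O", "U", "U", "U", "U", "Y"});
        RUNS.put(0x00E0, new String[] {
            "a", "a", "a", "a", "a", "a", "ae", "c",
            "e", "e", "e", "e", "i", "i", "i", "i",
            "d", "n", "o", "o", "o", "o", "o"});
        RUNS.put(0x00F8, new String[] {"o", "u", "u", "u", "u", "y"});
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            // Find the run starting at or before cp, then check cp falls inside it.
            Map.Entry<Integer, String[]> run = RUNS.floorEntry(cp);
            if (run != null && cp - run.getKey() < run.getValue().length) {
                out.append(run.getValue()[cp - run.getKey()]);
            } else {
                out.appendCodePoint(cp); // not covered by any run: keep as-is
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Bj\u00F6rn"));  // Bjorn
        System.out.println(fold("\u00C6sir"));   // AEsir
    }
}
```

Code points that fall in the gaps between runs (such as U+00F7, the division sign) are left unchanged, which is the point of keeping the table sparse.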

Answer 6

Answered by erickson

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert them to byte arrays.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<CollationKey, String>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
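Because of that locale dependence, it may help to pin the locale explicitly rather than rely on the default. The sketch below does so with `Locale.ENGLISH` (an arbitrary choice for this example) and also shows the byte-array form of a key that could be stored in a database. The helper names are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollatorDemo {
    /** Collator that ignores accent and case differences for the given locale. */
    static Collator primaryCollator(Locale locale) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    /** Byte form of the collation key, suitable for storing in an index. */
    static byte[] storableKey(Collator c, String s) {
        return c.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Collator c = primaryCollator(Locale.ENGLISH);
        // Direct comparison: 0 means "equal at PRIMARY strength".
        System.out.println(c.compare("Bj\u00F6rn", "bjorn") == 0); // true
        // Stored keys can be compared byte for byte.
        byte[] stored = storableKey(c, "Bj\u00F6rn");
        byte[] query = storableKey(c, "bjorn");
        System.out.println(Arrays.equals(stored, query)); // true
    }
}
```

Keys produced by collators for different locales (or different strengths) are not comparable with each other, so the same configuration has to be used for storing and for querying.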

Answer 7

Answered by Viktor Jevdokimov

In Windows and .NET, I just convert using string encoding. That way I avoid manual mapping and coding.

Try to play with string encoding.

Answer 8

Answered by Beska

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.

For instance, in German, when replacing the eszett (ß), some people might use "B", while others might use "ss". Or an o with an umlaut might be replaced with either "o" or "oe". Ideally, I would think any solution you come up with should handle both.
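A minimal sketch of that idea in Java is to store several search keys per name instead of one. Everything here is an assumption made for illustration: the class and method names, the choice of lowercase keys, and the small German-style table (ä to ae, ö to oe, ü to ue, ß to ss).

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class SearchKeysDemo {
    /** Decomposes and removes non-spacing marks, e.g. o-umlaut becomes o. */
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}+", "");
    }

    /** German-style transliteration; expects lowercase, precomposed (NFC) input. */
    static String germanStyle(String s) {
        return s.replace("\u00E4", "ae")
                .replace("\u00F6", "oe")
                .replace("\u00FC", "ue")
                .replace("\u00DF", "ss");
    }

    /** All keys under which a name should be findable, in insertion order. */
    static Set<String> searchKeys(String name) {
        String lower = Normalizer.normalize(name, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
        Set<String> keys = new LinkedHashSet<>();
        keys.add(lower);                                      // as entered
        keys.add(stripMarks(lower).replace("\u00DF", "ss"));  // marks simply dropped
        keys.add(stripMarks(germanStyle(lower)));             // "oe"-style spelling
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(searchKeys("Bj\u00F6rn"));   // three keys
        System.out.println(searchKeys("Stra\u00DFe"));  // two keys (duplicates collapse)
    }
}
```

A lookup then matches if the normalized query equals any of the stored keys.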

Answer 9

Answered by Nathan Baulch

For future reference, here is a C# extension method that removes accents.

using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    public static string RemoveDiacritics(this string str)
    {
        // FormD decomposes accented letters; the combining marks are then filtered out
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}

public static class Program
{
    static void Main()
    {
        var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
        var output = input.RemoveDiacritics();
        Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
    }
}

Answer 10

Answered by unwind

Please note that not all of these marks are just "marks" on some "normal" character, that you can remove without changing the meaning.

In Swedish, å, ä and ö are true and proper first-class characters, not some "variant" of some other character. They sound different from all other characters, they sort differently, and they change the meaning of words ("mätt" and "matt" are two different words).
