Remove all non-"word characters" from a String in Java, leaving accented characters?

Question

Asked by Epaga

Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters".

        "TESTüTEST".replaceAll( "\\W", "" )

returns "TESTTEST" for me. What I want is for only the truly non-"word characters" to be removed. Is there any way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize that I forgot a character?

Answer 1

Accepted answer by Tim Pietzcker

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols such as ¼; the latter doesn't. See it on regex101.com.
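
As a rough sketch of how this behaves on the question's input, the class below compares the pattern above with Java's Unicode-aware \W. The class and method names are made up for illustration, and the (?U) flag (UNICODE_CHARACTER_CLASS) assumes Java 7 or later. Note that the two differ on the underscore: \p{L}\p{Nd} drops it, while Unicode-aware \W keeps it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class UnicodeWordFilter {
    // Matches runs of characters that are neither letters nor decimal digits.
    private static final Pattern NON_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{L}\\p{Nd}]+");

    // Assumption: Java 7+, where (?U) makes \W follow Unicode definitions.
    private static final Pattern NON_WORD_UNICODE =
            Pattern.compile("(?U)\\W+");

    static String keepLettersAndDigits(String s) {
        return NON_LETTER_OR_DIGIT.matcher(s).replaceAll("");
    }

    static String keepUnicodeWordChars(String s) {
        return NON_WORD_UNICODE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "TEST\u00FC_TEST!"; // \u00FC is ü
        System.out.println(keepLettersAndDigits(input)); // prints TESTüTEST
        System.out.println(keepUnicodeWordChars(input)); // prints TESTü_TEST
    }
}
```

Precompiling the patterns avoids recompiling the regex on every call, which String.replaceAll would otherwise do.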

Answer 2

Answer by Epaga

Well, here is one solution I ended up with, but I hope there's a more elegant one...

StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
    char tmpChar = name.charAt( i );
    if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
        result.append( tmpChar );
    }
}

result ends up with the desired result...
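
One caveat with the loop above: charAt works on UTF-16 code units, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) would be dropped. A possible variant that iterates by code point is sketched below; the class and method names are hypothetical.

```java
// Hypothetical variant of the loop above that walks code points instead of chars.
public class CodePointFilter {
    static String keepWordChars(String name) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            // Keep letters, digits, and the underscore, as in the original loop.
            if (Character.isLetterOrDigit(cp) || cp == '_') {
                result.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point's size.
            i += Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepWordChars("TEST\u00FC-TEST_1!")); // prints TESTüTEST_1
    }
}
```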

Answer 3

Answer by Stefan Haberl

At times you do not want to remove the characters altogether, but just remove the accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
            "sz" };

    /**
     * Normalizes a String by removing all accents to original 127 US-ASCII
     * characters. This method handles German umlauts and "sharp-s" correctly
     * 
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     * <p>
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     * <ul>
     * <li>all whitespaces
     * <li>dots ('.')
     * <li>(semi-)colons (';' and ':')
     * <li>equals ('=')
     * <li>ampersands ('&')
     * <li>slashes ('/')
     * <li>angle brackets ('<' and '>')
     * </ul>
     * 
     * @param s
     *            The String to slugify
     * @return The slugified String
     * @see #normalize(String)
     */
    public static String slugify(String s) {

        if (s == null)
            return null;

        String n = normalize(s);
        n = StringUtils.lowerCase(n);
        n = n.replaceAll("[\\s.:;&=<>/]", "");

        return n;
    }
}

Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.

HTH

EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
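
For readers who only need the accent-removal step and not the German transliteration, a dependency-free sketch might look like the following. It is not the class above: the names are hypothetical, and it removes only combining marks (\p{M}) after NFD decomposition rather than every non-ASCII character.

```java
import java.text.Normalizer;

// Hypothetical, dependency-free sketch of the accent-removal step only.
public class AccentStripper {
    static String stripAccents(String s) {
        if (s == null) {
            return null;
        }
        // NFD splits "é" into "e" plus a combining acute accent.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks, leaving the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Cr\u00E8me br\u00FBl\u00E9e")); // prints Creme brulee
    }
}
```

Characters without a canonical decomposition, such as ß, pass through unchanged here, which is one reason the class above uses an explicit replacement list.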

Answer 4

Answer by István

You might want to remove the accents and diacritical marks first, then check at each character position whether the "simplified" string has an ASCII letter there. If it does, the original position contains a word character; if not, it can be removed.
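
One way this idea could be sketched is shown below. Because decomposition changes string length, the sketch simplifies each code point on its own so positions stay aligned. All names are hypothetical, and accepting ASCII digits in addition to letters is an assumption not stated in the answer.

```java
import java.text.Normalizer;

// Hypothetical sketch of the "simplify, then check for ASCII" idea.
public class SimplifiedCheckFilter {
    // Decomposes one code point and strips its combining marks.
    static String simplify(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        return Normalizer.normalize(original, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    // Assumption: digits count as word characters alongside letters.
    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    static String keepWordChars(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String simplified = simplify(cp);
            // Keep the ORIGINAL character when its simplified form is ASCII.
            if (!simplified.isEmpty() && isAsciiLetterOrDigit(simplified.charAt(0))) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepWordChars("TEST\u00FC TEST!")); // prints TESTüTEST
    }
}
```

This approach only recognizes letters that decompose to an ASCII base, so characters such as ß or non-Latin letters are removed. It also assumes precomposed (NFC) input; a standalone combining mark is dropped.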

Answer 5

Answer by Mena

I was trying to achieve the exact opposite when I stumbled upon this thread. I know it's quite old, but here's my solution nonetheless. You can use Unicode blocks. In this case, compile the following code (with the right imports):

// needs java.util.regex.Pattern and java.util.regex.Matcher
String s = "êìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
Matcher m = p.matcher(s);
System.out.println(m.find());
System.out.println(s.replaceAll(p.pattern(), "#"));

You should see the following output:

true
#blah

Best,

Answer 6

Answer by Paul Dinesh

You can use StringUtils from Apache Commons Lang.
