Remove all non-"word characters" from a String in Java, leaving accented characters?

Source: Stack Overflow, http://stackoverflow.com/questions/1611979/ (page timestamp 2020-08-12 17:48:47). Content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.

java regex string

Asked by Epaga

Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters" when I use a regex.

        "TESTüTEST".replaceAll("\\W", "")

returns "TESTTEST" for me. What I want is for only the truly non-"word characters" to be removed. Is there any way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize I forgot ??

Accepted answer by Tim Pietzcker

Use [^\p{L}\p{Nd}]+ which matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
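
As a quick check of the pattern above, here is a minimal runnable sketch. The class name `WordCharFilter` and the method `strip` are made up for illustration and are not part of the answer; the pattern itself is the one given above, with the backslashes doubled as a Java string literal requires.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
final class WordCharFilter {
    // Matches runs of anything that is neither a Unicode letter (\p{L})
    // nor a decimal digit (\p{Nd}). Compiled once and reused.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}]+");

    static String strip(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00FC (ü) is a letter, so it is kept.
        System.out.println(strip("TEST\u00fcTEST"));
        // '-', ' ' and U+00BC (¼, category No) are removed; '1' is kept.
        System.out.println(strip("a-b c\u00bc1"));
    }
}
```

Two things to keep in mind with this sketch: unlike \w, the pattern also removes the underscore, and it removes combining marks, so input in decomposed form (a plain "u" followed by U+0308) would lose its accent unless \p{M} is added to the character class.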

Answered by Epaga

Well, here is one solution I ended up with, but I hope there's a more elegant one...

StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
    char tmpChar = name.charAt( i );
    if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
        result.append( tmpChar );
    }
}

result ends up with the desired result...
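
The loop above works on `char` values, so characters outside the Basic Multilingual Plane (stored as surrogate pairs) would be dropped. A possible variant that iterates over code points instead is sketched below; the class and method names are hypothetical, and it assumes Java 8 or later for `codePoints()`.

```java
// Hypothetical variant of the loop above, not from the original answer.
final class LetterDigitFilter {
    static String keepLettersAndDigits(String input) {
        StringBuilder result = new StringBuilder(input.length());
        input.codePoints()
             // Same test as the original loop, applied to full code points.
             .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
             .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepLettersAndDigits("TEST\u00fcTEST!"));
    }
}
```

The underscore is kept on purpose, mirroring the original loop and the behavior of \w.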

Answered by Stefan Haberl

At times you do not want to remove the characters entirely, just their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
            "sz" };

    /**
     * Normalizes a String by removing all accents to original 127 US-ASCII
     * characters. This method handles German umlauts and "sharp-s" correctly
     * 
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     * <p>
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     * <ul>
     * <li>all whitespaces
     * <li>dots ('.')
     * <li>(semi-)colons (';' and ':')
     * <li>equals ('=')
     * <li>ampersands ('&')
     * <li>slashes ('/')
     * <li>angle brackets ('<' and '>')
     * </ul>
     * 
     * @param s
     *            The String to slugify
     * @return The slugified String
     * @see #normalize(String)
     */
    public static String slugify(String s) {

        if (s == null)
            return null;

        String n = normalize(s);
        n = StringUtils.lowerCase(n);
        n = n.replaceAll("[\\s.:;&=<>/]", "");

        return n;
    }
}

Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
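
If adding Commons Lang just for this is not an option, the same idea can be sketched with the JDK alone. This is an illustrative rewrite under assumptions, not the author's code: the class name `PlainSlug` and the table `GERMAN` are made up, and the input is assumed to use precomposed characters (NFC), otherwise the German replacements will not match.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical JDK-only sketch of the utility class above.
final class PlainSlug {
    // German special cases, written as Unicode escapes to avoid source-encoding issues:
    // Ä ä Ö ö Ü ü ß
    private static final String[][] GERMAN = {
        {"\u00c4", "Ae"}, {"\u00e4", "ae"}, {"\u00d6", "Oe"}, {"\u00f6", "oe"},
        {"\u00dc", "Ue"}, {"\u00fc", "ue"}, {"\u00df", "sz"}
    };

    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        String n = s;
        for (String[] pair : GERMAN) {
            n = n.replace(pair[0], pair[1]);
        }
        // NFD splits an accented letter into base letter + combining mark;
        // the mark is non-ASCII and is removed by the replaceAll.
        return Normalizer.normalize(n, Normalizer.Form.NFD)
                         .replaceAll("[^\\p{ASCII}]", "");
    }

    static String slugify(String s) {
        if (s == null) {
            return null;
        }
        return normalize(s).toLowerCase(Locale.ROOT)
                           .replaceAll("[\\s.:;&=<>/]", "");
    }

    public static void main(String[] args) {
        System.out.println(slugify("\u00dcber caf\u00e9: a/b"));
    }
}
```

Letters without a canonical decomposition (for example ø or đ) are simply dropped by this approach, so the replacement table would need extending for them.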

HTH

EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
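
As one possible reading of that advice, the sketch below percent-encodes a value for use as a query parameter and escapes it for embedding in HTML. Both helper names are hypothetical. `URLEncoder` applies form encoding (a space becomes `+`), so it suits query parameters rather than path segments, and the HTML escaper is a deliberately small hand-written one, not a replacement for a vetted library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helpers, not from the original answer.
final class UrlSafe {
    // Form-encodes a value so it can be used as a URL query parameter.
    static String encodeForQuery(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Escapes the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String value) {
        return value.replace("&", "&amp;")   // must run first
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;")
                    .replace("'", "&#39;");
    }

    public static void main(String[] args) {
        System.out.println(encodeForQuery("a b&c=<d>"));
        System.out.println(escapeHtml("<a href=\"x\">&</a>"));
    }
}
```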

Answered by István

You might want to remove the accents and diacritical marks first, then check at each character position whether the "simplified" string has an ASCII letter there. If it does, the original character at that position is a word character and is kept; if not, it can be removed.
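
The answer gives no code, so the following is only one way the idea might be sketched. It simplifies each code point separately, so that positions in the original and simplified text stay aligned. The names are hypothetical, digits are accepted alongside ASCII letters as an added assumption, and the input is assumed to use precomposed characters.

```java
import java.text.Normalizer;

// Hypothetical sketch of the approach described above.
final class AccentAwareFilter {
    static String keepWordCharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            // Decompose and drop combining marks: U+00FC becomes "u" + U+0308, then "u".
            String simplified = Normalizer.normalize(original, Normalizer.Form.NFD)
                                          .replaceAll("\\p{M}+", "");
            // Keep the ORIGINAL character when its simplified form is plain ASCII.
            if (simplified.matches("[A-Za-z0-9]+")) {
                out.append(original);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepWordCharacters("caf\u00e9 #1"));
    }
}
```

A limitation of this sketch is that letters with no canonical decomposition, such as ß, do not simplify to ASCII and are therefore removed.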

Answered by Mena

I was trying to achieve the exact opposite when I bumped into this thread. I know it's quite old, but here's my solution nonetheless. You can use Unicode blocks in the regex. In this case, compile the following code (with the right imports):

String s = "êìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
Matcher m = p.matcher(s);
System.out.println(m.find());
System.out.println(s.replaceAll(p.pattern(), "#"));

You should see the following output:

true

#blah

Best,

Answered by Paul Dinesh

You can use StringUtils from Apache.
