Note: this page is a side-by-side Chinese and English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/18623868/

Time: 2020-08-12 09:36:38  Source: igfitidea

Replace any non-ASCII character in a string in Java

java, regex, unicode

Asked by leba-lev

How would one convert -lrb-300-rrb-┬á922-6590 to -lrb-300-rrb- 922-6590 in Java?

I have tried the following:

t.lemma = lemma.replaceAll("\\p{C}", " ");
t.lemma = lemma.replaceAll("[\u0000-\u001f]", " ");

I am probably missing something conceptual and would appreciate any pointers to a solution.

Thank you.

Accepted answer by Paul Vargas

Try the following:

str = str.replaceAll("[^\\p{ASCII}]", " ");

By the way, \p{ASCII} matches all ASCII characters: [\x00-\x7F].
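A minimal sketch of the point above, under one assumption that is not confirmed by the question: the stray character in the input is U+00A0 (no-break space). The class and method names are made up for illustration. It checks that the negated \p{ASCII} class behaves like the negated [\x00-\x7F] range, and that \p{C} (the Unicode "Other" categories, which do not include space separators such as U+00A0) leaves that input unchanged, which may be why the attempt in the question had no effect.

```java
import java.util.regex.Pattern;

class NonAsciiDemo {
    // Both patterns match any character outside U+0000..U+007F.
    static final Pattern NON_ASCII_POSIX = Pattern.compile("[^\\p{ASCII}]");
    static final Pattern NON_ASCII_RANGE = Pattern.compile("[^\\x00-\\x7F]");
    // \p{C} covers the Unicode "Other" categories (control, format, unassigned, ...).
    static final Pattern OTHER_CATEGORY = Pattern.compile("\\p{C}");

    // Replaces every match of the given pattern with a single space.
    static String replace(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Assumption: the stray character is U+00A0 (no-break space).
        String input = "-lrb-300-rrb-\u00A0922-6590";

        // The two non-ASCII patterns give the same result.
        System.out.println(
            replace(NON_ASCII_POSIX, input).equals(replace(NON_ASCII_RANGE, input)));

        // U+00A0 is a space separator (Zs), so \p{C} does not touch it.
        System.out.println(replace(OTHER_CATEGORY, input).equals(input));
    }
}
```

If the stray character turns out to be something else, only the `input` literal needs to change; the patterns stay the same.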

On the other hand, you should use a Pattern constant to avoid recompiling the expression every time.

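import java.util.regex.Pattern;

public class Main {
    // Compiled once and reused, so the regex is not recompiled on every call.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("[^\\p{ASCII}]");

    public static void main(String[] args) {
        // Assumption: the non-ASCII character in the question's input is U+00A0.
        String input = "-lrb-300-rrb-\u00A0922-6590";
        System.out.println(
            REGEX_PATTERN.matcher(input).replaceAll(" ")
        );  // prints "-lrb-300-rrb- 922-6590"
    }
}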

Answer by assylias

Assuming you only want to keep a-zA-Z0-9 and punctuation characters, you could do:

t.lemma = lemma.replaceAll("[^\\p{Punct}\\w]", " ");
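A small sketch of this approach with a precompiled pattern. The class name, method name and sample strings are hypothetical, and the input again assumes the stray character is U+00A0. Two things to keep in mind: \w also keeps the underscore, and any whitespace other than a plain space (tabs, newlines) is replaced by a space as well, because it is neither punctuation nor a word character.

```java
import java.util.regex.Pattern;

class KeepWordAndPunctDemo {
    // Matches anything that is neither ASCII punctuation nor a word character
    // (by default \w is [a-zA-Z_0-9] and \p{Punct} is ASCII punctuation).
    static final Pattern NOT_WORD_OR_PUNCT = Pattern.compile("[^\\p{Punct}\\w]");

    // Replaces every character outside the kept set with a single space.
    static String clean(String input) {
        return NOT_WORD_OR_PUNCT.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Assumption: the stray character is U+00A0 (no-break space).
        System.out.println(clean("-lrb-300-rrb-\u00A0922-6590"));

        // The accented letter and the tab are both replaced; the underscore is kept.
        System.out.println(clean("caf\u00E9_1\t2"));
    }
}
```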