Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original address, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/6204562/

Date: 2020-10-30 14:49:48  Source: igfitidea

java regex to filter out non-English text

Tags: java, regex

Asked by Regex Rookie

I found a few references to regexes that filter out non-English text, but none of them is in Java, and they all address somewhat different problems from the two I am trying to solve:

  1. Replace all non-English characters with a space.
  2. Create a method that returns true if a string contains any non-English character.

By "English text" I mean not only actual letters and numbers but also punctuation.

So far, what I have been able to come up with for goal #1 is quite simple:

string.replaceAll("\\W", " ")

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But is there a more efficient way to do this?

Answered by Matt Ball

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything which isn't a letter, a number, or an underscore, including punctuation, spaces, tabs, and newline characters. Whether or not that's a problem is really up to you.
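
To make that concrete, here is a minimal sketch of goal #1 using \W. The class name NonWordDemo, the helper replaceNonWord, and the sample string are made up for illustration. Note that in Java source the backslash has to be doubled, so the regex \W is written as "\\W".

```java
final class NonWordDemo {
    // Replaces every character outside [a-zA-Z_0-9] with a single space.
    static String replaceNonWord(String text) {
        return text.replaceAll("\\W", " ");
    }

    public static void main(String[] args) {
        // The comma, '!', the tab and the accented letter are all replaced,
        // so punctuation is lost along with the non-English character.
        String sample = "Hi, world!\tcaf\u00e9";
        System.out.println("[" + replaceNonWord(sample) + "]");
    }
}
```

Letters, digits, and underscores pass through unchanged; everything else, punctuation included, becomes a space.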

By "English text" I mean not only actual letters and numbers but also punctuation.

In that case, you might want to use a negated character class which also leaves punctuation alone; something like

[^\w.,;:'"]
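
As a rough sketch, that character class could be used for goal #1 like this. The class and method names are hypothetical, and the list of punctuation is only the one shown above, so anything not listed (for example '?') is still replaced. In Java source the backslash and the double quote both need escaping.

```java
final class KeepPunctuationDemo {
    // Replaces everything except word characters and . , ; : ' " with a space.
    static String replaceNonEnglish(String text) {
        return text.replaceAll("[^\\w.,;:'\"]", " ");
    }

    public static void main(String[] args) {
        // The apostrophe, colon and comma survive; '?' and the two
        // Chinese characters are replaced because they are not listed.
        String sample = "It's 5:30, ok? \u4f60\u597d";
        System.out.println("[" + replaceNonEnglish(sample) + "]");
    }
}
```

Extending the class with further punctuation is a matter of adding the characters inside the brackets, escaping those that are special inside a character class (such as ] and \).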

Create a method that returns true if a string contains any non-English character.

Use Pattern and Matcher.

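import java.util.regex.Matcher;
import java.util.regex.Pattern;

// In Java source the backslash must be doubled: "\\W" is the regex \W.
Pattern p = Pattern.compile("\\W");

boolean containsSpecialChars(String string)
{
    Matcher m = p.matcher(string);
    return m.find(); // true if at least one non-word character is present
}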
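
If "English text" is taken to mean printable ASCII plus tab and line breaks (an assumption made here, not something the answer states), both goals could be sketched with one precompiled pattern. The class name AsciiTextFilter, the method names, and the character range are illustrative choices, not part of the original answer.

```java
import java.util.regex.Pattern;

final class AsciiTextFilter {
    // Assumption: "English" = printable ASCII (0x20-0x7E) plus tab, CR and LF.
    private static final Pattern NON_ENGLISH =
            Pattern.compile("[^\\x20-\\x7E\\t\\r\\n]");

    // Goal #1: replace every character outside the assumed range with a space.
    static String replaceNonEnglish(String text) {
        return NON_ENGLISH.matcher(text).replaceAll(" ");
    }

    // Goal #2: find() stops at the first offending character,
    // so no modified copy of the string has to be built.
    static boolean containsNonEnglish(String text) {
        return NON_ENGLISH.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonEnglish("Hello, world! #1"));
        System.out.println(containsNonEnglish("na\u00efve"));
        System.out.println("[" + replaceNonEnglish("caf\u00e9 ok") + "]");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll() and String.matches() would otherwise do.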

Answered by Eli Mashiah

Here is my solution. I assume the text may contain English words, punctuation marks and standard ASCII symbols such as #, %, @ etc.

// Each regex backslash is doubled in Java source; the class lists every allowed character.
private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";

private static boolean isEnglish(String text) {
    if (text == null) {
        return false;
    }
    return text.matches(IS_ENGLISH_REGEX);
}
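
A quick way to see what this whitelist accepts is to wrap it in a small class and try a few inputs. The wrapper class name and the sample strings below are made up for illustration; the regex is written with the backslashes doubled, as Java string literals require. Characters that are not listed, such as { or ~, cause a rejection.

```java
final class WhitelistCheckDemo {
    private static final String IS_ENGLISH_REGEX =
            "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; "
            + "\\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";

    static boolean isEnglish(String text) {
        if (text == null) {
            return false;
        }
        return text.matches(IS_ENGLISH_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isEnglish("Hello, world! 100% #1 (ok?)")); // only listed characters
        System.out.println(isEnglish("caf\u00e9"));                   // accented letter is not listed
        System.out.println(isEnglish("a{b}"));                        // braces are not listed
    }
}
```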

Answered by Gil SH

This works for me:

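import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Note: ISO-8859-1 also accepts accented Latin letters such as é, not only ASCII.
private static boolean isEnglish(String text) {
    CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
    CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
    return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}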
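
Because ISO-8859-1 is a superset of US-ASCII, the check above effectively accepts any Latin-1 text. The sketch below separates the two encoders to show the difference; the class and method names are hypothetical, and it uses the StandardCharsets constants instead of Charset.forName, which is a substitution on my part.

```java
import java.nio.charset.StandardCharsets;

final class EncoderCheckDemo {
    // Stricter variant: true only if every character is 7-bit ASCII.
    static boolean isAscii(String text) {
        return text != null && StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Looser variant: also accepts accented Latin-1 letters.
    static boolean isLatin1(String text) {
        return text != null && StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";       // Latin-1 but not ASCII
        String chinese = "\u4f60\u597d";     // neither ASCII nor Latin-1
        System.out.println(isAscii("Hello, world!") + " " + isLatin1("Hello, world!"));
        System.out.println(isAscii(accented) + " " + isLatin1(accented));
        System.out.println(isAscii(chinese) + " " + isLatin1(chinese));
    }
}
```

A new encoder is created on each call because CharsetEncoder instances are not safe for concurrent use.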

Answered by dogbane

Assuming an English word is made up of characters from: [a-zA-Z_0-9]

To return true if a string contains any non-English character, use string.matches:

return !string.matches("^\\w+$");
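
Wrapped in a method, that check might look like the sketch below. The class and method names are invented for illustration. Two consequences of the pattern are worth knowing: a space is not a word character, so ordinary sentences are reported as non-English, and the empty string is reported as well because + requires at least one character.

```java
final class WordOnlyCheck {
    // True if the string contains anything other than [a-zA-Z_0-9],
    // or if it is empty. matches() already tests the whole string,
    // so the ^ and $ anchors are redundant but harmless.
    static boolean containsNonWord(String string) {
        return !string.matches("^\\w+$");
    }

    public static void main(String[] args) {
        System.out.println(containsNonWord("hello_42"));
        System.out.println(containsNonWord("hello world"));
        System.out.println(containsNonWord("caf\u00e9"));
        System.out.println(containsNonWord(""));
    }
}
```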