Java: regular expression to filter out non-English text

Question

Asked by Regex Rookie

I found a few references to regex filtering out non-English text, but none of them is in Java, and they all address somewhat different problems than the ones I am trying to solve:

Replace all non-English characters with a space.
Create a method that returns true if a string contains any non-English character.

By "English text" I mean not only actual letters and numbers but also punctuation.

So far, what I have been able to come up with for goal #1 is quite simple:

string.replaceAll("\\W", " ")

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... is there a more efficient way to do this?

Answer 1

Answered by Matt Ball

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything that isn't a letter, a digit, or an underscore, including tabs and newline characters. Whether or not that's a problem is really up to you.

By "English text" I mean not only actual letters and numbers but also punctuation.

In that case, you might want a negated character class that leaves punctuation alone as well, something like:

[^\w.,;:'"]
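As a minimal sketch of goal #1 with that character class, the replacement might look like the following. The class name PunctuationKeepingFilter and the method replaceNonEnglish are made up for illustration. Note that the space character is not listed in the class, so spaces are matched too, but since they are replaced by spaces the result is unchanged.

```java
import java.util.regex.Pattern;

final class PunctuationKeepingFilter {
    // Matches anything that is neither a word character nor one of the listed punctuation marks.
    private static final Pattern NON_ENGLISH = Pattern.compile("[^\\w.,;:'\"]");

    // Hypothetical helper: replaces every matched character with a single space.
    static String replaceNonEnglish(String text) {
        return NON_ENGLISH.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The 'ö' and the '!' are not in the class, so both become spaces.
        System.out.println(replaceNonEnglish("Hello, w\u00f6rld!"));
    }
}
```

Any punctuation mark you want to keep, such as ! or ?, has to be added to the class explicitly.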

Create a method that returns true if a string contains any non-English character.

Use Pattern and Matcher.

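import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile once and reuse; the backslash must be doubled inside a Java string literal.
Pattern p = Pattern.compile("\\W");

boolean containsSpecialChars(String string)
{
    Matcher m = p.matcher(string);
    // find() returns true as soon as any non-word character is seen.
    return m.find();
}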
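Keep in mind that a \W pattern also reports spaces and punctuation, so a sentence like "Hello, world!" would count as containing special characters. If "English" is taken to mean the 7-bit ASCII range (an assumption, not something the question states), a variant using Java's \p{ASCII} class could be sketched as follows. The names AsciiCheck and containsNonAscii are hypothetical.

```java
import java.util.regex.Pattern;

final class AsciiCheck {
    // Assumption: "English text" means characters in the ASCII range 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    // Returns true if at least one character falls outside ASCII.
    static boolean containsNonAscii(String text) {
        return NON_ASCII.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonAscii("Hello, world!"));  // letters, space and punctuation are all ASCII
        System.out.println(containsNonAscii("na\u00efve"));     // contains 'ï', which is outside ASCII
    }
}
```

Because find() stops at the first match, this avoids building a modified copy of the string, unlike the replaceAll() and trim() approach from the question.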

Answer 2

Answered by Eli Mashiah

Here is my solution. I assume the text may contain English words, punctuation marks, and standard ASCII symbols such as #, %, @, etc.

private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";

private static boolean isEnglish(String text) {
    if (text == null) {
        return false;
    }
    return text.matches(IS_ENGLISH_REGEX);
}

Answer 3

Answered by Gil SH

This works for me:

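import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

private static boolean isEnglish(String text) {
    // ISO-8859-1 also covers accented Latin letters such as é, so those pass too.
    CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
    CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
    return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}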
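Since ISO-8859-1 is a superset of US-ASCII, the ASCII check never changes the result of the method above. A minimal sketch of the same idea with a single encoder might look like this; the names Latin1Check and fitsLatin1 are made up for illustration, and it uses StandardCharsets instead of Charset.forName.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1Check {
    // Hypothetical helper: true if every character can be encoded as ISO-8859-1.
    static boolean fitsLatin1(String text) {
        // A fresh encoder per call, because CharsetEncoder instances are not thread-safe.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("plain text"));
        System.out.println(fitsLatin1("caf\u00e9"));              // 'é' is part of ISO-8859-1
        System.out.println(fitsLatin1("\u65e5\u672c\u8a9e"));     // Japanese characters are not
    }
}
```

If accented letters should not count as English, swapping in StandardCharsets.US_ASCII would restrict the check to plain ASCII.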

Answer 4

Answered by dogbane

Assuming an English word is made up of characters from [a-zA-Z_0-9]:

To return true if a string contains any non-English character, use string.matches:

return !string.matches("^\\w+$");
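A small sketch of how this check behaves on a few inputs, assuming the same \w definition of "English"; the names WordOnlyCheck and containsNonWordChar are hypothetical. String.matches() must match the entire input, so the ^ and $ anchors are optional and are left out here.

```java
final class WordOnlyCheck {
    // Returns true if the string contains anything other than letters, digits, or underscores.
    static boolean containsNonWordChar(String s) {
        // matches() succeeds only if the whole string matches, so no anchors are needed.
        return !s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(containsNonWordChar("hello_123"));    // only word characters
        System.out.println(containsNonWordChar("hello world"));  // the space is not a word character
        System.out.println(containsNonWordChar(""));             // \w+ needs at least one character
    }
}
```

Note that spaces and punctuation count as non-English under this definition, and the empty string is reported as non-English because \w+ requires at least one character.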
