Java: remove strings from another string

Question

Asked by Mat B.

Let's say I have this list of words:

 String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};

Then I have this text:

 String text = "I would like to do a nice novel about nature AND people";

Is there a method somewhere out there that matches the stop words and removes them while ignoring case, like this?

 String noStopWordsText = remove(text, stopWords);

Result:

 " would like do nice novel nature people"

If you know a regex that would work, great, but I would really prefer something like a Commons solution that is a bit more performance oriented.

BTW, right now I'm using this Commons method, which lacks proper case-insensitive handling:

 private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
 private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};

 noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);

Answer 1

Accepted answer by Theo

This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n), where n is the length of the text.

import java.util.HashSet;
import java.util.Set;

Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...

String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;

while (index < sampleText.length()) {
  // the only word delimiter supported is space; if you want other
  // delimiters you have to do a series of indexOf calls and see which
  // one gives the smallest index, or use regex
  int nextIndex = sampleText.indexOf(" ", index);
  if (nextIndex == -1) {
    // no more spaces: the last word runs to the end of the text
    nextIndex = sampleText.length();
  }
  String word = sampleText.substring(index, nextIndex);
  if (!stopWords.contains(word.toLowerCase())) {
    clean.append(word);
    if (nextIndex < sampleText.length()) {
      // this adds the word delimiter, e.g. the following space
      clean.append(sampleText.substring(nextIndex, nextIndex + 1));
    }
  }
  index = nextIndex + 1;
}

System.out.println("Stop words removed: " + clean.toString());

Answer 2

Answered by Theo

Create a regular expression from your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string.

import java.util.regex.*;

// backslashes must be doubled inside a Java string literal
Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");

The ... in the pattern is just me being lazy; continue the list of stop words there.

Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on each call, so it is not very efficient to use in loops. Also, String's replaceAll takes no flags argument, so you can't pass the flag that makes the regular expression case insensitive; the only way to get that effect is an embedded (?i) at the start of the pattern itself.

Edit: I added \b around the pattern to make it match whole words only. I also added \s* to make it swallow any spaces that follow, which may not be necessary.
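Rather than typing the alternation by hand, the pattern could be built from the stopWords array in the question. The following is only a sketch under assumptions: the class name StopWordRegex and its two helper methods are made up for illustration, and Pattern.quote is used so that a stop word containing regex metacharacters is matched literally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class StopWordRegex {

    // Builds one case-insensitive pattern of the form \b(?:w1|w2|...)\b\s*
    static Pattern compile(String[] stopWords) {
        String alternation = Arrays.stream(stopWords)
                .map(Pattern::quote)               // treat each word as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word match together with the whitespace after it.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened stop-word list, for illustration only.
        Pattern p = compile(new String[]{"i", "a", "and", "about", "to"});
        System.out.println(
                remove("I would like to do a nice novel about nature AND people", p));
    }
}
```

Compiling the pattern once and reusing it avoids the per-call compilation cost mentioned above. Because of the trailing \s*, a stop word at the very end of the text leaves the space before it in place.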

Answer 3

Answered by Jigar Joshi

You can build a regular expression that matches all the stop words [for example " a ", note the spaces here] and end up with

str.replaceAll(regexpression,"");

OR

String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
String text = " I would like to do a nice novel about nature AND people ";

for (String stopword : stopWords) {
    text = text.replaceAll("(?i)" + stopword, " ");
}
System.out.println(text);

Output:

 would like do nice novel nature people

IdeOneDemo

There might be a better way.

Answer 4

Answered by fastcodejava

Split text on whitespace. Then loop through the array and append each word to a StringBuilder only if it is not one of the stop words.
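One way this could look is sketched below. The class name StopWordSplit and the method remove are invented for the example, the stop words are assumed to be stored in lower case, and runs of whitespace are collapsed to single spaces, which the original text may or may not want.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class StopWordSplit {

    // Keeps every word whose lower-cased form is not in stopWords.
    static String remove(String text, Set<String> stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;                      // skip stop words and the empty-input case
            }
            if (kept.length() > 0) {
                kept.append(' ');              // re-join the kept words with one space
            }
            kept.append(word);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Shortened stop-word list, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        System.out.println(
                remove("I would like to do a nice novel about nature AND people", stopWords));
    }
}
```

The HashSet lookup makes each membership test constant time on average, so the work is dominated by the split itself.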
