Removing stopwords from a String in Java

Question

Asked by JavaLearner

I have a string with lots of words, and I have a text file containing stopwords that I need to remove from my String. Let's say I have a String

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the String, it removes only the first one, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"

Here's my stopwordslist.txt file: Stopwords

How can I solve this problem? Here's what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Answer 1

Accepted answer by alain.janinm

The error occurs because you remove elements from the list you are iterating over. Let's say wordsList contains |word0|word1|word2|. If ii is equal to 1 and the if test is true, then you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of your list, hence word2 will never be tested.

From there, there are several solutions. For example, instead of removing values you can set the value to "". Or create a separate "result" list.
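
As a rough illustration of the "result" list idea, here is a minimal sketch. The class name, method name, and the small stopword set are made up for the example, and it assumes an exact, lower-cased match is what you want (the question uses contains instead). Because the input list is never modified, no index can be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultListExample {
    // Returns a new list holding only the words that are not stopwords.
    // The input list is left untouched, so adjacent stopwords are all tested.
    static List<String> withoutStopwords(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stopword set; in the question it is read from a file.
        Set<String> stopwords = new HashSet<>(Arrays.asList("so", "much", "and"));
        List<String> words = Arrays.asList("so", "much", "new", "and", "cool");
        // "so" and "much" are adjacent and both get dropped: prints "new cool"
        System.out.println(String.join(" ", withoutStopwords(words, stopwords)));
    }
}
```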

Answer 2

Answered by Darshan Lila

Try it the following way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

This way your final output will not contain the words you don't want. Just put the list of stop words in an array and replace them in the target string.
Output for my stopwords:

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

Answer 3

Answered by Vimal Bera

Why don't you use the approach below instead? It is easier to read and understand:

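for(String word : stopWords){ // stopWords: the array or list of words to remove
    // replaceAll takes a regex; replace would treat "\\s*" as literal text
    s = s.replaceAll("\\b" + word + "\\b\\s*", "");
}
System.out.println(s); // prints the string with the stopwords removed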

Answer 4

Answered by geert3

This is a much more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
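
If the stopwords already sit in a list, the alternation does not have to be typed by hand. The following is only a sketch of one way to build it: the class and method names are invented for the example, Pattern.quote is used so that a stopword containing regex metacharacters is matched literally, and the CASE_INSENSITIVE flag is my own addition. It assumes the list is not empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopwordRegex {
    // Builds \b(?:w1|w2|...)\b\s? from the given words, quoting each one.
    static Pattern compile(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // CASE_INSENSITIVE is an assumption; drop it for exact-case matching.
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word occurrence of a stopword, plus one trailing whitespace character.
    static String strip(String text, List<String> stopwords) {
        return compile(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("I", "this", "its");
        // prints "love phone, super fast"
        System.out.println(strip("I love this phone, its super fast", stopwords));
    }
}
```

Note that \b treats an apostrophe as a boundary, so a stopword "I" would still match the "I" in "I've".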

Answer 5

Answered by SMA

Try using the replaceAll API of String, like:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Answer 6

Answered by robin

Try the program below.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

Answer 7

Answered by Michal Lozinski

Try storing the stopwords in a Set, and then tokenize your string into a list. Afterwards you can simply use 'removeAll' to get the result.

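Set<String> stopwords = new HashSet<>(); // Set is an interface, so instantiate a HashSet
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
// StringUtils comes from Apache Commons Lang; String.join(" ", listOfStrings) also works on Java 8+
String result = StringUtils.join(listOfStrings, " ");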

No for loops are needed; they usually mean problems.
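
The "fill in the set with your file" step could look roughly like the sketch below. It assumes the file holds one stopword per line, as the reading loop in the question suggests. The class name, the trimming, and the lower-casing are my own choices for the example. The main method writes a throwaway temp file so the example can run without the real stopwordslist.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StopwordFile {
    // Reads one stopword per line, trimming whitespace and lower-casing each entry.
    // Blank lines are ignored.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    public static void main(String[] args) throws IOException {
        // Temp file standing in for the real stopword file (assumption for the demo).
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("I", "this", "", " and "), StandardCharsets.UTF_8);
        System.out.println(load(file).size() + " stopwords loaded"); // 3 stopwords loaded
        Files.delete(file);
    }
}
```

Since the set is lower-cased here, the tokens would need to be lower-cased as well before removeAll can match them.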

Answer 8

Answered by Navnath Chinchore

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

Answer 9

Answered by Inquisitor

It seems that once one stop word is removed from a sentence, you stop and move on to the next word. You need to remove all stop words in each sentence.

You should try changing your code:

From:

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

To something like:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that break is removed and stopword.contains(word) is changed to word.contains(stopword).
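
Removing by index inside a forward loop can still skip the element that slides into the freed slot. One alternative, shown here only as a sketch, is to remove through an Iterator, which keeps its position correct after each removal. The class name and the sample data are invented for the example, and it swaps the contains test for an exact lower-cased match against a Set, which is an assumption about the intended behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class InPlaceRemoval {
    // Removes stopwords from wordsList in place; Iterator.remove never skips the next element.
    static void removeStopwords(List<String> wordsList, Set<String> stopwords) {
        Iterator<String> it = wordsList.iterator();
        while (it.hasNext()) {
            if (stopwords.contains(it.next().toLowerCase())) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical stopwords; "and", "there's" and "so" are adjacent in the input.
        Set<String> stopwords = new HashSet<>(Arrays.asList("its", "and", "there's", "so"));
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("its super fast and there's so much".split("\\s")));
        removeStopwords(wordsList, stopwords);
        System.out.println(wordsList); // [super, fast, much]
    }
}
```

On Java 8 and later, wordsList.removeIf(w -> stopwords.contains(w.toLowerCase())) does the same thing in one line.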

Answer 10

Answered by Uttesh Kumar

Recently one of my projects required functionality to filter stop words, stem words, and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available in Maven. Hope this helps someone.

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>
