How to remove stop words in Java?

Disclaimer: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12469332/

Date: 2020-10-31 09:01:26  Source: igfitidea

How to remove stop words in java?

Tags: java, stop-words

Asked by pamiers

I want to remove stop words in java.


So, I read the stop words from a text file and store them in a Set:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader br = new BufferedReader(new FileReader("stopwords.txt"));
String line = null;
while ((line = br.readLine()) != null) {
    stopWords.add(line.trim());
}
br.close();

Then I read another text file.

Now I want to remove the duplicate strings in that text file.

How can I do this?

Accepted answer by Ashkrit Sharma

You want to remove duplicate words from a file; below is the high-level logic for doing so.

  • Read the file
  • Loop through the file contents (i.e. one line at a time)
    • Tokenize that line on spaces with a string tokenizer
    • Add each token to your set. This ensures only one entry per word.
  • Close the file

Now you have a set that contains all the unique words of the file.
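The steps above can be sketched as follows. Reading the file into lines is assumed to happen elsewhere (e.g. with Files.readAllLines), and the class and method names here are invented for illustration:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class UniqueWords {

    // Collect every distinct word of the text, one line at a time.
    // StringTokenizer with no explicit delimiters splits on whitespace.
    public static Set<String> uniqueWords(Iterable<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                words.add(tokenizer.nextToken()); // the Set ignores repeats
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = uniqueWords(List.of("the cat sat", "the cat ran"));
        System.out.println(words); // distinct words in first-seen order
    }
}
```

A LinkedHashSet is used so the words keep the order in which they first appeared; a plain HashSet would work too if order does not matter.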

Answered by Adel

Use a Set for the stop words:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader sw = new BufferedReader(new FileReader("StopWord.txt"));
for (String line; (line = sw.readLine()) != null; ) {
    stopWords.add(line.trim());
}
sw.close();

and an ArrayList for the input text file:

BufferedReader br = new BufferedReader(new FileReader("txt_file.txt"));
// build your ArrayList of words here

// deleteStopWord() removes every word listed in your stop-word set
public ArrayList<String> deleteStopWord(Set<String> stopWords, ArrayList<String> arraylist) {
    ArrayList<String> newList = new ArrayList<String>();
    int i = 0; // the original started at 3, which silently skips the first three tokens
    while (i < arraylist.size()) {
        if (!stopWords.contains(arraylist.get(i))) {
            newList.add(arraylist.get(i));
        }
        i++;
    }
    return newList;
}

arraylist = deleteStopWord(stopWords, arraylist);

Answered by Sri Harsha Chilakapati

Using an ArrayList may be easier.

public ArrayList<String> removeDuplicates(ArrayList<String> source) {
    ArrayList<String> newList = new ArrayList<String>();
    for (int i = 0; i < source.size(); i++) {
        String s = source.get(i);
        if (!newList.contains(s)) {
            newList.add(s);
        }
    }
    return newList;
}

Hope this helps.

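For what it's worth, the same result can be obtained by passing the list through a LinkedHashSet, which preserves first-seen order while discarding repeats. A minimal sketch (the class name here is invented):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class Dedup {

    // Equivalent to the removeDuplicates() above, but the
    // LinkedHashSet does the duplicate checking for us.
    public static ArrayList<String> removeDuplicates(List<String> source) {
        return new ArrayList<>(new LinkedHashSet<>(source));
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates(List.of("a", "b", "a", "c", "b")));
        // prints [a, b, c]
    }
}
```

This is also faster for large inputs: contains() on an ArrayList is a linear scan, so the loop above is O(n²), while the set-based version is roughly O(n).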

Answered by Uttesh Kumar

It may be a late reply, but I hope it helps someone. A few days back I created a small util library to remove stop/stemmer words from a given text; it is in the Maven repository and on GitHub:

exude library

Answered by Eric Wilson

If you simply want to remove a certain set of words from the words in a file, you can do it however you want. But if you are dealing with a problem involving natural language processing, you should use a library.


For example, using Lucenefor tokenizing will seem more complicated at first, but it will deal with myriad complications that you will overlook, and allow for great flexibility should you change your mind on the specific stopwords, on how you are tokenizing, whether you care about case, etc.


Answered by mP.

You should try using StringTokenizer.
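Combining StringTokenizer with the asker's stop-word Set gives a compact filter. A minimal sketch (the class and method names are invented here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordFilter {

    // Tokenize a line on whitespace and keep only the tokens
    // that do not appear in the stop-word set.
    public static List<String> filter(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "a", "of");
        System.out.println(filter("the quick brown fox", stop));
        // prints [quick, brown, fox]
    }
}
```

Note that the StringTokenizer Javadoc describes it as a legacy class retained for compatibility; String.split("\\s+") or java.util.Scanner are the usual modern alternatives.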