Removing stopwords from a String in Java
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/27685839/
Removing stopwords from a String in Java
Asked by JavaLearner
I have a string with lots of words and I have a text file which contains some Stopwords which I need to remove from my String. Let's say I have a String
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
After removing stopwords, string should be like :
"love phone, super fast much cool jelly bean....but recently bugs."
I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the String, it removes only the first one, and I get this result:
"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"
Here's my stopwordslist.txt file : Stopwords
How can I solve this problem? Here's what I have done so far:
int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
    FileReader fr=new FileReader("F:\\stopwordslist.txt");
    BufferedReader br= new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null){
        stopwords[k]=sCurrentLine;
        k++;
    }
    String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words){
        wordsList.add(word);
    }
    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList){
        System.out.print(str+" ");
    }
}catch(Exception ex){
    System.out.println(ex);
}
Accepted answer by alain.janinm
The error occurs because you remove elements from the list you are iterating over. Let's say you have wordsList that contains |word0|word1|word2|. If ii is equal to 1 and the if test is true, then you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented and is equal to 2, which is no longer less than the size of your list (now 2), so the loop ends and word2 is never tested.
From there, there are several solutions. For example, instead of removing values you can set the value to "". Or create a separate "result" list.
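The "result list" idea can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the class name ResultListExample, the method keepNonStopwords, and the sample stopwords are hypothetical, and it assumes a word counts as a stopword only when it equals a stopword exactly (ignoring case), rather than the substring check used in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ResultListExample {

    // Copies every non-stopword into a new list, so nothing is removed
    // from the list that is being iterated and no index is skipped.
    // Assumption: stopwords are stored in lower case and matched exactly.
    static List<String> keepNonStopwords(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stopword set; in the question it would come from the file.
        Set<String> stopwords = new HashSet<>(Arrays.asList("i", "this", "its", "and", "so"));
        List<String> words = Arrays.asList(
                "I love this phone, its super fast and so cool".split("\\s+"));
        // Adjacent stopwords ("and so") are both dropped.
        System.out.println(String.join(" ", keepNonStopwords(words, stopwords)));
        // prints "love phone, super fast cool"
    }
}
```

Because the input list is never modified, adjacent stopwords are handled the same way as isolated ones.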
Answer by Darshan Lila
Try it the following way:
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords[]={"love","this","cool"};
for(int i=0;i<stopWords.length;i++){
    if(s.contains(stopWords[i])){
        s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
    }
}
System.out.println(s);
This way your final output will not contain the words you don't want. Just put the stop words in an array and replace them in the target string.
Output for my stopwords:
I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.
Answer by Vimal Bera
Why not use the approach below instead? It is easier to read and understand:
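// words holds the stop words; replaceAll is used because replace() treats its argument as literal text, not a regex
for(String word : words){
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s);//It will print the string with the words removed.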
Answer by geert3
This is a much more elegant solution (IMHO), using only regular expressions:
// instead of the ".....", add all your stopwords, separated by "|"
// "\b" is to account for word boundaries, i.e. not replace "his" in "this"
// the "\s?" is to suppress optional trailing white space
Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
String s = m.replaceAll("");
System.out.println(s);
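If the stopwords come from a list rather than being typed into the pattern by hand, the alternation could be assembled along these lines. This is only a sketch under stated assumptions: the names StopwordRegex, buildPattern and removeStopwords are hypothetical, each stopword is escaped with Pattern.quote so it is matched literally, and case-insensitive matching is an added assumption that the answer above does not make.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopwordRegex {

    // Builds \b(?:w1|w2|...)\b\s? from the list, quoting each word so that
    // regex metacharacters inside a stopword are treated as plain text.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // CASE_INSENSITIVE is an assumption, not part of the original answer.
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    static String removeStopwords(String text, List<String> stopwords) {
        if (stopwords.isEmpty()) {
            // An empty alternation would match at every word boundary and eat spaces.
            return text;
        }
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical stopword list for illustration.
        List<String> stopwords = Arrays.asList("I", "this", "its", "and");
        System.out.println(removeStopwords("I love this phone, its super fast and cool", stopwords));
        // prints "love phone, super fast cool"
    }
}
```

Compiling the pattern once and reusing it is usually preferable when many strings are filtered with the same stopword list.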
Answer by SMA
Try using the replaceAll API of String, like:
String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);
OUTPUT:
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.
Answer by robin
Try the program below.
String s="I love this phone, its super fast and there's so" +
        " much new and cool things with jelly bean....but of recently I've seen some bugs.";
String[] words = s.split(" ");
ArrayList<String> wordsList = new ArrayList<String>();
Set<String> stopWordsSet = new HashSet<String>();
stopWordsSet.add("I");
stopWordsSet.add("THIS");
stopWordsSet.add("AND");
stopWordsSet.add("THERE'S");
for(String word : words)
{
    String wordCompare = word.toUpperCase();
    if(!stopWordsSet.contains(wordCompare))
    {
        wordsList.add(word);
    }
}
for (String str : wordsList){
    System.out.print(str+" ");
}
OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
Answer by Michal Lozinski
Try storing the stopwords in a Set, and then tokenise your string into a list. Afterwards you can simply use 'removeAll' to get the result.
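Set<String> stopwords = new HashSet<>();
//fill in the set with your file
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// copy into an ArrayList: the list returned by Arrays.asList is fixed-size and does not support removal
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
listOfStrings.removeAll(stopwords);
// StringUtils comes from Apache Commons Lang
String result = StringUtils.join(listOfStrings, " ");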
No for loops needed - they usually mean problems.
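The "fill in the set with your file" step might look something like the sketch below. It assumes the file holds one stopword per line, and the names StopwordFileExample, loadStopwords and removeStopwords are hypothetical. It also swaps in removeIf with a lower-cased comparison in place of removeAll, and String.join in place of StringUtils.join, so that matching ignores case and no extra library is needed; neither change comes from the answer itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFileExample {

    // Reads one stopword per line (an assumption about the file layout),
    // trimming whitespace, lower-casing, and skipping blank lines.
    static Set<String> loadStopwords(Path file) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim().toLowerCase();
            if (!trimmed.isEmpty()) {
                stopwords.add(trimmed);
            }
        }
        return stopwords;
    }

    // Tokenises on whitespace and drops every token found in the set.
    static String removeStopwords(String text, Set<String> stopwords) {
        // ArrayList copy, because Arrays.asList returns a fixed-size list.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        tokens.removeIf(token -> stopwords.contains(token.toLowerCase()));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real stopwordslist.txt.
        Path file = Files.createTempFile("stopwordslist", ".txt");
        Files.write(file, Arrays.asList("i", "this", "its", "and"), StandardCharsets.UTF_8);
        Set<String> stopwords = loadStopwords(file);
        System.out.println(removeStopwords("I love this phone, its super fast and cool", stopwords));
        // prints "love phone, super fast cool"
        Files.delete(file);
    }
}
```

Note that tokens with attached punctuation, such as "phone,", are compared as-is, so they only match a stopword that includes the same punctuation.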
Answer by Navnath Chinchore
You can use the replaceAll function like this:
String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");
Answer by Inquisitor
It seems that once one stop word has been removed from the sentence, your loop moves on past the next one; you need to remove all stop words in each sentence.
You should try changing your code:
From:
for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}
To something like:
for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}
Note that break is removed and stopword.contains(word) is changed to word.contains(stopword).
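Another way to keep removing in place without skipping the element after a removal is to walk the list from the end. The following is a hedged sketch rather than this answer's code: the names InPlaceRemoval and removeStopwordsInPlace are hypothetical, it keeps a break after each removal, and it assumes exact (lower-cased) equality instead of contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InPlaceRemoval {

    // Iterates from the last index down to 0. Removing index ii only shifts
    // the elements after ii, which have already been visited, so no word
    // is skipped even when stopwords are adjacent.
    static void removeStopwordsInPlace(List<String> wordsList, String[] stopwords) {
        for (int ii = wordsList.size() - 1; ii >= 0; ii--) {
            String word = wordsList.get(ii).toLowerCase();
            for (String stopword : stopwords) {
                // Assumption: exact match against lower-case stopwords.
                if (word.equals(stopword)) {
                    wordsList.remove(ii);
                    break; // the element at ii is gone, stop checking it
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical stopwords for illustration.
        String[] stopwords = {"i", "this", "its", "and", "so"};
        List<String> wordsList = new ArrayList<>(Arrays.asList(
                "I love this phone, its super fast and so cool".split("\\s+")));
        removeStopwordsInPlace(wordsList, stopwords);
        System.out.println(String.join(" ", wordsList));
        // prints "love phone, super fast cool"
    }
}
```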
Answer by Uttesh Kumar
Recently one of my projects required functionality to filter stop/stemming words and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available in Maven. Hope this may help someone.
https://github.com/uttesh/exude
<dependency>
    <groupId>com.uttesh</groupId>
    <artifactId>exude</artifactId>
    <version>0.0.2</version>
</dependency>