Python NLTK Stopword List
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/22763224/
NLTK Stopword List
Asked by saph_top
I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought would have been removed by this process. Any ideas about what has gone wrong would be great.
import nltk
from nltk.corpus import stopwords
word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print(filtered_words)
Accepted answer by Hooked
A few things of note.
If you are going to be checking membership against a list over and over, I would use a set instead of a list.
stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason.
You aren't reading the file properly: you are iterating over the file object, which yields whole lines, not over a list of the words split by spaces.
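To make the last two points concrete, here is a small sketch of what the original comprehension actually iterates over. It assumes a made-up in-memory file (io.StringIO) and a tiny hand-written stop set in place of the real file and of stopwords.words('english'), so it runs without NLTK; both are stand-ins for illustration only.

```python
import io

# Hypothetical stand-ins: an in-memory "file" and a tiny stop word set.
fake_file = io.StringIO("The cat sat on a mat\nA dog barked\n")
stops = {"a", "the", "on"}

# Iterating over a file object yields whole lines, newline included.
items = list(fake_file)

# No whole line is equal to a stop word, so nothing gets filtered out.
unfiltered = [w for w in items if w not in stops]

# Splitting each line into words and lowercasing before the test
# addresses both problems.
filtered = [w for line in items for w in line.split() if w.lower() not in stops]

print(unfiltered)
print(filtered)
```

The first list comes back unchanged, which is why "a" and "the" still appear in the question's output.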
Putting it all together:
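import nltk
from nltk.corpus import stopwords

# Build the set once so each membership test is fast
stops = set(stopwords.words('english'))

with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        # Split each line into words on whitespace
        for w in line.split():
            # Compare in lowercase, since the stop words are lowercase
            if w.lower() not in stops:
                print(w)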
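If you want the filtered_words list from the question rather than printed output, the same logic can be wrapped in a function. This is only a sketch: remove_stop_words, sample_lines and sample_stops are hypothetical names and data, not part of NLTK. With NLTK installed and the stopwords corpus downloaded, you could pass stopwords.words('english') as stops and an open file as lines.

```python
def remove_stop_words(lines, stops):
    """Return the words in `lines` that are not in `stops` (case-insensitive)."""
    # A set gives constant-time membership tests on average.
    stop_set = set(stops)
    return [w for line in lines for w in line.split() if w.lower() not in stop_set]


# Hypothetical sample data standing in for the file and the NLTK list.
sample_lines = ["The quick brown fox\n", "jumps over a lazy dog\n"]
sample_stops = ["a", "the", "over"]

filtered_words = remove_stop_words(sample_lines, sample_stops)
print(filtered_words)
```

Note that split() only breaks on whitespace, so a word with punctuation attached, such as "the,", would not match a stop word and would be kept.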