Python NLTK Stop Word List

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/22763224/

Date: 2020-08-19 01:37:10  Source: igfitidea

NLTK Stopword List

Tags: python, nltk, stop-words

Asked by saph_top

I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought would have been removed by this process. Any ideas about what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Accepted answer by Hooked

A few things of note.

  • If you are going to be checking membership against a list over and over, I would use a set instead of a list.

  • stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason.

  • You aren't reading the file properly: you are iterating over the file object, which yields whole lines, not a list of the words split by spaces.
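
The last two points can be seen in isolation with a small sketch. It assumes a tiny hand-written stop set in place of the NLTK list and uses io.StringIO as a stand-in for the opened file; both are illustrative choices, not part of the original answer.

```python
import io

# Hypothetical stand-in for stopwords.words('english'); the real list is far longer.
stops = {"a", "the", "is", "on"}

# io.StringIO behaves like an open text file for iteration purposes.
word_file = io.StringIO("The cat is on a mat\n")

# Iterating over the file yields whole lines, newline included, so no
# line ever equals a single stop word and nothing is filtered out.
as_lines = [w for w in word_file if w not in stops]

# Splitting gives words, but "The" is still kept because the case differs.
words = as_lines[0].split()
case_sensitive = [w for w in words if w not in stops]

# Lowercasing before the membership test removes it as well.
case_insensitive = [w for w in words if w.lower() not in stops]
```

Here as_lines still holds the full line, case_sensitive still contains "The", and only case_insensitive drops every stop word.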

Putting it all together:

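import nltk
from nltk.corpus import stopwords

# Build the set once so that each membership test is fast
stops = set(stopwords.words('english'))

# The with block closes the file when the loop finishes
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        # Split each line into words on whitespace
        for w in line.split():
            # Compare in lowercase, since the stop word list is lowercase
            if w.lower() not in stops:
                print(w)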
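
If a list is wanted rather than printed output, as with filtered_words in the question, the same loop can be collected into a list. This is only a sketch: the helper name filter_stopwords, the small demo_stops set, and the temporary file are assumptions made so that it runs without the NLTK data; in practice the stop set would be set(stopwords.words('english')).

```python
import os
import tempfile

def filter_stopwords(path, stops):
    """Return the words in the file at `path` that are not in `stops` (case-insensitive)."""
    with open(path, "r") as handle:
        return [w for line in handle for w in line.split() if w.lower() not in stops]

# Small hypothetical stop set so the sketch runs without NLTK data;
# in practice this would be set(stopwords.words('english')).
demo_stops = {"a", "the", "is", "on"}

# Write a throwaway file to filter.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The cat is on a mat\nA dog barks\n")

filtered_words = filter_stopwords(tmp.name, demo_stops)
os.remove(tmp.name)
print(filtered_words)
```

Passing the stop set in as an argument keeps the function independent of where the list comes from, so it can be tried with a small set before switching to the NLTK one.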