Python NLTK stop word list

Question

Asked by saph_top

I have the code below and I am trying to apply a stop word list to a list of words. However, the results still contain words such as "a" and "the", which I thought this process would remove. Any ideas about what has gone wrong would be appreciated.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Answer 1

Accepted answer by Hooked

A few things of note.

If you are going to check membership against a collection over and over, use a set instead of a list; a set lookup does not have to scan every element.
stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source contains capital letters and fails to match for that reason.
You aren't reading the file properly: iterating over the file object yields whole lines, not a list of the words split on whitespace.
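
The last two points can be seen without NLTK or a real file. The following is a minimal sketch, assuming an in-memory stand-in for the file (`fake_file`) and a tiny hand-written stop set (`stops`) in place of stopwords.words('english'); both names are illustrative and not part of the original code.

```python
import io

# Hypothetical stand-in for the file on disk, so the sketch needs no real file.
fake_file = io.StringIO("The cat sat on a mat\nA dog ran\n")

# Iterating over a file object yields whole lines, newline included.
items = list(fake_file)
print(items)  # ['The cat sat on a mat\n', 'A dog ran\n']

# Tiny hand-written stop set, assumed here in place of stopwords.words('english').
stops = {"a", "the", "on"}

# No whole line equals a stop word, so nothing is filtered out.
unfiltered = [w for w in items if w not in stops]

# Splitting into words helps, but capitalized words still slip through.
words = [w for line in items for w in line.split()]
case_sensitive = [w for w in words if w not in stops]
print(case_sensitive)  # ['The', 'cat', 'sat', 'mat', 'A', 'dog', 'ran']

# Lowercasing each word before the lookup removes them as intended.
filtered = [w for w in words if w.lower() not in stops]
print(filtered)  # ['cat', 'sat', 'mat', 'dog', 'ran']
```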

Putting it all together:

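from nltk.corpus import stopwords

# Build the set once so each membership test is a fast hash lookup.
stops = set(stopwords.words('english'))

# Read line by line, split each line into words, and compare in lowercase.
with open("xxx.y.txt", "r") as word_file:
    for line in word_file:
        for w in line.split():
            if w.lower() not in stops:
                print(w)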
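
If a list like the question's filtered_words is wanted rather than printed output, the same logic can be wrapped in a function. This is only a sketch: `filter_stopwords`, `sample_lines`, and `sample_stops` are hypothetical names introduced for illustration, and the sample data stands in for the real file and for the NLTK stop word list.

```python
def filter_stopwords(lines, stops):
    """Return the words from `lines` that are not in `stops` (case-insensitive)."""
    # Normalize the stop words into a set once for fast lookups.
    stop_set = {s.lower() for s in stops}
    return [w for line in lines for w in line.split() if w.lower() not in stop_set]


# Illustrative data; not taken from the original question.
sample_lines = ["The quick brown fox", "jumps over a lazy dog"]
sample_stops = ["the", "a", "over"]

filtered_words = filter_stopwords(sample_lines, sample_stops)
print(filtered_words)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

With NLTK, the same function could presumably be called with an open file object as `lines` and stopwords.words('english') as `stops`, provided the stop word corpus has been fetched beforehand with nltk.download('stopwords').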
