Python: adding words to the stop_words list in TfidfVectorizer in sklearn

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26826002/

adding words to stop_words list in TfidfVectorizer in sklearn

python, scikit-learn, classification, stop-words, text-classification

Asked by ac11

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in Adding words to scikit-learn's CountVectorizer's stop list. My stop word list now contains both the 'english' stop words and the stop words I specified. But TfidfVectorizer still does not accept my list of stop words, and I can still see those words in my features list. Below is my code:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words is the list of extra stop words I want to add (defined elsewhere)
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer=u'word', max_df=0.95, lowercase=True,
                             stop_words=set(my_stop_words), max_features=15000)
X = vectorizer.fit_transform(text)  # text: my corpus of documents (defined elsewhere)

I have also tried setting stop_words in TfidfVectorizer as stop_words=my_stop_words, but it still does not work. Please help.

Answered by yanhan

This is answered here: https://stackoverflow.com/a/24386751/732396

Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it and add your own words, then pass that variable to the stop_words argument as a list.

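A minimal sketch of that approach (the added words "foo" and "bar" are just placeholders, not from the original answer):

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# copy the built-in frozenset into a new frozenset that also contains the custom words
my_stop_words = text.ENGLISH_STOP_WORDS.union(["foo", "bar"])

# pass the combined stop words to TfidfVectorizer as a list
vectorizer = TfidfVectorizer(stop_words=list(my_stop_words))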

Answered by Pedram

Here is an example:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# add "book" to scikit-learn's built-in English stop word list
my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])

vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["this is a green apple.", "this is a machine learning book."])

# map each remaining term to its idf weight
# (get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out() there)
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, the word book is also removed from the list of features because we listed it as a stop word. As a result, TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.

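If you want to double-check which stop words the vectorizer will actually apply, a small follow-up to the example above is to inspect its effective stop word collection:

# returns the effective stop word collection the vectorizer will use;
# since a custom collection was passed in, this is a frozenset that
# includes "book" along with the built-in English stop words
print(vectorizer.get_stop_words())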

Answered by user2589273

For use with scikit-learn, you can always use a list as well:

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())

vectorizer = TfidfVectorizer(analyzer='word', stop_words=set(stop))
vectors = vectorizer.fit_transform(corpus)  # corpus: your list of documents
...

The only downside of this method, compared to using a set, is that the list may end up containing duplicates, which is why I convert it back to a set when passing it as the argument to TfidfVectorizer.

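If you prefer to avoid the duplicates up front, a small alternative sketch (reusing the placeholder words from the snippet above) is to build the list from a set union:

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

# the set union removes duplicates; sorted() just makes the resulting list deterministic
stop = sorted(set(stopwords.words('english')) | {'myword1', 'myword2', 'myword3'})

vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop)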