How is the TFIDFVectorizer in scikit-learn supposed to work?

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36800654/

python, nlp, scikit-learn

Asked by Jonathan

I'm trying to get words that are distinctive of certain documents using the TfIDFVectorizer class in scikit-learn. It creates a tfidf matrix with all the words and their scores in all the documents, but then it seems to count common words, as well. This is some of the code I'm running:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# `contents` is the list of document texts; `characters` holds the matching document labels
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names()
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = pd.Series(df.loc['Adam'])
s[s > 0].sort_values(ascending=False)[:10]

I expected this to return a list of distinctive words for the document 'Adam', but what it does is return a list of common words:

and     0.497077
to      0.387147
the     0.316648
of      0.298724
in      0.186404
with    0.144583
his     0.140998

I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus, finding words that appear frequently in one document, but not in other documents. Here, 'and' appears frequently in other documents, so I don't know why it's returning a high value here.

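As a sanity check on the weighting itself, here is a minimal sketch of the smoothed idf that the scikit-learn docs describe as the default (smooth_idf=True); if I'm reading it correctly, a word that appears in every document keeps an idf of 1.0 rather than 0, which might be part of what I'm seeing:

import math

n_docs = 8   # hypothetical corpus size, just for illustration
df_and = 8   # 'and' appears in every document

# Default smooth_idf=True: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_and = math.log((1 + n_docs) / (1 + df_and)) + 1
print(idf_and)  # 1.0 -- the idf never reaches 0, so a high-count word can still dominate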

The complete code I'm using to generate this is in this Jupyter notebook.

When I compute tf-idf scores semi-manually, using NLTK and computing the score for each word, I get appropriate results. For the 'Adam' document:

fresh        0.000813
prime        0.000813
bone         0.000677
relate       0.000677
blame        0.000677
enough       0.000677

That looks about right, since these are words that appear in the 'Adam' document, but not as much in other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.

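In rough terms, a semi-manual computation of this kind uses the classic textbook weighting; here is a simplified sketch (illustrative helper functions, not the exact notebook code). Note that under this scheme a word that occurs in every document gets an idf of exactly 0, which is why the stopwords drop out:

import math

def tf(term, tokens):
    # relative frequency of the term within one tokenized document
    return tokens.count(term) / len(tokens)

def idf(term, all_token_lists):
    # log of (number of documents / number of documents containing the term)
    n_containing = sum(1 for tokens in all_token_lists if term in tokens)
    return math.log(len(all_token_lists) / n_containing)

def tfidf(term, tokens, all_token_lists):
    return tf(term, tokens) * idf(term, all_token_lists)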

Am I doing something wrong with the scikit code? Is there another way to initialize this class where it returns the right results? Of course, I can ignore stopwords by passing stop_words = 'english', but that doesn't really solve the problem, since common words of any sort shouldn't have high scores here.

Answered by Sagar Waghmode

From scikit-learn documentation:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

As you can see, TfidfVectorizer is a CountVectorizer followed by a TfidfTransformer.

What you are probably looking for is TfidfTransformer and not TfidfVectorizer.

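A minimal sketch of that equivalence on a toy corpus (the documents here are purely illustrative); with the default parameters, the two routes should produce the same matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the cat sat", "the dog barked", "a cat and a dog"]  # toy corpus for illustration

# Two-step route: raw counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(docs)

print((two_step != one_step).nnz)  # 0 -> identical matrices under the default settings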

Answered by Rabbit

I believe your issue lies in using different stopword lists. Scikit-learn and NLTK use different stopword lists by default. For scikit-learn it is usually a good idea to have a custom stop_words list passed to TfidfVectorizer, e.g.:

from sklearn.feature_extraction.text import TfidfVectorizer

my_stopword_list = ['and', 'to', 'the', 'of']
my_vectorizer = TfidfVectorizer(stop_words=my_stopword_list)

Doc page for the TfidfVectorizer class: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

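If the goal is to match the NLTK-based computation, one option is to hand NLTK's own stopword list to scikit-learn (a sketch, assuming the NLTK stopwords corpus has been downloaded via nltk.download('stopwords')):

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk_stopwords = stopwords.words('english')            # NLTK's English stopword list
vectorizer = TfidfVectorizer(stop_words=nltk_stopwords)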

Answered by realmq

Using the code below, I get much better results:

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')

Output

sustain    0.045090
bone       0.045090
thou       0.044417
thee       0.043673
timely     0.043269
thy        0.042731
prime      0.041628
absence    0.041234
rib        0.041234
feel       0.040259
Name: Adam, dtype: float64

and

thee          0.071188
thy           0.070549
forbids       0.069358
thou          0.068068
early         0.064642
earliest      0.062229
dreamed       0.062229
firmness      0.062229
glistering    0.062229
sweet         0.060770
Name: Eve, dtype: float64

回答by Randy

I'm not sure why it's not the default, but you probably want sublinear_tf=True in the initialization of TfidfVectorizer. I forked your repo and sent you a PR with an example that probably looks more like what you want.

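For context, sublinear_tf=True replaces the raw term count with 1 + log(count), which damps the very large counts that function words pile up. A small before/after sketch on a toy corpus (illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the the the the the cat sat", "the dog barked"]  # toy corpus for illustration

raw_tf = TfidfVectorizer()                   # tf = raw term count
log_tf = TfidfVectorizer(sublinear_tf=True)  # tf = 1 + log(term count)

print(raw_tf.fit_transform(docs).toarray())
print(log_tf.fit_transform(docs).toarray())  # 'the' no longer dominates the first row as strongly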

Answered by user2827262

The answer to your question may lie in the size of your corpus and in the source code of the different implementations. I haven't looked into the NLTK code in detail, but 3-8 documents (as in the scikit-learn code) are probably not big enough to constitute a corpus: when corpora are constructed, news archives with hundreds of thousands of articles or thousands of books are used. Maybe across only 8 documents the frequencies of words like 'the' were not, overall, large enough to reflect how common these words are among those documents.

If you look at the source code, you might be able to find differences in implementation, such as whether they follow different normalization steps or use different frequency distributions (https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html lists common tf-idf variants).

Another thing that may help is looking at the raw term frequencies (CountVectorizer in scikit-learn) and checking whether words like 'the' really are over-represented in all documents.

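One quick way to run that check (a sketch, assuming contents is the same list of documents from the question) is to look at the document frequencies that CountVectorizer yields:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(contents)      # documents x terms matrix of raw counts

doc_freq = (counts > 0).sum(axis=0).A1   # number of documents each term occurs in
idx = cv.vocabulary_['the']              # column index of the term 'the'
print("'the' appears in", doc_freq[idx], "of", counts.shape[0], "documents")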