Python: tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/23792781/

Date: 2020-08-19 03:27:28  Source: igfitidea


Tags: python, scikit-learn, tf-idf

Asked by fast tooth

This page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

I then followed the code and used fit_transform() on my corpus. How do I get the weight of each feature computed by fit_transform()?

I tried:

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

But this attribute is missing.

Thanks

Accepted answer by YS-L

Since version 0.15, the idf weight learned for each feature (the corpus-wide part of its tf-idf score) can be retrieved via the attribute idf_ of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_  # one idf weight per vocabulary term
# On scikit-learn older than 1.0, call get_feature_names() instead.
print(dict(zip(vectorizer.get_feature_names_out(), idf.tolist())))

Output:

{'is': 1.0,
 'nice': 1.4054651081081644,
 'strange': 1.4054651081081644,
 'this': 1.0,
 'very': 1.0}
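
To see where these numbers come from, the sketch below recomputes the weights by hand. It assumes the default settings (smooth_idf=True) and the smoothed formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The names df, n_docs and manual_idf are illustrative only and not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Raw term counts; with default settings the columns follow the same
# alphabetical vocabulary order as the TfidfVectorizer.
counts = CountVectorizer().fit_transform(corpus).toarray()

# Document frequency: in how many documents each term occurs.
df = (counts > 0).sum(axis=0)
n_docs = len(corpus)

# Smoothed idf, assuming the default smooth_idf=True.
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

print(np.allclose(manual_idf, vectorizer.idf_))
```

Terms that occur in both documents get ln(3/3) + 1 = 1.0, and terms that occur in only one get ln(3/2) + 1, which is about 1.405, matching the dictionary above.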


As discussed in the comments, prior to version 0.15, a workaround is to access the attribute idf_ via the supposedly hidden _tfidf (an instance of TfidfTransformer) of the vectorizer:

idf = vectorizer._tfidf.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

This should give the same output as above.

Answer by aless80

See also this on how to get the TF-IDF values of all the documents:

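# Assumes tf is the fitted TfidfVectorizer and X = tf.fit_transform(corpus).
# On scikit-learn older than 1.0, call tf.get_feature_names() instead.
feature_names = tf.get_feature_names_out()
doc = 0
feature_index = X[doc, :].nonzero()[1]  # column indices of the terms present in this document
tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)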

this 0.448320873199
is 0.448320873199
very 0.448320873199
strange 0.630099344518

#and for doc=1
this 0.448320873199
is 0.448320873199
very 0.448320873199
nice 0.630099344518

I think the results are normalized per document, since the squared scores of a document sum to 1:

>>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
0.9999999999997548
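
The normalization can also be checked directly. The sketch below assumes the default TfidfVectorizer settings (norm='l2', smooth_idf=True, sublinear_tf=False): it verifies that every row has unit Euclidean length and rebuilds the matrix from raw counts and idf_. The names row_norms, raw and manual are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# With the default norm='l2', each document vector should have length 1.
row_norms = np.linalg.norm(X, axis=1)
print(np.allclose(row_norms, 1.0))

# Rebuild the matrix by hand: raw term counts times idf,
# then divide each row by its own L2 norm.
counts = CountVectorizer().fit_transform(corpus).toarray()
raw = counts * vectorizer.idf_
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)
print(np.allclose(manual, X))
```

This is why idf_ alone does not reproduce the per-document scores: the idf weights are shared across the corpus, while the final values also depend on each document's term counts and its own norm.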