Python: how to see the top n entries of a term-document matrix after tf-idf in scikit-learn
Original question: http://stackoverflow.com/questions/25217510/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow
How to see top n entries of term-document matrix after tfidf in scikit-learn
Asked by Amrith Krishna
I am new to scikit-learn, and I was using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I used the following code to obtain them.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,5), lowercase=True)
X = vectorizer.fit_transform(lectures)  # lectures is the list of document strings
Now, if I print X, I can see all the entries in the matrix, but how can I find the top n entries based on tf-idf score? In addition to that, is there any method that will help me find the top n entries based on tf-idf score per n-gram, i.e. the top entries among unigrams, bigrams, trigrams and so on?
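For context, printing the sparse matrix X only lists (document index, feature index) pairs next to their tf-idf values. Below is a minimal sketch of mapping the largest entries back to readable terms, assuming the vectorizer and X defined above and scikit-learn >= 1.0 (which provides get_feature_names_out); the helper names are illustrative:

import numpy as np

# View the nonzero entries of the term-document matrix as (row, col, value) triples.
coo = X.tocoo()
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn

top_n = 5
for i in np.argsort(coo.data)[::-1][:top_n]:
    print("doc {}, term {}: {:.3f}".format(coo.row[i], terms[coo.col[i]], coo.data[i]))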
Accepted answer by YS-L
Since version 0.15, the global term weighting of the features learnt by a TfidfVectorizer can be accessed through the attribute idf_, which will return an array of length equal to the feature dimension. Sort the features by this weighting to get the top weighted features:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)

# idf_ holds one global weight per feature; sorting its indices in
# descending order puts the highest-weighted features first.
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0

top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)
Output:
['food', 'drink']
The second problem, getting the top features per n-gram, can be solved using the same idea, with an extra step of splitting the features into groups by n-gram length:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(lectures)

# Group (feature, idf weight) pairs by the number of tokens in the feature.
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names(), vectorizer.idf_):  # get_feature_names_out() on scikit-learn >= 1.0
    features_by_gram[len(f.split(' '))].append((f, w))

top_n = 2
for gram, features in features_by_gram.items():
    # Within each n-gram group, keep the n features with the highest idf weight.
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)
Output:
1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']
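The idf_ attribute used above is a single global weight per feature. If instead you want the top n terms of each individual document ranked by its actual tf-idf score (the first part of the question), the sketch below follows the same idea, assuming the vectorizer and X from the example above and scikit-learn >= 1.0; the variable names are illustrative:

import numpy as np

terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
top_n = 2
for doc_id, row in enumerate(X.toarray()):
    # Indices of the n largest tf-idf scores within this document's row.
    top_idx = np.argsort(row)[::-1][:top_n]
    print(doc_id, [(terms[i], round(float(row[i]), 3)) for i in top_idx])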

