
Note: this page reproduces a StackOverflow question and its answers under the CC BY-SA 4.0 license; attribution belongs to the original authors. Source: http://stackoverflow.com/questions/34232190/


Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

Tags: python, scikit-learn, nlp, nltk, tf-idf

Asked by AbtPst

I am working on a keyword extraction problem. Consider the very general case:


from sklearn.feature_extraction.text import TfidfVectorizer

# `tokenize` is a user-defined tokenizer function (not shown in the question)
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))
query = 'tree cat travellers fruit jupiter'
response = tfidf.transform([query])
feature_names = tfidf.get_feature_names()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

and this gives me


  (0, 28)   0.443509712811
  (0, 27)   0.517461475101
  (0, 8)    0.517461475101
  (0, 6)    0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?


Accepted answer by hume

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:


import numpy as np

feature_array = np.array(tfidf.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

This gives me:


array([u'fruit', u'travellers', u'jupiter'], 
  dtype='<U13')

The argsort call is really the useful one (see the numpy documentation for details). We have to do [::-1] because argsort only supports sorting from small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.

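If you transform several documents at once, you can rank each row separately instead of flattening a single-row result. A minimal sketch (not part of the original answer), assuming feature_array from the snippet above and a response with one row per document; names such as dense and top_n_per_doc are purely illustrative:

n = 3
dense = response.toarray()  # shape: (n_documents, n_features)
top_n_per_doc = [feature_array[np.argsort(row)[::-1][:n]] for row in dense]
print(top_n_per_doc)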

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using \n\n instead means that we are actually looking at 4 documents (one for each paragraph), which makes more sense when you think about tfidf.

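For illustration, a minimal sketch of that suggestion, assuming t and tfidf as defined in the question (the name paragraphs is illustrative):

paragraphs = t.split("\n\n")           # 4 paragraphs, each treated as one document
tfs = tfidf.fit_transform(paragraphs)  # idf is now computed across those 4 documents
print(len(paragraphs), tfs.shape)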

Answered by Venkatachalam

A solution using the sparse matrix itself (without .toarray()):


import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]

X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())


new_doc = ['can key words in this new document be identified?',
           'idf is the inverse document frequency calculated for each of the words']
responses = tfidf.transform(new_doc)


def get_top_tf_idf_words(response, top_n=2):
    # response.data holds the nonzero tf-idf values of this single-row sparse matrix;
    # argsort plus the reverse slice picks the positions of the top_n largest values
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    # response.indices maps those positions back to feature (column) indices
    return feature_names[response.indices[sorted_nzs]]

print([get_top_tf_idf_words(response,2) for response in responses])

# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]
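If you also want the scores alongside the terms, a small variant of the helper above would return the tf-idf value for each term as well. A sketch only; the function name is illustrative, not from the original answer:

def get_top_tf_idf_words_with_scores(response, top_n=2):
    # same idea as above, but also return the tf-idf value for each term
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return list(zip(feature_names[response.indices[sorted_nzs]],
                    response.data[sorted_nzs]))

print([get_top_tf_idf_words_with_scores(response, 2) for response in responses])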