Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/37593293/
What is the simplest way to get tfidf with pandas dataframe?
Asked by user1610952
I want to calculate tf-idf from the documents below. I'm using python and pandas.
import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})
First, I thought I would need to get word_count for each row. So I wrote a simple function:
def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt
And then, I applied it to each row.
df['word_count'] = df['sent'].apply(word_count)
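As an aside, the same per-row counts can also be produced with the standard library's `collections.Counter`, which replaces the hand-rolled `word_count` function (a sketch under the same toy data as above):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# Counter(tokens) builds the same word -> count mapping in one call
df['word_count'] = df['sent'].apply(lambda s: Counter(s.split()))
```

Each cell of `df['word_count']` is then a `Counter`, which behaves like the dict returned by `word_count`.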
But now I'm lost. I know there's an easy method to calculate tf-idf if I use Graphlab, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest solution to get tf-idf?
Answered by arthur
The scikit-learn implementation is really easy:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify; see the TfidfVectorizer documentation for details.
The output of fit_transform will be a sparse matrix; if you want to visualize it, you can call x.toarray():
In [44]: x.toarray()
Out[44]:
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])