Note: this question and its answers come from Stack Overflow and are licensed under CC BY-SA 4.0. You are free to use and share them, but you must keep the same license and attribute them to the original authors. Original question: http://stackoverflow.com/questions/37593293/

What is the simplest way to get tfidf with pandas dataframe?

Tags: python, pandas, scikit-learn, tf-idf, gensim

Asked by user1610952

I want to calculate tf-idf for the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I thought I would need to get word_count for each row. So I wrote a simple function:

def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

And then, I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

But now I'm lost. I know there's an easy method to calculate tf-idf if I use Graphlab, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest solution to get tf-idf?

Answered by arthur

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify. See the scikit-learn documentation for TfidfVectorizer for the full list.
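As a rough illustration of a few of those parameters, here is a minimal sketch. The variable names (`docs`, `vec`, `mat`) are made up for this example, and the chosen settings are only one possible configuration, not a recommendation. It spells out some defaults explicitly and asks for bigrams in addition to single words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same three sentences as in the question (hypothetical variable name).
docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

vec = TfidfVectorizer(
    lowercase=True,       # default: lowercase the text before tokenizing
    ngram_range=(1, 2),   # use single words and pairs of adjacent words
    use_idf=True,         # default: weight terms by inverse document frequency
    smooth_idf=True,      # default: add one to document frequencies
    norm='l2',            # default: scale each row to unit Euclidean length
)
mat = vec.fit_transform(docs)

# 7 distinct words plus 8 distinct bigrams give 15 columns.
print(mat.shape)
```

With `ngram_range=(1, 2)` the vocabulary grows quickly on real corpora, so options such as `min_df` or `max_features` are often used alongside it.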

The output of fit_transform is a sparse matrix. If you want to inspect it, you can convert it to a dense array with x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])
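
Since the question asks about pandas, one possible way to make the array above easier to read is to wrap it in a DataFrame whose columns are the vocabulary terms. This is only a sketch: the names `terms` and `tfidf` are made up here, and the column labels are rebuilt from the fitted `vocabulary_` attribute (newer scikit-learn releases also provide `v.get_feature_names_out()` for this).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    'docId': [1, 2, 3],
    'sent': ['This is the first sentence',
             'This is the second sentence',
             'This is the third sentence'],
})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

# vocabulary_ maps each term to its column index in x,
# so sorting the terms by that index gives the column order.
terms = sorted(v.vocabulary_, key=v.vocabulary_.get)

# One row per document (indexed by docId), one column per term.
tfidf = pd.DataFrame(x.toarray(), columns=terms, index=df['docId'])
print(tfidf.round(3))
```

With the default settings, each weight is the term count times a smoothed idf, ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit length. For document 1, "first" appears in 1 of 3 documents, so its idf is ln(4/2) + 1 ≈ 1.693, while words that appear in all three documents get idf 1; dividing by the row norm sqrt(1.693² + 4) ≈ 2.620 gives the 0.646 and 0.382 seen in the array.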