What is the simplest way to get tf-idf with a pandas DataFrame in Python?

Question

Asked by user1610952

I want to calculate tf-idf for the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I thought I would need a word count for each row, so I wrote a simple function:

def word_count(sent):
    # Map each whitespace-separated word to its number of occurrences.
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

Then I applied it to each row:

df['word_count'] = df['sent'].apply(word_count)

But now I'm lost. I know there's an easy way to calculate tf-idf with Graphlab, but I want to stick with an open-source option. Both scikit-learn and gensim look overwhelming. What's the simplest way to get tf-idf?

Answer 1

Answered by arthur

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the TfidfVectorizer documentation.
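As a rough illustration of what those parameters do, here is a minimal sketch that rebuilds the question's DataFrame and passes two of them. The choice of `ngram_range=(1, 2)` and `norm=None` is only an assumption for demonstration, not a recommendation from the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Same data as in the question.
df = pd.DataFrame({
    'docId': [1, 2, 3],
    'sent': ['This is the first sentence',
             'This is the second sentence',
             'This is the third sentence'],
})

# ngram_range=(1, 2): use single words and two-word phrases as features.
# norm=None: skip the default L2 row normalization, leaving raw tf * idf.
v = TfidfVectorizer(ngram_range=(1, 2), norm=None)
x = v.fit_transform(df['sent'])

print(x.shape)               # (number of documents, number of features)
print(sorted(v.vocabulary_)) # the learned unigrams and bigrams
```

With normalization turned off, each entry is simply the term count multiplied by the smoothed idf, which can make the numbers easier to check by hand.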

The output of fit_transform is a sparse matrix. To inspect it, convert it to a dense array with x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])
