pandas — Insert result of sklearn CountVectorizer in a pandas dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must share it under the same CC BY-SA terms, link the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40370800/
Asked by Saurabh Sood
I have 14784 text documents, which I am trying to vectorize so I can run some analysis. I used the CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents.
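As a minimal, self-contained sketch of that step (the three-document examples list is hypothetical stand-in data, not the asker's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the 14784-document corpus
examples = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = CountVectorizer()
# Returns a scipy sparse matrix of shape (n_documents, n_vocabulary_terms)
features = vectorizer.fit_transform(examples)
print(features.shape)
```

The result is sparse on purpose: with 21343 terms, a dense matrix would waste memory on mostly-zero counts.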
Now, I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Answered by Nickil Maveli
Return term-document matrix after learning the vocab dictionary from the raw documents.
vect = CountVectorizer()
X = vect.fit_transform(docs)
Convert sparse csr matrix to dense format and allow columns to contain the array mapping from feature integer indices to feature names.
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())  # get_feature_names() on sklearn < 1.0
Concatenate the original df and the count_vect_df columnwise.
pd.concat([df, count_vect_df], axis=1)
Answered by Tchotchke
If your base data frame is df, all you need to do is:
import pandas as pd

# features is a sparse matrix, so densify it first
# (or use pd.DataFrame.sparse.from_spmatrix(features) to stay sparse)
features_df = pd.DataFrame(features.toarray())
combined_df = pd.concat([df, features_df], axis=1)
I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features when constructing the vectorizer, like vectorizer = CountVectorizer(max_features=1000), to limit the number of features.
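For instance, a sketch combining both suggestions on hypothetical documents (note that stop_words and max_features are CountVectorizer constructor arguments, not fit_transform arguments):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents for illustration
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps all day and the fox hunts at night",
]

# Drop common English stop words and keep only the 5 most frequent terms
vectorizer = CountVectorizer(stop_words="english", max_features=5)
features = vectorizer.fit_transform(docs)
print(features.shape)  # at most 5 columns
```

With the asker's 21343-term vocabulary, capping max_features like this would shrink the concatenated dataframe from over 21 thousand count columns to whatever budget the analysis can afford.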