pandas — Insert result of sklearn CountVectorizer in a pandas dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must share it under the same CC BY-SA terms, link the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40370800/
Asked by Saurabh Sood
I have 14784 text documents, which I am trying to vectorize so I can run some analysis. I used the CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents.
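As a minimal, self-contained sketch of that step (the three-document examples list is hypothetical stand-in data, not the asker's corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the 14784-document corpus
examples = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = CountVectorizer()
# Returns a scipy sparse matrix of shape (n_documents, n_vocabulary_terms)
features = vectorizer.fit_transform(examples)
print(features.shape)
```

The result is sparse on purpose: with 21343 terms, a dense matrix would waste memory on mostly-zero counts.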
Now, I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Answered by Nickil Maveli
Return term-document matrix after learning the vocab dictionary from the raw documents.
vect = CountVectorizer()
X = vect.fit_transform(docs)
Convert sparse csr matrix to dense format and allow columns to contain the array mapping from feature integer indices to feature names.
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())  # get_feature_names() on sklearn < 1.0
Concatenate the original df and the count_vect_df columnwise.
pd.concat([df, count_vect_df], axis=1)
Answered by Tchotchke
If your base data frame is df, all you need to do is:
import pandas as pd

# features is a sparse matrix, so densify it first
# (or use pd.DataFrame.sparse.from_spmatrix(features) to stay sparse)
features_df = pd.DataFrame(features.toarray())
combined_df = pd.concat([df, features_df], axis=1)
I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features when constructing the vectorizer, like vectorizer = CountVectorizer(max_features=1000), to limit the number of features.
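For instance, a sketch combining both suggestions on hypothetical documents (note that stop_words and max_features are CountVectorizer constructor arguments, not fit_transform arguments):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents for illustration
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps all day and the fox hunts at night",
]

# Drop common English stop words and keep only the 5 most frequent terms
vectorizer = CountVectorizer(stop_words="english", max_features=5)
features = vectorizer.fit_transform(docs)
print(features.shape)  # at most 5 columns
```

With the asker's 21343-term vocabulary, capping max_features like this would shrink the concatenated dataframe from over 21 thousand count columns to whatever budget the analysis can afford.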