Python: Vectorizing a Pandas DataFrame for Scikit-Learn

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/20024584/

Time: 2020-08-18 19:17:43  Source: igfitidea

Vectorizing a Pandas dataframe for Scikit-Learn

python pandas scikit-learn

Asked by Amelio Vazquez-Reina

Say I have a dataframe in Pandas like the following:

> my_dataframe

col1   col2
A      foo
B      bar
C      something
A      foo
A      bar
B      foo

where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.

How can I vectorize this efficiently using e.g. DictVectorizer?

Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?

Answer by alko

First, I don't see which parts of your sample array are features and which are observations.

Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and their mapping to column indices. It returns a feature matrix (a SciPy sparse matrix by default, or a NumPy array with sparse=False) that is used for further computations. The matrix has shape number of observations x number of features, with each value equal to the feature value for that observation. So if you know your observations and features, you can create this array any other way you like.
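
As a minimal sketch of that point, the snippet below fits a DictVectorizer on a few hand-written dicts and inspects what it stores. The `records` list is hypothetical sample data, not the asker's frame, and `sparse=False` is chosen only so the result prints as a dense NumPy array.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical observations: one dict per row, keyed by feature name.
records = [
    {"col1": "A", "col2": "foo"},
    {"col1": "B", "col2": "bar"},
    {"col1": "C", "col2": "foo"},
]

# sparse=False asks for a dense NumPy array instead of the default sparse matrix.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# The fitted vectorizer keeps only metadata: the feature names and
# the mapping from each feature name to its column index.
print(vectorizer.feature_names_)
print(vectorizer.vocabulary_)

# One row per observation, one column per (feature, value) pair.
print(X.shape)  # (3, 5)
```

String values are one-hot encoded as `name=value` columns, which is why three observations over two string columns yield five features here.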

If you expect sklearn to do it for you, you don't have to reconstruct the dicts manually, as this can be done with to_dict applied to the transposed dataframe:

>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.T.to_dict().values()
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]


Since pandas 0.13.0 (Jan 3, 2014) the to_dict() method accepts a new 'records' option, so now you can simply use this method without additional manipulations:

>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.to_dict('records')
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]
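
To connect this back to the question, here is a hedged sketch of passing those records straight to DictVectorizer. It assumes a reasonably recent pandas and scikit-learn; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row; no transpose or manual loop needed.
records = df.to_dict(orient="records")

vectorizer = DictVectorizer()          # returns a SciPy sparse matrix by default
X = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)
print(X.shape)       # (6, 5): six rows, three col1 levels plus two col2 levels
print(X.toarray())   # dense view, for inspection only
```

Calling `toarray()` is only for display; most scikit-learn estimators accept the sparse matrix directly.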

Answer by Matt

Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.

Answer by foglerit

You want to build a design matrix from a pandas DataFrame containing categoricals (or simply strings), and the easiest way to do it is using patsy, a library that replicates and expands R's formula functionality.

Using your example, the conversion would be:

import pandas as pd
import patsy

my_df = pd.DataFrame({'col1':['A', 'B', 'C', 'A', 'A', 'B'], 
                      'col2':['foo', 'bar', 'something', 'foo', 'bar', 'foo']})

patsy.dmatrix('col1 + col2', data=my_df) # With added intercept
patsy.dmatrix('0 + col1 + col2', data=my_df) # Without added intercept

The resulting design matrices are just NumPy arrays with some extra information and can be directly used in scikit-learn.

Example result with intercept added:

DesignMatrix with shape (6, 5)
  Intercept  col1[T.B]  col1[T.C]  col2[T.foo]  col2[T.something]
          1          0          0            1                  0
          1          1          0            0                  0
          1          0          1            0                  1
          1          0          0            1                  0
          1          0          0            0                  0
          1          1          0            1                  0
  Terms:
    'Intercept' (column 0)
    'col1' (columns 1:3)
    'col2' (columns 3:5)

Note that patsy tries to avoid multicollinearity by incorporating the effects of A and bar into the intercept. That way, for example, the col1[T.B] predictor should be interpreted as the additional effect of B relative to observations that are classified as A.
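
To illustrate the reference-level idea without patsy, the sketch below uses a different tool, pandas.get_dummies with drop_first=True, which drops the first level of each column in the same spirit. This is a swapped-in technique for illustration, not what patsy does internally, and it adds no intercept column.

```python
import pandas as pd

my_df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "something", "foo", "bar", "foo"],
})

# drop_first=True removes the first (sorted) level of each column,
# here A and bar, so they act as the reference levels.
dummies = pd.get_dummies(my_df, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

A row that is all zeros (such as the one with A and bar) is an observation at the reference level of every column, which is the case the intercept absorbs in the patsy output above.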

Answer by Kris

You can definitely use DictVectorizer. Because DictVectorizer expects an iterable of dict-like objects, you could do the following:

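from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer


# BaseEstimator supplies get_params/set_params, which pipelines rely on
class RowIterator(BaseEstimator, TransformerMixin):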
    """ Prepare dataframe for DictVectorizer """
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (row for _, row in X.iterrows())


vectorizer = make_pipeline(RowIterator(), DictVectorizer())

# now you can use vectorizer as you might expect, e.g.
vectorizer.fit_transform(df)
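
As a rough end-to-end check, the sketch below runs this pipeline on the sample frame from the earlier answer and compares it with the to_dict route. It assumes a recent scikit-learn and pandas; deriving RowIterator from BaseEstimator as well as TransformerMixin is a precaution for newer scikit-learn releases rather than something the answer requires.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline


class RowIterator(BaseEstimator, TransformerMixin):
    """Yield each DataFrame row as a dict-like pandas Series."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (row for _, row in X.iterrows())


df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# Route 1: the pipeline, which feeds rows to DictVectorizer lazily.
pipeline_result = make_pipeline(RowIterator(), DictVectorizer()).fit_transform(df)

# Route 2: materialise all rows as dicts first.
records_result = DictVectorizer().fit_transform(df.to_dict(orient="records"))

# Both routes should produce the same one-hot matrix.
same = bool((pipeline_result.toarray() == records_result.toarray()).all())
print(pipeline_result.shape, same)
```

The pipeline form is mainly convenient when the vectorizer is one step among several; for a one-off conversion, to_dict with the records orientation is shorter.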