Python: Vectorizing a Pandas DataFrame for Scikit-Learn

Question

Asked by Amelio Vazquez-Reina

Say I have a dataframe in Pandas like the following:

> my_dataframe

col1   col2
A      foo
B      bar
C      something
A      foo
A      bar
B      foo

where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.

How can I vectorize this efficiently using e.g. DictVectorizer?

Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?

Answer 1

Answered by alko

First, I can't tell which parts of your sample array are features and which are observations.

Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and the feature-to-column mapping. It returns a matrix (a SciPy sparse matrix by default, or a NumPy array with sparse=False) that is used for further computation. The shape of this feature matrix is number of observations x number of features, and each entry is the value of a feature for an observation. So if you know your observations and features, you can create this array any other way you like.
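As one illustration of building the feature matrix "any other way", here is a minimal sketch that one-hot encodes the string columns with pandas' own get_dummies instead of DictVectorizer. This is a swapped-in technique, not something the answer itself uses, and the frame below is assumed sample data.

```python
import pandas as pd

# Assumed sample data: two categorical (string) columns, six observations.
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
    'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
})

# One indicator column per (column, value) pair; dtype=float gives a numeric matrix.
dummies = pd.get_dummies(df, dtype=float)

# Plain NumPy array with one row per observation and one column per feature.
X = dummies.to_numpy()

print(list(dummies.columns))
print(X.shape)
```

Each row contains exactly one 1 per original column, so the array can be passed to a scikit-learn estimator as X.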

If you want sklearn to do it for you, you don't have to build the dicts manually: apply to_dict to the transposed dataframe:

>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> list(df.T.to_dict().values())  # list() is needed on Python 3, where values() returns a view
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]

Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a new 'records' option, so now you can simply use this method without additional manipulation:

>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.to_dict('records')
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]
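To complete the picture, here is a minimal sketch of feeding those records to DictVectorizer. It assumes the same sample frame as above; sparse=False is chosen only so the result is a dense NumPy array that is easy to inspect.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
    'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
})

# One dict per row, keyed by column name.
records = df.to_dict('records')

# sparse=False returns a dense array; the default is a SciPy sparse matrix.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# String values become one-hot features named "<column>=<value>".
print(vec.feature_names_)
print(X.shape)
```

The fitted vectorizer keeps the feature-to-column mapping, so calling vec.transform on records from new rows produces columns in the same order.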

Answer 2

Answered by Matt

Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.

Answer 3

Answered by foglerit

You want to build a design matrix from a pandas DataFrame containing categoricals (or simply strings), and the easiest way to do that is with patsy, a library that replicates and extends R's formula functionality.

Using your example, the conversion would be:

import pandas as pd
import patsy

my_df = pd.DataFrame({'col1':['A', 'B', 'C', 'A', 'A', 'B'], 
                      'col2':['foo', 'bar', 'something', 'foo', 'bar', 'foo']})

patsy.dmatrix('col1 + col2', data=my_df) # With added intercept
patsy.dmatrix('0 + col1 + col2', data=my_df) # Without added intercept

The resulting design matrices are just NumPy arrays with some extra information and can be directly used in scikit-learn.

Example result with intercept added:

DesignMatrix with shape (6, 5)
  Intercept  col1[T.B]  col1[T.C]  col2[T.foo]  col2[T.something]
          1          0          0            1                  0
          1          1          0            0                  0
          1          0          1            0                  1
          1          0          0            1                  0
          1          0          0            0                  0
          1          1          0            1                  0
  Terms:
    'Intercept' (column 0)
    'col1' (columns 1:3)
    'col2' (columns 3:5)

Note that patsy tries to avoid multicollinearity by absorbing the effects of A and bar into the intercept. That way, for example, the col1[T.B] predictor should be interpreted as the additional effect of B relative to observations that are classified as A.
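The multicollinearity point can be checked numerically. The sketch below does not use patsy; it assumes a small hypothetical frame and uses pandas get_dummies plus NumPy to compare an intercept with full dummy coding against an intercept with one reference level dropped per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the frame from the question).
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
    'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
})

intercept = np.ones((len(df), 1))

# Intercept + every level of every column: the dummies of each column sum to
# the intercept column, so the columns are linearly dependent.
full = np.hstack([intercept, pd.get_dummies(df, dtype=float).to_numpy()])

# Intercept + all levels except the first of each column (reference coding).
reduced = np.hstack(
    [intercept, pd.get_dummies(df, drop_first=True, dtype=float).to_numpy()]
)

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print(full.shape, full_rank)
print(reduced.shape, reduced_rank)
```

For this data the full coding has six columns but only rank 4, while the reduced coding has four columns and rank 4, which is why a reference level per categorical is folded into the intercept.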

Answer 4

Answered by Kris

You can definitely use DictVectorizer. Because DictVectorizer expects an iterable of dict-like objects, you could do the following:

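from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer


# BaseEstimator supplies get_params/set_params, which scikit-learn expects of pipeline steps
class RowIterator(BaseEstimator, TransformerMixin):
    """ Prepare dataframe for DictVectorizer """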
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (row for _, row in X.iterrows())


vectorizer = make_pipeline(RowIterator(), DictVectorizer())

# now you can use vectorizer as you might expect, e.g.
vectorizer.fit_transform(df)
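A possible alternative that avoids a custom class is sketched below: it wraps to_dict('records') in scikit-learn's FunctionTransformer. The helper name frame_to_records and the sample frames are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def frame_to_records(frame):
    """Hypothetical helper: turn a DataFrame into one dict per row."""
    return frame.to_dict('records')


# Assumed sample data.
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'A', 'A', 'B'],
    'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
})

vectorizer = make_pipeline(
    FunctionTransformer(frame_to_records),
    DictVectorizer(sparse=False),  # dense output for easy inspection
)

X = vectorizer.fit_transform(df)

# New rows are encoded with the mapping learned during fit.
new_rows = pd.DataFrame({'col1': ['C'], 'col2': ['bar']})
X_new = vectorizer.transform(new_rows)

print(X.shape)
print(X_new)
```

Because the conversion is a step in the pipeline, the same DataFrame-in, matrix-out interface is available for both fitting and later prediction.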
