How to discretize values in a Pandas DataFrame and convert to a binary matrix?

Question

Asked by Uri Laserson

I mean something like this:

I have a DataFrame with columns that may be categorical or nominal. For each observation (row), I want to generate a new row where every possible value for the variables is now its own binary variable. For example, this matrix (first row is column labels)

'a'     'b'     'c'
one     0.2     0
two     0.4     1
two     0.9     0
three   0.1     2
one     0.0     4
two     0.2     5

would be converted into something like this:

'a'              'b'                                                    'c'
one  two  three  [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]   0   1   2   3   4   5
 1    0     0        0          1          0          0          0       1   0   0   0   0   0
 0    1     0        0          0          1          0          0       0   1   0   0   0   0
 0    1     0        0          0          0          0          1       1   0   0   0   0   0
 0    0     1        1          0          0          0          0       0   0   1   0   0   0
 1    0     0        1          0          0          0          0       0   0   0   0   1   0
 0    1     0        0          1          0          0          0       0   0   0   0   0   1

Each variable (column) in the initial matrix gets binned into all the possible values. If it's categorical, then each possible value becomes a new column. If it's a float, then the values are binned in some way (say, always splitting into 10 bins). If it's an int, then it can be every possible int value, or perhaps also binned.

FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would also be willing to skip this step, as I am really trying to compute a Burt table (which is a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross-tabulation from the binary matrix is just a simple matrix multiplication.

Answer 1

Accepted answer by lbolla

You can use some kind of broadcasting:

    In [58]: df
    Out[58]:
           a    b  c
    0    one  0.2  0
    1    two  0.4  1
    2    two  0.9  0
    3  three  0.1  2
    4    one  0.0  4
    5    two  0.2  5

    In [41]: (df.a.values[:,numpy.newaxis] == df.a.unique()).astype(int)
    Out[41]:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 0, 0],
           [0, 1, 0]])

    In [54]: ((0 <= df.b.values[:,numpy.newaxis]) & (df.b.values[:,numpy.newaxis] < 0.2)).astype(int)
    Out[54]:
    array([[0],
           [0],
           [0],
           [1],
           [1],
           [0]])

    In [59]: (df.c.values[:,numpy.newaxis] == df.c.unique()).astype(int)
    Out[59]:
    array([[1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 1]])

And then join all the pieces together with pandas.concat or similar.
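The broadcasting idea above extends to all bins of a float column at once by comparing against arrays of lower and upper edges. The following is only a minimal sketch: the bin edges, the label format, and the names `edges`, `in_bin`, and `binary` are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Assumed bin edges: five equal-width bins over [0, 1], as in the question.
edges = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
lower, upper = edges[:-1], edges[1:]

# Shape (n_rows, 1) against shape (n_bins,) broadcasts to (n_rows, n_bins).
b = df["b"].to_numpy()[:, np.newaxis]
in_bin = (b >= lower) & (b < upper)
# Close the last bin on the right so a value equal to 1.0 is not lost.
in_bin[:, -1] |= b[:, 0] == edges[-1]
b_ind = in_bin.astype(int)

# Same trick for the categorical column, one column per distinct value.
a_values = df["a"].to_numpy(dtype=object)
a_levels = pd.unique(a_values)
a_ind = (a_values[:, np.newaxis] == a_levels).astype(int)

# Join the pieces column-wise into one binary DataFrame.
labels = [f"[{lo:.1f},{hi:.1f})" for lo, hi in zip(lower, upper)]
binary = pd.concat(
    [
        pd.DataFrame(a_ind, columns=a_levels, index=df.index),
        pd.DataFrame(b_ind, columns=labels, index=df.index),
    ],
    axis=1,
)
print(binary)
```

Note that the intermediate boolean array has one entry per row and bin, so for millions of rows and many bins the memory cost may be worth checking.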

Answer 2

Answer by Wes McKinney

Note that I have implemented new cut and qcut functions for discretizing continuous data:

http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
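As a rough illustration of how cut and qcut might be applied to the b column from the question, here is a short sketch. The bin edges and the variable names are assumptions, and combining cut with get_dummies is one possible way to reach indicator columns rather than something this answer prescribes.

```python
import pandas as pd

b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# cut: fixed, user-chosen edges; right=False makes the bins half-open [lo, hi).
binned = pd.cut(b, bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)

# One 0/1 column per bin; empty bins still get an all-zero column
# because the result of cut is categorical.
b_indicators = pd.get_dummies(binned, dtype=int)

# qcut: edges are taken from the data's quantiles, so each bin holds
# roughly the same number of observations (here: two halves).
halves = pd.qcut(b, q=2)

print(b_indicators)
print(halves.value_counts())
```

With right=False a value exactly equal to the last edge (1.0 here) would fall outside every bin and become NaN, so the edges may need adjusting for real data.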

Answer 3

Answer by wonderkid2

For labeled columns like the a and c columns in your example, you can use the pandas built-in function get_dummies().

Example:

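import numpy as np
import pandas as pd

s1 = ['a', 'b', np.nan]
# dtype=int gives 0/1 columns; recent pandas versions default to booleans.
# The NaN entry produces a row of zeros.
pd.get_dummies(s1, dtype=int)
       a  b
    0  1  0
    1  0  1
    2  0  0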
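Applied to the DataFrame from the question, the same function can expand several columns in one call. This is a sketch only: the choice of the `columns` and `dtype` arguments and the name `dummies` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Expand only 'a' and 'c'; 'b' is left untouched and would still need binning.
# Each new column is named <original column>_<value>.
dummies = pd.get_dummies(df, columns=["a", "c"], dtype=int)
print(dummies)
```

Only the integer values that actually occur in c get a column, so the value 3 from the question's desired output does not appear.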

Answer 4

Answer by elyase

I doubt you will beat patsy's simplicity. It was designed precisely for this task:

>>> from patsy import dmatrix
>>> dmatrix('C(a) + C(b) + C(c) - 1', df, return_type='dataframe')

   C(a)[one]  C(a)[three]  C(a)[two]  C(b)[T.0.1]  C(b)[T.0.2]  C(b)[T.0.4]   C(b)[T.0.9]  C(c)[T.1]  C(c)[T.2]  C(c)[T.4]  C(c)[T.5]  
0          1            0          0            0            1            0             0          0          0          0          0  
1          0            0          1            0            0            1             0          1          0          0          0  
2          0            0          1            0            0            0             1          0          0          0          0  
3          0            1          0            1            0            0             0          0          1          0          0  
4          1            0          0            0            0            0             0          0          0          1          0  
5          0            0          1            0            1            0             0          0          0          0          1

Here, C(a) means convert the variable to categorical, and the -1 avoids outputting an intercept column.

Answer 5

Answer by Tim

Putting together a couple of other comments into a single response answering the OP's questions.

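import pandas as pd

d = {'a' : pd.Series(['one', 'two', 'two', 'three', 'one', 'two']),
     'b' : pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2]),
     'c' : pd.Series([0, 1, 0, 2, 4, 5]) }

data = pd.DataFrame(d)
# Cross-tabulating the row index against a column gives one indicator column per value
a_cols = pd.crosstab(data.index, [data.a])
# Bin 'b' into half-open intervals [lo, hi) before expanding it the same way
b_bins = pd.cut(data.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)
b_cols = pd.crosstab(data.index, b_bins)
c_cols = pd.crosstab(data.index, [data.c])
# All three pieces share the row index, so they can be joined side by side
new_data = a_cols.join(b_cols).join(c_cols)
new_data.index.names = ['']
print(new_data.to_string())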

"""
       one  three  two  [0, 0.2)  [0.2, 0.4)  [0.4, 0.6)  [0.8, 1)  0  1  2  4  5

    0    1      0    0         0           1           0         0  1  0  0  0  0
    1    0      0    1         0           0           1         0  0  1  0  0  0
    2    0      0    1         0           0           0         1  1  0  0  0  0
    3    0      1    0         1           0           0         0  0  0  1  0  0
    4    1      0    0         1           0           0         0  0  0  0  1  0
    5    0      0    1         0           1           0         0  0  0  0  0  1
"""
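The Burt table mentioned in the question can then be obtained from a binary matrix like the one above by multiplying its transpose with itself. The sketch below builds the indicators with get_dummies instead of crosstab, which is a substitution made for brevity; the names `Z` and `burt` and the column prefixes are likewise assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

b_bins = pd.cut(df["b"], [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)

# Indicator (binary) matrix Z: one row per observation, one column per level.
Z = pd.concat(
    [
        pd.get_dummies(df["a"], prefix="a", dtype=int),
        pd.get_dummies(b_bins, prefix="b", dtype=int),
        pd.get_dummies(df["c"], prefix="c", dtype=int),
    ],
    axis=1,
)

# Burt table: entry (i, j) counts the rows where level i and level j co-occur.
# The diagonal holds the frequency of each level.
burt = Z.T @ Z
print(burt)
```

For a table with millions of rows, a dense Z may be large, so a sparse representation could be worth considering before the multiplication.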
