How do I discretize values in a pandas DataFrame and convert them to a binary matrix?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/10791661/

Date: 2020-09-13 15:43:44  Source: igfitidea

How do I discretize values in a pandas DataFrame and convert to a binary matrix?

Tags: python, pandas, dataframe

Asked by Uri Laserson

I mean something like this:

I have a DataFrame with columns that may be categorical or nominal. For each observation (row), I want to generate a new row in which every possible value of each variable becomes its own binary variable. For example, this matrix (the first row holds the column labels)

'a'     'b'     'c'
one     0.2     0
two     0.4     1
two     0.9     0
three   0.1     2
one     0.0     4
two     0.2     5

would be converted into something like this:

'a'              'b'                                                    'c'
one  two  three  [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]   0   1   2   3   4   5

 1    0     0        0          1          0          0          0       1   0   0   0   0   0
 0    1     0        0          0          1          0          0       0   1   0   0   0   0
 0    1     0        0          0          0          0          1       1   0   0   0   0   0
 0    0     1        1          0          0          0          0       0   0   1   0   0   0
 1    0     0        1          0          0          0          0       0   0   0   0   1   0
 0    1     0        0          1          0          0          0       0   0   0   0   0   1

Each variable (column) in the initial matrix gets binned into all of its possible values. If it's categorical, then each possible value becomes a new column. If it's a float, then the values are binned in some way (say, always splitting into 10 bins). If it's an int, then each possible int value can become a column, or the values can be binned as well.

FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would also be willing to skip this step, as what I am really trying to compute is a Burt table (a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the crosstab function? Otherwise, computing the cross-tabulation is just a simple matrix multiplication.
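
To illustrate the matrix-multiplication remark, here is a minimal sketch, assuming the indicator matrix is called Z and is built with pd.cut and pd.get_dummies (neither the name Z nor this way of building it comes from the question). The bin edges for b are taken from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Assumed bin edges for the float column; right=False gives bins [lo, hi).
b_bins = pd.cut(df["b"], bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)

# Indicator (binary) matrix: one 0/1 column per category or bin.
Z = pd.get_dummies(df.assign(b=b_bins), columns=["a", "b", "c"], dtype=int)

# Burt table: every pairwise cross-tabulation at once, as Z transposed times Z.
burt = Z.T @ Z
```

Under these assumptions, each diagonal entry of burt counts how often one category occurs, and each off-diagonal entry counts how often two categories occur in the same row.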

Accepted answer by lbolla

You can use some kind of broadcasting:

    In [58]: df
    Out[58]:
           a    b  c
    0    one  0.2  0
    1    two  0.4  1
    2    two  0.9  0
    3  three  0.1  2
    4    one  0.0  4
    5    two  0.2  5

    In [41]: (df.a.values[:,numpy.newaxis] == df.a.unique()).astype(int)
    Out[41]:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 0, 0],
           [0, 1, 0]])

    In [54]: ((0 <= df.b.values[:,numpy.newaxis]) & (df.b.values[:,numpy.newaxis] < 0.2)).astype(int)
    Out[54]:
    array([[0],
           [0],
           [0],
           [1],
           [1],
           [0]])

    In [59]: (df.c.values[:,numpy.newaxis] == df.c.unique()).astype(int)
    Out[59]:
    array([[1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 1]])

And then join all the pieces together with pandas.concat or similar.
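
The pieces above can be combined roughly as follows. This is only a sketch: the helper names indicator_by_value and indicator_by_bin, the column labels, and the bin edges are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

def indicator_by_value(series):
    """Hypothetical helper: one 0/1 column per distinct value, via broadcasting."""
    levels = np.asarray(series.unique())
    matrix = (series.to_numpy()[:, np.newaxis] == levels).astype(int)
    columns = [f"{series.name}={level}" for level in levels]
    return pd.DataFrame(matrix, index=series.index, columns=columns)

def indicator_by_bin(series, edges):
    """Hypothetical helper: one 0/1 column per half-open bin [lo, hi)."""
    values = series.to_numpy()[:, np.newaxis]
    lo, hi = np.asarray(edges[:-1]), np.asarray(edges[1:])
    matrix = ((lo <= values) & (values < hi)).astype(int)
    columns = [f"{series.name} in [{l}, {h})" for l, h in zip(edges[:-1], edges[1:])]
    return pd.DataFrame(matrix, index=series.index, columns=columns)

# Assumed bin edges; a value equal to the last edge (1.0) would match no bin.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

binary = pd.concat(
    [indicator_by_value(df["a"]), indicator_by_bin(df["b"], edges), indicator_by_value(df["c"])],
    axis=1,
)
```

Each row of binary then contains exactly one 1 per original variable. Note that, unlike the table in the question, only the values of c that actually occur get a column.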

Answer by Wes McKinney

Note that I have implemented new cut and qcut functions for discretizing continuous data:

http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
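
As a rough illustration of those two functions on the b column from the question (the bin edges, the choice of q=2, and the follow-up get_dummies call are assumptions, not part of this answer):

```python
import pandas as pd

b = pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2], name="b")

# Fixed bin edges; right=False makes every bin closed on the left: [lo, hi).
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fixed = pd.cut(b, bins=edges, right=False)

# Quantile-based bins: qcut derives the edges from the data's quantiles.
quantile = pd.qcut(b, q=2)

# Expanding the categorical result keeps empty bins as all-zero columns.
b_binary = pd.get_dummies(fixed, dtype=int)
```

cut suits the fixed-width binning described in the question, while qcut is the alternative when bins should follow the distribution of the data.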

Answer by wonderkid2

For labeled columns like the a and c columns in your example, you can use the pandas built-in function get_dummies().

Ex.:

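import numpy as np
import pandas as pd

s1 = ['a', 'b', np.nan]
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans.
pd.get_dummies(s1, dtype=int)
#    a  b
# 0  1  0
# 1  0  1
# 2  0  0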
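
Applied to the a and c columns of the question's DataFrame, this could look like the sketch below. The columns and dtype arguments are choices made here for illustration; the a_ and c_ prefixes are the default labels get_dummies generates.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Expand both labeled columns at once; columns= forces the integer column c
# to be treated as categorical as well.
dummies = pd.get_dummies(df[["a", "c"]], columns=["a", "c"], dtype=int)
```

The float column b is left out here because it needs to be binned first.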

Answer by elyase

I doubt you will beat patsy's simplicity. It was designed precisely for this task:

>>> from patsy import dmatrix
>>> dmatrix('C(a) + C(b) + C(c) - 1', df, return_type='dataframe')

   C(a)[one]  C(a)[three]  C(a)[two]  C(b)[T.0.1]  C(b)[T.0.2]  C(b)[T.0.4]   C(b)[T.0.9]  C(c)[T.1]  C(c)[T.2]  C(c)[T.4]  C(c)[T.5]  
0          1            0          0            0            1            0             0          0          0          0          0  
1          0            0          1            0            0            1             0          1          0          0          0  
2          0            0          1            0            0            0             1          0          0          0          0  
3          0            1          0            1            0            0             0          0          1          0          0  
4          1            0          0            0            0            0             0          0          0          1          0  
5          0            0          1            0            1            0             0          0          0          0          1  

Here C(a) means "convert the variable to categorical", and the -1 avoids outputting an intercept column.

Answer by Tim

Putting a couple of the other comments together into a single response that answers the OP's questions.

import pandas as pd

d = {'a': pd.Series(['one', 'two', 'two', 'three', 'one', 'two']),
     'b': pd.Series([0.2, 0.4, 0.9, 0.1, 0.0, 0.2]),
     'c': pd.Series([0, 1, 0, 2, 4, 5])}

data = pd.DataFrame(d)
# One indicator column per distinct value of 'a'
a_cols = pd.crosstab(data.index, [data.a])
# Bin 'b' into left-closed intervals [lo, hi), then expand the bins
b_bins = pd.cut(data.b, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], right=False)
b_cols = pd.crosstab(data.index, b_bins)
# One indicator column per distinct value of 'c'
c_cols = pd.crosstab(data.index, [data.c])
new_data = a_cols.join(b_cols).join(c_cols)
new_data.index.names = ['']
print(new_data.to_string())

"""
       one  three  two  [0, 0.2)  [0.2, 0.4)  [0.4, 0.6)  [0.8, 1)  0  1  2  4  5

    0    1      0    0         0           1           0         0  1  0  0  0  0
    1    0      0    1         0           0           1         0  0  1  0  0  0
    2    0      0    1         0           0           0         1  1  0  0  0  0
    3    0      1    0         1           0           0         0  0  0  1  0  0
    4    1      0    0         1           0           0         0  0  0  0  1  0
    5    0      0    1         0           1           0         0  0  0  0  0  1
"""