pandas: Reverse get_dummies encoding in pandas

Question

Asked by MukundS

The column names are: ID, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Each column value is either 0 or 1.

My DataFrame looks like this:

  ID    1    2    3    4    5    6    7    8    9
1002    0    1    0    1    0    0    0    0    0
1003    0    0    0    0    0    0    0    0    0
1004    1    1    0    0    0    0    0    0    0
1005    0    0    0    0    1    0    0    0    0
1006    0    0    0    0    0    1    0    0    0
1007    1    0    1    0    0    0    0    0    0
1000    0    0    0    0    0    0    0    0    0
1009    0    0    1    0    0    0    1    0    0

For each ID, I want the names of the columns whose value in that row is 1.

The DataFrame I want should look like this:

 ID      Col2
1002       2    // has 1 at Col(2) and Col(4)
1002       4    
1004       1    // has 1 at col(1) and col(2)
1004       2
1005       5    // has 1 at col(5)
1006       6    // has 1 at col(6)
1007       1    // has 1 at col(1) and col(3)
1007       3
1009       3    // has 1 at col(3) and col(7)
1009       7

Please help me with this. Thanks in advance.

Answer 1

Accepted answer by YOBEN_S

Use set_index + stack. With its default dropna=True, stack discards the NaN cells.

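df.set_index('ID', inplace=True)

# Keep only the cells equal to 1 (everything else becomes NaN), then stack.
# drop(0, 1) used a positional axis argument that pandas 2.0 removed.
df[df == 1].stack().reset_index().drop(columns=0)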
Out[363]: 
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7
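
The same idea can be sketched on a reduced copy of the question's data. This is a minimal sketch, not the answerer's exact code: the four-row frame and the names wide, stacked, and result are assumptions for illustration, and the explicit dropna() is added because newer pandas releases appear to change whether stack discards NaN cells on its own.

```python
import pandas as pd

# Reduced, hypothetical copy of the question's data (column labels are strings).
df = pd.DataFrame(
    {
        "ID": [1002, 1003, 1004, 1005],
        "1": [0, 0, 1, 0],
        "2": [1, 0, 1, 0],
        "4": [1, 0, 0, 0],
        "5": [0, 0, 0, 1],
    }
)

wide = df.set_index("ID")

# Cells that are not 1 become NaN; stack() moves the column labels into an
# inner index level. The explicit dropna() removes the NaN cells whether or
# not the installed stack() already dropped them.
stacked = wide.where(wide == 1).stack().dropna()

result = (
    stacked.rename_axis(["ID", "Col2"])  # name both index levels
    .reset_index()[["ID", "Col2"]]       # keep only the two label columns
)
print(result)
```

Naming the index levels before reset_index avoids the level_1 column name seen in the output above.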

Answer 2

Answer by cs95

`np.argwhere`

# drop('ID', 1) used a positional axis argument that pandas 2.0 removed
v = np.argwhere(df.drop(columns='ID').values).T
pd.DataFrame({'ID' : df.loc[v[0], 'ID'], 'Col2' : df.columns[1:][v[1]]})

  Col2    ID
0    2  1002
0    4  1002
2    1  1004
2    2  1004
3    5  1005
4    6  1006
5    1  1007
5    3  1007
7    3  1009
7    7  1009

argwhere gets the i, j indices of all non-zero elements in your DataFrame. Use the first column of indices to index into column ID, and the second column of indices to index into df.columns.

I transpose v before step 2 for cache efficiency and less typing.
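
The two indexing steps can be made explicit with a small self-contained sketch. The three-row frame and the names flags, pairs, rows, cols, and result are assumptions for illustration; the lookups are done positionally on NumPy arrays, so the sketch does not depend on the DataFrame having a default integer index.

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical copy of the question's data.
df = pd.DataFrame(
    {"ID": [1002, 1003, 1004], "1": [0, 0, 1], "2": [1, 0, 1], "4": [1, 0, 0]}
)

flags = df.drop(columns="ID").to_numpy()

# argwhere returns one (row, column) pair per non-zero cell, in row-major order.
pairs = np.argwhere(flags)

# Transposing lets the row and column positions be unpacked as two 1-D arrays.
rows, cols = pairs.T

result = pd.DataFrame(
    {
        "ID": df["ID"].to_numpy()[rows],                  # ID of each hit
        "Col2": df.columns.drop("ID").to_numpy()[cols],   # column label of each hit
    }
)
print(result)
```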

Answer 3

Answer by jezrael

Use:

df = (df.melt('ID', var_name='Col2')
       .query('value == 1')
       .sort_values(['ID', 'Col2'])
       .drop(columns='value'))  # drop('value', 1) no longer works in pandas 2.0+

Alternative solution:

df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .drop(columns=0))  # drop(0, 1) no longer works in pandas 2.0+

print(df)
      ID Col2
8   1002    2
24  1002    4
2   1004    1
10  1004    2
35  1005    5
44  1006    6
5   1007    1
21  1007    3
23  1009    3
55  1009    7

Explanation:

First, reshape the values with melt, or with set_index followed by stack.
Keep only the 1s with query, or convert the 0s to NaN with mask.
Use sort_values for the first solution.
Create columns from the MultiIndex with reset_index.
Finally, remove the unnecessary columns with drop.
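
The steps of the melt variant can be traced one at a time. This is a minimal sketch on a reduced, hypothetical copy of the question's data; the names long_df, ones, and result are assumptions for illustration, and boolean indexing stands in for query.

```python
import pandas as pd

# Reduced, hypothetical copy of the question's data.
df = pd.DataFrame(
    {"ID": [1002, 1003, 1004], "1": [0, 0, 1], "2": [1, 0, 1], "4": [1, 0, 0]}
)

# One row per (ID, column) cell; the old column labels land in "Col2".
long_df = df.melt(id_vars="ID", var_name="Col2")

# Keep only the cells equal to 1.
ones = long_df[long_df["value"] == 1]

result = (
    ones.sort_values(["ID", "Col2"])   # bring each ID's labels together
    .drop(columns="value")             # the flag column is now constant
    .reset_index(drop=True)
)
print(result)
```

Note that the column labels here are strings, so sort_values orders them lexicographically; with labels such as "10" a numeric conversion before sorting may be needed.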

Answer 4

Answer by Zeel Bharatkumar Patel 1931006

You can use idxmax over the columns to reverse pd.get_dummies, like this:

one_hot_encoded = pd.get_dummies(original)
original_back = one_hot_encoded.idxmax(axis=1)
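
A runnable version of this round trip, assuming a small hypothetical Series as the original data:

```python
import pandas as pd

# Hypothetical single categorical column.
original = pd.Series(["red", "green", "blue", "green"])

one_hot_encoded = pd.get_dummies(original)

# idxmax(axis=1) returns, for each row, the label of the first column that
# holds the row maximum, which is the single 1 (or True) in a one-hot row.
original_back = one_hot_encoded.idxmax(axis=1)

print(original_back.tolist())
```

This works when every row contains exactly one 1. Because idxmax reports only the first maximum, it does not recover rows with several 1s, such as those in the question, and a row of all zeros is mapped to the first column.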

Answer 5

Answer by Mahomet

There are several great answers to the OP's question. However, get_dummies is often used for multiple categorical features. pandas uses a prefix separator, prefix_sep, to distinguish the different values of a column.

The following function collapses a "dummified" dataframe while keeping the order of columns:

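def undummify(df, prefix_sep="_"):
    # Map each original column name (the text before the first separator)
    # to whether it was dummified, i.e. whether the separator appears at all.
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            # Pick the dummy column holding the row maximum, then strip the prefix.
            # Note: filter(like=col) is a substring match, so this assumes that no
            # original column name is contained in another column's name.
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            # Columns that were never dummified are passed through unchanged.
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df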

Example

>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
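
For comparison, recent pandas releases include a built-in inverse, pd.from_dummies. The sketch below assumes pandas 1.5 or later and reuses the example frame above; it is an alternative to undummify, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": ["A_1", "A_2"], "b": ["B_1", "B_2"], "c": ["C_1", "C_2"]})
df2 = pd.get_dummies(df)

# Assumes pandas >= 1.5, where pd.from_dummies is available.
# sep is the separator between the original column name and the category;
# only the first occurrence is used to split, so "a_A_1" becomes ("a", "A_1").
df3 = pd.from_dummies(df2, sep="_")
print(df3)
```

Unlike undummify, from_dummies expects every column to carry the separator, so columns that were never dummified would have to be set aside first.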

Answer 6

Answer by Tarun Bhavnani

https://stackoverflow.com/a/55757342/2384397

Rewriting it here: converting dat["classification"] to a one-hot encoding and back.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # the encoder instance was not created in the original snippet
dat["labels"] = le.fit_transform(dat["classification"])

Y = pd.get_dummies(dat["labels"])

# Recover the integer label of each row from the position of its 1
tru = []
for i in range(0, len(Y)):
    tru.append(np.argmax(Y.iloc[i].values))

tru = le.inverse_transform(tru)

Check that the round trip is identical:

(tru == dat["classification"]).value_counts()
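
The row-by-row loop can also be written as a single argmax along axis 1. In this sketch pd.factorize stands in for LabelEncoder so that the example needs only pandas and NumPy; the sample dat frame and the names codes, uniques, positions, and codes_back are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single categorical column.
dat = pd.DataFrame({"classification": ["cat", "dog", "bird", "dog"]})

# pd.factorize replaces LabelEncoder here: codes are the integer labels,
# uniques holds the original value for each code.
codes, uniques = pd.factorize(dat["classification"])
dat["labels"] = codes

Y = pd.get_dummies(dat["labels"])

# Position of the 1 in every row, computed in one call instead of a loop.
positions = Y.to_numpy().argmax(axis=1)

# Y's column labels are the integer codes: position -> code -> original value.
codes_back = Y.columns.to_numpy()[positions]
tru = uniques.to_numpy()[codes_back]

print((tru == dat["classification"]).all())
```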
