Reverse a get_dummies encoding in pandas

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/50607740/

Time: 2020-09-14 05:37:36  Source: igfitidea

Tags: python, pandas, dataframe

Asked by MukundS

The column names are: ID, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Each column value is either 0 or 1.

My dataframe looks like this:

 ID     1    2    3    4    5    6   7   8   9 

1002    0    1    0    1    0    0   0   0   0
1003    0    0    0    0    0    0   0   0   0 
1004    1    1    0    0    0    0   0   0   0
1005    0    0    0    0    1    0   0   0   0
1006    0    0    0    0    0    1   0   0   0
1007    1    0    1    0    0    0   0   0   0
1000    0    0    0    0    0    0   0   0   0
1009    0    0    1    0    0    0   1   0   0

For each ID, I want the names of the columns whose value in that row is 1, listed next to the ID.

The dataframe I want should look like this:

 ID      Col2
1002       2    // has 1 at Col(2) and Col(4)
1002       4    
1004       1    // has 1 at col(1) and col(2)
1004       2
1005       5    // has 1 at col(5)
1006       6    // has 1 at col(6)
1007       1    // has 1 at col(1) and col(3)
1007       3
1009       3    // has 1 at col(3) and col(7)
1009       7

Please help me with this. Thanks in advance.

Accepted answer by YOBEN_S

Use set_index + stack; stack drops NaN values by default.

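df.set_index('ID',inplace=True)

# df[df==1] turns every 0 into NaN, and stack() then drops those NaN cells
# (drop(columns=0) replaces the positional drop(0,1), which newer pandas rejects)
df[df==1].stack().reset_index().drop(columns=0)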
Out[363]: 
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7
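
As a self-contained check of the approach above, here is a minimal sketch that rebuilds the question's table and applies set_index + stack. The names `rows`, `wide` and `long_df` are made up for illustration, and `dropna()` is called explicitly on the assumption that the default NaN handling of `stack()` may differ between pandas versions.

```python
import pandas as pd

# The question's data: for each ID, the columns (1..9) that hold a 1.
rows = {
    1002: [2, 4],
    1003: [],
    1004: [1, 2],
    1005: [5],
    1006: [6],
    1007: [1, 3],
    1000: [],
    1009: [3, 7],
}
df = pd.DataFrame(
    [[id_] + [int(c in ones) for c in range(1, 10)] for id_, ones in rows.items()],
    columns=["ID"] + list(range(1, 10)),
)

wide = df.set_index("ID")
# where(wide == 1) turns every 0 into NaN; the explicit dropna() removes those
# cells whether or not stack() already dropped them.
long_df = (
    wide.where(wide == 1)
    .stack()
    .dropna()
    .rename_axis(["ID", "Col2"])
    .reset_index()
    .drop(columns=0)
)
print(long_df)
```

IDs 1003 and 1000 have no 1s, so they do not appear in the result.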

Answer by cs95

np.argwhere

v = np.argwhere(df.drop(columns='ID').values).T
pd.DataFrame({'ID' : df.loc[v[0], 'ID'], 'Col2' : df.columns[1:][v[1]]})

  Col2    ID
0    2  1002
0    4  1002
2    1  1004
2    2  1004
3    5  1005
4    6  1006
5    1  1007
5    3  1007
7    3  1009
7    7  1009

argwhere gets the i, j indices of all non-zero elements in your DataFrame. Use the first column of indices to index into the ID column, and the second column of indices to index into df.columns.

I transpose v before step 2 for cache efficiency, and less typing.
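
To make the index bookkeeping concrete, here is a small sketch on a three-row subset of the question's data. The names `values`, `pairs`, `row_pos` and `col_pos` are illustrative, and the lookups are purely positional (an assumption that differs slightly from the `df.loc` call above) so that the sketch does not depend on the DataFrame's index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"ID": [1002, 1003, 1004], 1: [0, 0, 1], 2: [1, 0, 1], 4: [1, 0, 0]}
)

values = df.drop(columns="ID").to_numpy()
# argwhere returns one (row, column) pair per non-zero cell, in row-major order.
pairs = np.argwhere(values)
row_pos, col_pos = pairs.T

result = pd.DataFrame(
    {
        # first index column -> which ID the cell belongs to
        "ID": df["ID"].to_numpy()[row_pos],
        # second index column -> which indicator column the cell sits in
        "Col2": df.columns[1:].to_numpy()[col_pos],
    }
)
print(pairs.tolist())
print(result)
```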

Answer by jezrael

Use:

df = (df.melt('ID', var_name='Col2')
       .query('value== 1')
       .sort_values(['ID', 'Col2'])
       .drop(columns='value'))

Alternative solution:

df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .drop(columns=0))


print (df)
      ID Col2
8   1002    2
24  1002    4
2   1004    1
10  1004    2
35  1005    5
44  1006    6
5   1007    1
21  1007    3
23  1009    3
55  1009    7

Explanation:

  1. First reshape the values with melt, or with set_index followed by stack

  2. Keep only the 1s with query, or convert the 0s to NaN with mask

  3. Sort with sort_values (first solution only)

  4. Create columns from the MultiIndex with reset_index

  5. Finally, remove the unnecessary columns with drop
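
The steps for the first (melt) solution can be traced one at a time. This is a minimal sketch on a three-row subset of the question's data; the column names are written as strings here as a simplifying assumption, and `melted`, `ones` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": [1002, 1003, 1004], "1": [0, 0, 1], "2": [1, 0, 1], "4": [1, 0, 0]}
)

# Step 1: melt to long form, one row per (ID, column name) pair.
melted = df.melt("ID", var_name="Col2")
# Step 2: keep only the cells that hold a 1.
ones = melted.query("value == 1")
# Steps 3 and 5: sort by ID and column name, then drop the value column,
# which now contains nothing but 1s.
result = (
    ones.sort_values(["ID", "Col2"])
    .drop(columns="value")
    .reset_index(drop=True)
)
print(result)
```

Step 4 (reset_index on a MultiIndex) only applies to the stack variant; here `reset_index(drop=True)` simply renumbers the rows.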

Answer by Zeel Bharatkumar Patel 1931006

You can use idxmax over the columns to reverse pd.get_dummies, like this:

one_hot_encoded = pd.get_dummies(original)
original_back = one_hot_encoded.idxmax(axis=1)
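
A runnable version of the round trip, with made-up sample data (`original` holds hypothetical colour labels). It also shows a caveat worth keeping in mind: idxmax returns only the first column holding the maximum, so it fits strictly one-hot rows and would not recover rows with several 1s, such as those in the question.

```python
import pandas as pd

original = pd.Series(["red", "green", "blue", "green"], name="color")

# dtype=int gives 0/1 columns named after the labels (blue, green, red).
one_hot = pd.get_dummies(original, dtype=int)
# For each row, take the label of the column holding the maximum (the single 1).
original_back = one_hot.idxmax(axis=1)
print(original_back.tolist())

# Caveat: with several 1s in a row, only the first matching column is reported.
multi_hot = pd.DataFrame({"1": [1, 0], "2": [1, 1]}, index=[1004, 1005])
first_only = multi_hot.idxmax(axis=1)
print(first_only.tolist())
```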

Answer by Mahomet

Several great answers for the OP's post. However, get_dummies is often used for multiple categorical features. Pandas uses a prefix separator, prefix_sep, to distinguish the different values of a column.

The following function collapses a "dummified" dataframe while keeping the order of columns:

def undummify(df, prefix_sep="_"):
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
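                # select only the columns that start with this prefix
                # (filter(like=col) would also match col anywhere in a name)
                df.loc[:, df.columns.str.startswith(col + prefix_sep)]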
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df

Example

例子

>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
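
For comparison, recent pandas versions ship a built-in inverse, pd.from_dummies (added in pandas 1.5, to the best of my knowledge). The sketch below assumes such a version is installed and that every row is strictly one-hot within each prefix group; it is an alternative to the function above, not a drop-in replacement for it.

```python
import pandas as pd

df = pd.DataFrame({"a": ["A_1", "A_2"], "b": ["B_1", "B_2"], "c": ["C_1", "C_2"]})
df2 = pd.get_dummies(df)

# from_dummies splits each column name at the first occurrence of sep,
# so "a_A_1" becomes prefix "a" and category "A_1".
df3 = pd.from_dummies(df2, sep="_")
print(df3)
```

Unlike undummify, from_dummies expects every column to carry the separator, so untouched non-dummy columns would need to be set aside first.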

Answer by Tarun Bhavnani

https://stackoverflow.com/a/55757342/2384397

Rewriting it here: converting dat["classification"] to a one-hot encoding and back.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# encode the class labels as integers, then one-hot encode the integers
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])

# for each row, the position of the 1 is the integer label
tru = []
for i in range(0, len(Y)):
    tru.append(np.argmax(Y.iloc[i].to_numpy()))

# map the integer labels back to the original class names
tru = le.inverse_transform(tru)

Check that the result is identical to the original:

(tru == dat["classification"]).value_counts()
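
The row-by-row loop can also be written as a single argmax over the whole array. The sketch below is a variation, not the answer's own code: it uses pd.factorize in place of LabelEncoder so that it needs only pandas and NumPy, and `dat` is filled with made-up labels.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({"classification": ["cat", "dog", "bird", "dog"]})

# factorize returns integer codes plus the unique labels those codes index.
codes, uniques = pd.factorize(dat["classification"])
Y = pd.get_dummies(codes)

# One argmax over the whole array replaces the per-row loop.
positions = Y.to_numpy().argmax(axis=1)
# Column labels of Y are the integer codes; map them back to the class names.
recovered = uniques.to_numpy()[Y.columns.to_numpy()[positions]]

print(list(recovered) == dat["classification"].tolist())
```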