Reversing 'one-hot' encoding in Pandas

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38334296/

Date: 2020-09-14 01:33:56  Source: igfitidea

Tags: python, pandas, numpy, dataframe

Asked by Peadar Coyle

Problem statement: I want to go from this data frame, which is basically one-hot encoded:

    In [2]: pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})
    Out[2]:
       fox  monkey  rabbit
    0    0       0       1
    1    0       1       0
    2    1       0       0
    3    0       0       0
    4    0       0       0

To this one, which is 'reverse' one-hot encoded:

    In [3]: pd.DataFrame({"animal":["monkey","rabbit","fox"]})
    Out[3]:
       animal
    0  monkey
    1  rabbit
    2     fox

I imagine there's some clever use of apply or zip to do this, but I'm not sure how. Can anyone help?

I've not had much success using indexing etc. to try to solve this problem.

Accepted answer by PYOak

I would use apply to decode the columns:

In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})

In [3]: def get_animal(row):
   ...:     for c in animals.columns:
   ...:         if row[c]==1:
   ...:             return c

In [4]: animals.apply(get_animal, axis=1)
Out[4]: 
0    rabbit
1    monkey
2       fox
3      None
4      None
dtype: object

Answer by MaxU

UPDATE: I think ayhan is right, and it should be:

df.idxmax(axis=1)

Demo:

In [40]: s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])

In [41]: s
Out[41]:
0     dog
1     cat
2     dog
3    bird
4     fox
5     dog
dtype: object

In [42]: pd.get_dummies(s)
Out[42]:
   bird  cat  dog  fox
0   0.0  0.0  1.0  0.0
1   0.0  1.0  0.0  0.0
2   0.0  0.0  1.0  0.0
3   1.0  0.0  0.0  0.0
4   0.0  0.0  0.0  1.0
5   0.0  0.0  1.0  0.0

In [43]: pd.get_dummies(s).idxmax(1)
Out[43]:
0     dog
1     cat
2     dog
3    bird
4     fox
5     dog
dtype: object
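
One caveat when applying idxmax(axis=1) to the frame from the question: idxmax returns the first column that holds the row maximum, so a row made up entirely of zeros still gets a label. The following is a minimal sketch of one way to handle that, assuming such rows should stay unlabelled; the names raw, has_label and animal are only illustrative.

```python
import pandas as pd

# Same data as in the question; rows 3 and 4 contain no 1 at all.
df = pd.DataFrame({
    "fox":    [0, 0, 1, 0, 0],
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
})

# idxmax(axis=1) picks the first column holding the row maximum,
# so the all-zero rows are labelled "fox" here.
raw = df.idxmax(axis=1)

# Keep the label only where the row really contains a 1;
# every other row becomes a missing value.
has_label = df.eq(1).any(axis=1)
animal = raw.where(has_label)

print(animal)
```

Whether missing values or a placeholder string suit the all-zero rows better depends on what the decoded column is used for afterwards.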

OLD answer (most probably incorrect):

Try this:

In [504]: df.idxmax().reset_index().rename(columns={'index':'animal', 0:'idx'})
Out[504]:
   animal  idx
0     fox    2
1  monkey    1
2  rabbit    0

Data:

In [505]: df
Out[505]:
   fox  monkey  rabbit
0    0       0       1
1    0       1       0
2    1       0       0
3    0       0       0
4    0       0       0

Answer by piRSquared

I'd do:

# assumes pandas is imported as pd and numpy as np
cols = df.columns.to_series().values
pd.DataFrame(np.repeat(cols[None, :], len(df), 0)[df.astype(bool).values], df.index[df.any(axis=1)])
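
The one-liner above tiles the column names to the shape of the frame and then picks entries with a boolean mask. Here is a step-by-step sketch of the same idea, assuming each row contains at most one 1; it swaps np.repeat for np.broadcast_to, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fox":    [0, 0, 1, 0, 0],
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
})

cols = df.columns.to_numpy()          # 1-D array of column names
mask = df.to_numpy().astype(bool)     # True wherever a cell is 1

# View the names as a 2-D array shaped like the mask, then boolean-index it;
# the selected names come out in row-major order.
names = np.broadcast_to(cols, mask.shape)[mask]

# Only rows that contain a 1 contribute a name, so reuse their index labels.
result = pd.Series(names, index=df.index[mask.any(axis=1)], name="animal")

print(result)
```

Rows without any 1 are simply absent from the result. If a row held several 1s, the number of names would no longer match the index, so this form is only suitable for strictly one-hot data.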

Timing

MaxU's method has the edge for large dataframes.

Small df (5 x 3):

[image: timing results]

Large df (1000000 x 52):

[image: timing results]

Answer by Sudharshann D

This works with both single and multiple labels.

We can use advanced indexing to tackle this problem. Here is the link.

import pandas as pd

# Columns are listed alphabetically so the output matches the table below.
df = pd.DataFrame({"cat":[0,0,0,0,1], "fox":[1,0,1,0,0],
    "monkey":[1,1,0,1,0], "rabbit":[1,1,1,1,0]})

df['tags'] = ''  # create an empty column

for col_name in df.columns:
    # .loc replaces the removed .ix indexer
    df.loc[df[col_name]==1, 'tags'] = df['tags'] + ' ' + col_name

print(df)

And the result is:

   cat  fox  monkey  rabbit                tags
0    0    1       1       1   fox monkey rabbit
1    0    0       1       1       monkey rabbit
2    0    1       0       1          fox rabbit
3    0    0       1       1       monkey rabbit
4    1    0       0       0                 cat

Explanation: We iterate over the columns of the dataframe.

df.loc[selection criteria, columns to write value] = value
df.loc[df[col_name]==1, 'tags'] = df['tags'] + ' ' + col_name

The above line finds all the rows where df[col_name] == 1, selects the 'tags' column for those rows, and sets it to the right-hand side value, df['tags'] + ' ' + col_name.

Note: .ix has been deprecated since Pandas v0.20 and was removed in v1.0. You should instead use .loc or .iloc, as appropriate.
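
For comparison, the same multi-label tags can be built without an explicit loop. The sketch below uses a different technique from this answer, the dot-product idiom: multiplying a 0/1 integer by a string repeats the string zero times or once, so the dot product concatenates the names of the set columns. It assumes the frame holds only integer 0/1 values, and the name tags is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "cat":    [0, 0, 0, 0, 1],
    "fox":    [1, 0, 1, 0, 0],
    "monkey": [1, 1, 0, 1, 0],
    "rabbit": [1, 1, 1, 1, 0],
})

# Each cell (0 or 1) multiplies "<column name> "; summing across a row
# joins the names of the columns that are set in that row.
tags = df.dot(df.columns + " ").str.strip()

print(tags)
```

The trailing separator is removed with str.strip(), so the tags carry no leading or trailing spaces.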

Answer by Merlin

Try this:

df = pd.DataFrame({"monkey":[0,1,0,1,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0], "cat":[0,0,0,0,1]})
df 

   cat  fox  monkey  rabbit
0    0    0       0       1
1    0    0       1       0
2    0    1       0       0
3    0    0       1       0
4    1    0       0       0

pd.DataFrame([x for x in np.where(df ==1, df.columns,'').flatten().tolist() if len(x) >0],columns= (["animal"]) )

   animal
0  rabbit
1  monkey
2     fox
3  monkey
4     cat

Answer by conflicted_user

You could try using melt(). This method also works when you have multiple OHE labels for a row.

# Your OHE dataframe
df = pd.DataFrame({"monkey":[0,1,0],"rabbit":[1,0,0],"fox":[0,0,1]})

# Melting; var_name is passed as a plain string, since newer pandas rejects a list here
mel = df.melt(var_name='animal', value_name='value')

mel[mel.value == 1].reset_index(drop=True) # this gives you the result
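
Note that melt() stacks the frame column by column, so the filtered rows come back in column order rather than in the original row order. Here is a minimal sketch of one way to keep the row order, assuming the original row label is carried through the melt as an id column; the column name "row" and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"monkey": [0, 1, 0], "rabbit": [1, 0, 0], "fox": [0, 0, 1]})

# Turn the index into a regular column so it survives the melt.
long = df.rename_axis("row").reset_index().melt(
    id_vars="row", var_name="animal", value_name="value"
)

# Keep the cells that are set, restore the original row order,
# and use the row label as the index again.
decoded = (
    long[long["value"] == 1]
    .sort_values("row", kind="stable")
    .set_index("row")["animal"]
)

print(decoded)
```

With several 1s in a row, the row label simply appears more than once in the result, which keeps the multi-label case readable.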

Answer by Shakeeb Pasha

This can be achieved with a simple apply on the dataframe:

# function to get the name of the column whose value is 1 for each row in the dataframe
def get_animal(row):
    return row.index[row == 1][0]

# prepare an animal column
df['animal'] = df.apply(get_animal, axis=1)
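
Taking element [0] of the matching column names raises an IndexError for any row that contains no 1, as rows 3 and 4 of the frame in the question do. Below is a small sketch of a guarded variant, assuming such rows should map to a missing value; get_animal_safe is a hypothetical name and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "fox":    [0, 0, 1, 0, 0],
    "monkey": [0, 1, 0, 0, 0],
    "rabbit": [1, 0, 0, 0, 0],
})

def get_animal_safe(row):
    """Return the first column whose value is 1, or None if the row has none."""
    hits = row.index[row == 1]
    return hits[0] if len(hits) else None

# Compute the labels first, then attach them as a new column.
animal = df.apply(get_animal_safe, axis=1)
df["animal"] = animal

print(df)
```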