Reverse a get_dummies encoding in pandas
Original question: http://stackoverflow.com/questions/50607740/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by MukundS
The column names are: ID, 1, 2, 3, 4, 5, 6, 7, 8, 9.
The column values are either 0 or 1.
My dataframe looks like this:
ID 1 2 3 4 5 6 7 8 9
1002 0 1 0 1 0 0 0 0 0
1003 0 0 0 0 0 0 0 0 0
1004 1 1 0 0 0 0 0 0 0
1005 0 0 0 0 1 0 0 0 0
1006 0 0 0 0 0 1 0 0 0
1007 1 0 1 0 0 0 0 0 0
1000 0 0 0 0 0 0 0 0 0
1009 0 0 1 0 0 0 1 0 0
For each ID, I want the names of the columns whose value in that row is 1, listed next to the ID (one row per match).
The DataFrame I want should look like this:
ID Col2
1002 2 // has 1 at Col(2) and Col(4)
1002 4
1004 1 // has 1 at col(1) and col(2)
1004 2
1005 5 // has 1 at col(5)
1006 6 // has 1 at col(6)
1007 1 // has 1 at col(1) and col(3)
1007 3
1009 3 // has 1 at col(3) and col(7)
1009 7
Please help me with this. Thanks in advance.
Accepted answer by YOBEN_S
Use set_index + stack; stack drops NaN values by default.
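# Move ID into the index so that only the 0/1 columns get reshaped
df.set_index('ID', inplace=True)
# Keep only the cells equal to 1 (the rest become NaN), stack, then drop the value column
df[df == 1].stack().reset_index().drop(columns=0)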
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
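The set_index + stack idea can be tried end to end with the sketch below. It rebuilds the question's data by hand, and a few choices are assumptions made for illustration rather than part of the answer: the column names are strings, the result column is named Col2 via rename_axis, the temporary name "flag" is arbitrary, and an explicit dropna() is added so the result does not depend on whether the installed pandas version drops NaN inside stack.

```python
import pandas as pd

# Sample data copied from the question (column names kept as strings here)
df = pd.DataFrame(
    {
        "ID": [1002, 1003, 1004, 1005, 1006, 1007, 1000, 1009],
        "1": [0, 0, 1, 0, 0, 1, 0, 0],
        "2": [1, 0, 1, 0, 0, 0, 0, 0],
        "3": [0, 0, 0, 0, 0, 1, 0, 1],
        "4": [1, 0, 0, 0, 0, 0, 0, 0],
        "5": [0, 0, 0, 1, 0, 0, 0, 0],
        "6": [0, 0, 0, 0, 1, 0, 0, 0],
        "7": [0, 0, 0, 0, 0, 0, 0, 1],
        "8": [0] * 8,
        "9": [0] * 8,
    }
)

wide = df.set_index("ID")

# Cells that are not 1 become NaN; stacking yields one entry per (ID, column) pair
stacked = wide.where(wide == 1).stack().dropna()

# Name the two index levels, turn them into columns, and discard the flag values
result = (
    stacked.rename("flag")
    .rename_axis(["ID", "Col2"])
    .reset_index()
    .drop(columns="flag")
)
print(result)
```

IDs 1003 and 1000 have no 1 in any column, so they do not appear in the result at all.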
Answer by cs95
Use np.argwhere:
# Assumes numpy is imported as np, pandas as pd, and that ID is still a regular column of df
v = np.argwhere(df.drop(columns='ID').values).T
pd.DataFrame({'ID': df.loc[v[0], 'ID'], 'Col2': df.columns[1:][v[1]]})
Col2 ID
0 2 1002
0 4 1002
2 1 1004
2 2 1004
3 5 1005
4 6 1006
5 1 1007
5 3 1007
7 3 1009
7 7 1009
argwhere gets the (i, j) indices of all non-zero elements in your DataFrame. Use the first column of indices to index into the ID column, and the second column of indices to index into df.columns.
I transpose v before step 2 for cache efficiency, and less typing.
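To make the role of the (i, j) pairs concrete, here is a small sketch on a hand-made array. The names flags, ids and names are hypothetical and exist only for this illustration; they stand in for the 0/1 block, the ID column and the column labels.

```python
import numpy as np

# A tiny 0/1 block: 3 rows (IDs) by 4 indicator columns
flags = np.array(
    [
        [0, 1, 0, 1],
        [0, 0, 0, 0],
        [1, 1, 0, 0],
    ]
)
ids = np.array([1002, 1003, 1004])      # hypothetical row labels
names = np.array(["1", "2", "3", "4"])  # hypothetical column labels

# One (row, column) pair per non-zero cell, in row-major order
pairs = np.argwhere(flags)

# Transposing splits the pairs into an array of row positions and one of column positions
rows, cols = pairs.T

# Fancy indexing maps the positions back to labels
print(list(zip(ids[rows], names[cols])))
```

The row of zeros contributes no pairs, which is why IDs without any 1 vanish from the output.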
Answer by jezrael
Use:
df = (df.melt('ID', var_name='Col2')
        .query('value == 1')
        .sort_values(['ID', 'Col2'])
        .drop(columns='value'))
Alternative solution:
df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .drop(columns=0))
print (df)
ID Col2
8 1002 2
24 1002 4
2 1004 1
10 1004 2
35 1005 5
44 1006 6
5 1007 1
21 1007 3
23 1009 3
55 1009 7
Explanation:
sort_values orders the rows in the first solution. In the second solution, reset_index creates columns from the MultiIndex. Finally, drop removes the column that is no longer needed.
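The melt-based variant can be followed step by step in the sketch below. The three-row frame is a reduced, hand-made version of the question's data, and the intermediate names long and hits are hypothetical. A boolean mask is used in place of query; it is meant as an equivalent filter, not as the answer's exact code.

```python
import pandas as pd

# Reduced sample with string column names (an assumption for this sketch)
df = pd.DataFrame(
    {
        "ID": [1002, 1003, 1004],
        "1": [0, 0, 1],
        "2": [1, 0, 1],
        "4": [1, 0, 0],
    }
)

# One row per (ID, original column) pair; the 0/1 flag lands in the 'value' column
long = df.melt("ID", var_name="Col2")

# Keep only the pairs whose flag is 1
hits = long[long["value"] == 1]

# Order by ID and column name, drop the flag, and renumber the rows
result = (
    hits.sort_values(["ID", "Col2"])
    .drop(columns="value")
    .reset_index(drop=True)
)
print(result)
```

Because Col2 holds strings here, sort_values orders it lexicographically; that matches numeric order only while the column names are single digits.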
Answer by Zeel Bharatkumar Patel 1931006
You can use idxmax over the columns to reverse pd.get_dummies, as long as each row contains exactly one 1:
one_hot_encoded = pd.get_dummies(original)
original_back = one_hot_encoded.idxmax(axis=1)
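A quick round trip shows when idxmax recovers the original values and when it cannot. The sample Series and the frame named ambiguous are made up for this sketch, and dtype=int is passed to get_dummies only to keep the indicator columns numeric regardless of the pandas version.

```python
import pandas as pd

original = pd.Series(["red", "green", "blue", "green"], name="color")

# One indicator column per distinct value
one_hot = pd.get_dummies(original, dtype=int)

# For each row, take the label of the column that holds the 1
restored = one_hot.idxmax(axis=1)
print(restored.tolist())

# Limitation: idxmax returns the first column holding the row maximum, so a row
# with several 1s loses information and an all-zero row still yields a label
ambiguous = pd.DataFrame({"1": [1, 0], "2": [1, 0]})
print(ambiguous.idxmax(axis=1).tolist())
```

The second case is the situation in the original question, where rows contain two 1s or none, so idxmax alone would not produce the requested output there.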
Answer by Mahomet
There are several great answers to the OP's post. However, get_dummies is often used on multiple categorical features at once. In that case pandas uses a prefix separator, prefix_sep, to separate the original column name from each of its values.
The following function collapses a "dummified" dataframe while keeping the order of columns:
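def undummify(df, prefix_sep="_"):
    # Map each column prefix to whether it came from get_dummies (its name contains the separator)
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            # Note: filter(like=...) matches the prefix anywhere in a column name
            # Pick the dummy column holding the 1, then strip the "prefix + separator" part
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            # Column was never dummified; keep it unchanged
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df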
Example
>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
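As a point of comparison, and not part of the answer above, newer pandas releases ship a built-in inverse, pd.from_dummies. The sketch below assumes pandas 1.5 or later and that every column of the dummified frame contains the separator; the two-column sample frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": ["A_1", "A_2"], "b": ["B_1", "B_2"]})

# Dummy columns are named "<original column><sep><value>", e.g. a_A_1
dummies = pd.get_dummies(df)

# sep tells from_dummies where the prefix ends: the text before the first "_"
# becomes the column name, and everything after it becomes the category value
restored = pd.from_dummies(dummies, sep="_")
print(restored)
```

Unlike the idxmax approach, from_dummies is expected to raise an error for rows that do not have exactly one active column per prefix, unless a default_category is supplied.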
Answer by Tarun Bhavnani
https://stackoverflow.com/a/55757342/2384397
Rewriting here: converting dat["classification"] to a one-hot encoding and back.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# dat is assumed to be an existing DataFrame with a "classification" column
le = LabelEncoder()  # the encoder was never created in the original snippet
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])
# Recover the integer label of each row: the position of the 1 in that row
tru = [np.argmax(Y.iloc[i].values) for i in range(len(Y))]
tru = le.inverse_transform(tru)
Identity check:
(tru == dat["classification"]).value_counts()