pandas: removing redundant columns when using get_dummies
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/50176096/
removing redundant columns when using get_dummies
Asked by gabboshow
I have a pandas DataFrame df containing categorical variables.
import pandas
df = pandas.DataFrame(data=[['male', 'blue'], ['female', 'brown'],
                            ['male', 'black']], columns=['gender', 'eyes'])
df
Out[16]:
   gender   eyes
0    male   blue
1  female  brown
2    male  black
Using the function get_dummies I get the following DataFrame:
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0
However, the columns gender_female and gender_male contain the same information, because the original column can take only two values. Is there a (smart) way to keep only one of the two columns?
UPDATED
Using
df_dummies = pandas.get_dummies(df,drop_first=True)
would give me
df_dummies
Out[21]:
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0
but I would like to drop a dummy column only for variables that originally had just two possible values.
The desired result is:
df_dummies
Out[18]:
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0
Answered by Joe
Yes, you can use the argument drop_first:
drop_first=True
From the documentation:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
To keep all dummy columns for eyes and only one for gender, use this:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
   eyes_black  eyes_blue  eyes_brown  gender_male
0           0          1           0            1
1           0          0           1            0
2           1          0           0            1
More generally, given a DataFrame such as:
   gender   eyes    heigh
0    male   blue     tall
1  female  brown    short
2    male  black  average
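# Fully encode every column that has more than two distinct values
for i in df.columns:
    if df[i].nunique() > 2:
        df = pd.get_dummies(df, prefix=[i], columns=[i])
# Remaining (two-valued) columns get a single dummy each
df = pd.get_dummies(df, drop_first=True)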
Output:
   eyes_black  eyes_blue  eyes_brown  heigh_average  heigh_short  heigh_tall  \
0           0          1           0              0            0           1
1           0          0           1              0            1           0
2           1          0           0              1            0           0

   gender_male
0            1
1            0
2            1
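The same idea could be wrapped in a small helper. This is only a sketch: the function name get_dummies_drop_binary and its columns parameter are made up for illustration and are not part of pandas. It assumes you pass the categorical columns explicitly, and it passes dtype=int so the result shows 0/1 as in the outputs above (recent pandas versions return booleans by default).

```python
import pandas as pd


def get_dummies_drop_binary(df, columns, dtype=int):
    """Hypothetical helper (not a pandas function).

    Encodes `columns` as dummies, dropping the first dummy only for
    columns that have exactly two distinct values.
    """
    binary = [c for c in columns if df[c].nunique() == 2]
    multi = [c for c in columns if c not in binary]
    out = df
    if multi:
        # Keep every dummy for columns with a different number of levels
        out = pd.get_dummies(out, columns=multi, dtype=dtype)
    if binary:
        # A two-level column needs only one dummy
        out = pd.get_dummies(out, columns=binary, drop_first=True, dtype=dtype)
    return out


df = pd.DataFrame(
    data=[['male', 'blue', 'tall'],
          ['female', 'brown', 'short'],
          ['male', 'black', 'average']],
    columns=['gender', 'eyes', 'heigh'],
)
result = get_dummies_drop_binary(df, columns=['gender', 'eyes', 'heigh'])
print(result)
```

Listing the columns explicitly avoids relying on dtype detection, which can differ between pandas versions.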
Answered by asongtoruin
You could use itertools.combinations to find all pairs of columns; any potentially redundant pair is one where, in every row, one column is True and the other is False, i.e. their XOR is True in every row:
import pandas as pd
from itertools import combinations
df = pd.DataFrame(data=[['male','blue'],['female','brown'],['male','black']],
columns=['gender','eyes'])
dummies = pd.get_dummies(df)
for c1, c2 in combinations(dummies.columns, 2):
    # A pair is complementary if the XOR is truthy in every row
    if all(dummies[c1] ^ dummies[c2]):
        print(c1, c2)
However, this also flags that in your example the females are exactly the people with brown eyes (so gender_male and eyes_brown are complementary too), hence the following is printed:
gender_female gender_male
gender_male eyes_brown
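One possible way to avoid such cross-variable matches is to compare only dummies that come from the same original column. The following is a rough sketch under that assumption, not part of the original answer; the list name redundant is just illustrative, and dtype=bool is passed so that ^ is a plain boolean XOR.

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame(data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
                  columns=['gender', 'eyes'])

redundant = []
for col in df.columns:
    # Build dummies for one source column at a time,
    # so a pair never mixes two different variables
    d = pd.get_dummies(df[col], prefix=col, dtype=bool)
    for c1, c2 in combinations(d.columns, 2):
        if (d[c1] ^ d[c2]).all():
            redundant.append((c1, c2))

print(redundant)
# [('gender_female', 'gender_male')]
```

With the example data only the gender pair is reported, since no two eyes dummies are complementary in every row.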