pandas: removing redundant columns when using get_dummies

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/50176096/

Time: 2020-09-14 05:31:49  Source: igfitidea

Tags: python, pandas, categorical-data

Asked by gabboshow

Hi, I have a pandas dataframe df containing categorical variables.

import pandas

df = pandas.DataFrame(data=[['male', 'blue'], ['female', 'brown'],
                            ['male', 'black']], columns=['gender', 'eyes'])

df
Out[16]: 
   gender   eyes
0    male   blue
1  female  brown
2    male  black

Using the function get_dummies, I get the following dataframe:

df_dummies = pandas.get_dummies(df)

df_dummies
Out[18]: 
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0

However, the columns gender_female and gender_male contain the same information, because the original column can take only two values. Is there a (smart) way to keep only one of the two columns?
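The redundancy described above can be checked directly. The following is a minimal sketch, not part of the original question; it assumes the same example data and passes dtype=int (an assumption made here so the dummies are 0/1 integers regardless of the pandas version in use).

```python
import pandas as pd

df = pd.DataFrame(data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
                  columns=['gender', 'eyes'])

# dtype=int is an assumption of this sketch: it forces 0/1 integer dummies.
df_dummies = pd.get_dummies(df, dtype=int)

# For a two-valued column, each dummy is the complement of the other,
# so one of them carries no extra information.
is_complement = (df_dummies['gender_female'] == 1 - df_dummies['gender_male']).all()
print(is_complement)
```

If the check holds, either gender column can be dropped without losing information.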

UPDATED

Using

df_dummies = pandas.get_dummies(df,drop_first=True)

would give me

df_dummies
Out[21]: 
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0

but I would like to remove a column only for variables that originally had just 2 possible values.

The desired result should be:

df_dummies
Out[18]: 
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0

Answered by Joe

Yes, you can use the argument drop_first:

drop_first=True

From the documentation:

pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0

To keep all the dummy columns for eyes and only one for gender, use this:

df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)

Output:

   eyes_black  eyes_blue  eyes_brown  gender_male
0           0          1           0            1
1           0          0           1            0
2           1          0           0            1

More generally:

   gender   eyes    heigh
0    male   blue     tall
1  female  brown    short
2    male  black  average

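# Fully encode every column that has more than two distinct values.
for col in df.columns:
    if df[col].nunique() > 2:
        df = pd.get_dummies(df, prefix=[col], columns=[col])
# Encode the remaining (binary) columns, keeping a single dummy for each.
df = pd.get_dummies(df, drop_first=True)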

Output:

   eyes_black  eyes_blue  eyes_brown  heigh_average  heigh_short  heigh_tall  \
0           0          1           0              0            0           1   
1           0          0           1              0            1           0   
2           1          0           0              1            0           0    

   gender_male  
0            1  
1            0  
2            1
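The loop above can also be wrapped in a reusable function. This is a hedged sketch rather than the answerer's code: the name dummies_drop_binary is hypothetical, it assumes every column in the frame is categorical, and it passes dtype=int (an assumption) so the result uses 0/1 integers regardless of the pandas version.

```python
import pandas as pd


def dummies_drop_binary(df, dtype=int):
    """Hypothetical helper: one-hot encode df, keeping a single dummy
    for every column that has at most two distinct values."""
    # Columns with more than two distinct values keep all their dummies.
    multi = [c for c in df.columns if df[c].nunique() > 2]
    out = pd.get_dummies(df, columns=multi, dtype=dtype) if multi else df
    # The columns still left unencoded are binary: drop their first level.
    return pd.get_dummies(out, drop_first=True, dtype=dtype)


df = pd.DataFrame(
    data=[['male', 'blue', 'tall'],
          ['female', 'brown', 'short'],
          ['male', 'black', 'average']],
    columns=['gender', 'eyes', 'heigh'],
)
result = dummies_drop_binary(df)
print(result.columns.tolist())
```

Note that a column with only one distinct value would lose its sole dummy under drop_first=True, so such columns may need separate handling.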

Answered by asongtoruin

You could use itertools.combinations to find all pairs of columns. Any potentially redundant pair is then one where, in every row, one column is True and the other is False, i.e. an XOR:

import pandas as pd
from itertools import combinations

df = pd.DataFrame(data=[['male','blue'],['female','brown'],['male','black']],
                  columns=['gender','eyes'])

dummies = pd.get_dummies(df)

for c1, c2 in combinations(dummies.columns, 2):
    if all(dummies[c1] ^ dummies[c2]):
        print(c1,c2)

However, this also picks up the fact that in your example the only female is also the only person with brown eyes, so gender_male and eyes_brown are complementary too. Hence the following is printed:

gender_female gender_male
gender_male eyes_brown
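One possible way to avoid such coincidental matches is to compare only dummies that come from the same source column. The sketch below is an assumption-laden variation on the approach above, not the answerer's code: it builds the dummies one column at a time and collects complementary pairs in a list named redundant (a name chosen here for illustration).

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame(data=[['male', 'blue'], ['female', 'brown'], ['male', 'black']],
                  columns=['gender', 'eyes'])

redundant = []
for col in df.columns:
    # Dummies built from a single source column only.
    dummies = pd.get_dummies(df[col], prefix=col)
    for c1, c2 in combinations(dummies.columns, 2):
        # Complementary pair: the two columns differ in every row.
        if (dummies[c1] != dummies[c2]).all():
            redundant.append((c1, c2))

print(redundant)
```

With the example data, only the gender pair is reported, because no two eyes dummies differ in every row.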