pandas: Fast method for removing duplicate columns in pandas.DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32041245/

Date: 2020-09-13 23:46:58 | Source: igfitidea

Fast method for removing duplicate columns in pandas.Dataframe

python, pandas

Asked by Peter Klauke

So, by using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner')

I get a DataFrame that looks like this:

    A    A    B    B
0   5    5   10   10
1   6    6   19   19

and I want to remove the duplicated columns:

    A     B
0   5    10
1   6    19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name contain identical values. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T

but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer one that doesn't require explicit knowledge of the column names.

Accepted answer by behzad.nouri

You may use np.unique to get the indices of the unique columns, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
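
A self-contained version of this approach might look like the sketch below; the DataFrame construction is assumed for illustration, mirroring the example in the question. Note that np.unique sorts the labels, so the surviving columns come back in sorted order rather than in their original position.

import numpy as np
import pandas as pd

# Rebuild the example frame from the question (assumed values).
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=['A', 'A', 'B', 'B'])

# np.unique returns the sorted unique labels and the index of the first
# occurrence of each; .iloc then keeps only those columns.
_, i = np.unique(df.columns, return_index=True)
df_unique = df.iloc[:, i]
print(df_unique)
#    A   B
# 0  5  10
# 1  6  19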

Answer by Prayson W. Daniel

The easiest way is:

df = df.loc[:,~df.columns.duplicated()]

One line of code can change everything
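
As a rough sketch (the example frame below is assumed, mirroring the question), this keeps the first occurrence of each column label and, unlike the np.unique approach, preserves the original column order:

import pandas as pd

# Rebuild the example frame from the question (assumed values).
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=['A', 'A', 'B', 'B'])

# columns.duplicated() flags every repeated label after its first occurrence;
# negating it keeps only the first column with each name.
df = df.loc[:, ~df.columns.duplicated()]
print(df)
#    A   B
# 0  5  10
# 1  6  19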

Answer by unutbu

Perhaps you would be better off avoiding the problem altogether by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner')

This will merge df_a and df_b on all the columns they share.
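
A minimal sketch of this alternative, assuming df_a and df_b really do hold identical values in their shared columns as the question states:

import pandas as pd

# Hypothetical subsets standing in for the OP's df_a and df_b.
df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

# Without an `on=` argument, pd.merge joins on every column name the two
# frames have in common, so shared columns never get duplicated.
df_ab = pd.merge(df_a, df_b, how='inner')
print(df_ab)
#    A   B
# 0  5  10
# 1  6  19

One caveat: because the join uses every common column as the key, duplicate rows in those columns would be multiplied in the result.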

Answer by James Wright

For those who skip the question and look straight at the answers, the simplest way for me is to use the OP's solution (assuming you don't run into the same performance issues he did): transpose the DataFrame, use drop_duplicates, and then transpose it again:

df.T.drop_duplicates().T
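
Put together as a runnable sketch (again with the question's example data assumed), along with a note on why it can be slow on large frames:

import pandas as pd

# Rebuild the example frame from the question (assumed values).
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=['A', 'A', 'B', 'B'])

# Transposing turns columns into rows, drop_duplicates drops the repeated
# rows (i.e. the repeated columns), and the second transpose restores the
# original orientation. drop_duplicates has to compare whole columns, so
# this gets slow when the frame has many rows.
df_unique = df.T.drop_duplicates().T
print(df_unique)
#    A   B
# 0  5  10
# 1  6  19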