Fast method for removing duplicate columns in pandas.DataFrame
Original question: http://stackoverflow.com/questions/32041245/
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by Peter Klauke
By using
df_ab = pd.concat([df_a, df_b], axis=1, join='inner')
I get a DataFrame that looks like this:
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
and I want to remove the duplicate columns:
   A   B
0  5  10
1  6  19
Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name hold the same values in every row. I have a working solution:
df_ab = df_ab.T.drop_duplicates().T
but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer a solution that doesn't require explicit knowledge of the column names.
Accepted answer by behzad.nouri
Answer by Prayson W. Daniel
The easiest way is:
df = df.loc[:,~df.columns.duplicated()]
One line of code can change everything.
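To make the one-liner above concrete, here is a minimal sketch on a small frame built to mirror the one in the question. The frame and the names `mask` and `deduped` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical frame mirroring the question: labels A and B each appear twice.
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=["A", "A", "B", "B"])

# duplicated() flags every repeat of a column label after its first occurrence.
mask = df.columns.duplicated()
print(mask.tolist())  # [False, True, False, True]

# Keep only the columns whose label has not been seen before.
deduped = df.loc[:, ~mask]
print(deduped)
```

This compares column labels only, never the values, which is why it avoids the cost of transposing. It is only safe under the question's premise that equally named columns hold identical data.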
Answer by unutbu
Perhaps you would be better off avoiding the problem altogether by using pd.merge instead of pd.concat:
df_ab = pd.merge(df_a, df_b, how='inner')
This will merge df_a and df_b on all the columns they have in common.
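As a rough illustration of the merge approach, the sketch below builds two overlapping subsets of one frame. The parent frame `base` and the extra columns C and D are hypothetical, added only so the shared and non-shared columns are easy to tell apart.

```python
import pandas as pd

# Hypothetical parent frame; df_a and df_b are overlapping column subsets of it.
base = pd.DataFrame({"A": [5, 6], "B": [10, 19], "C": [1, 2], "D": [3, 4]})
df_a = base[["A", "B", "C"]]
df_b = base[["A", "B", "D"]]

# With no `on` argument, merge joins on the columns both frames share (A and B),
# so each shared column appears only once in the result.
df_ab = pd.merge(df_a, df_b, how="inner")
print(df_ab)
```

Note that merge matches rows by the values in the shared columns rather than by index, so rows whose shared values repeat would be paired more than once. Whether that matters depends on the data.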
Answer by James Wright
For those who skip the question and look straight at the answers: the simplest way for me is to use the OP's solution (assuming you don't run into the same performance issues he did). Transpose the DataFrame, use drop_duplicates, and then transpose it again:
df.T.drop_duplicates().T
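A small sketch of the transpose approach on a frame shaped like the one in the question may help. The sample frame and the name `deduped` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the question: labels A and B each appear twice.
df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]], columns=["A", "A", "B", "B"])

# Transposing turns columns into rows, so drop_duplicates() can compare them
# by value; transposing back restores the original orientation.
deduped = df.T.drop_duplicates().T
print(deduped)
```

Unlike a label-based check, this compares the values themselves, so two differently named columns with identical contents would also be collapsed into one. Transposing a frame with mixed column types can also change the dtypes, which is worth checking on real data.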

