Note: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/16938441/

Time: 2020-09-13 20:52:34  Source: igfitidea

How to remove duplicate columns from a dataframe using python pandas

python pandas

Asked by Neer

By grouping two columns I made some changes.

I generated a file using python, it resulted in 2 duplicate columns. How to remove duplicate columns from a dataframe?

Answered by Andy Hayden

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4

If they have different names, you can use drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
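
One thing worth knowing about the transpose approach: transposing a frame with mixed column types usually leaves every column as object dtype afterwards. Here is a minimal, self-contained sketch of the same drop_duplicates call, with an optional infer_objects() step to recover numeric dtypes. The sample frame and the names deduped and restored are made up for illustration.

```python
import pandas as pd

# Sample frame: columns B and C hold identical values under different names.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4]})

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped = df.T.drop_duplicates().T

# The round trip through a mixed-type transpose typically leaves object
# columns; infer_objects() asks pandas to pick better dtypes where it can.
restored = deduped.infer_objects()

print(list(restored.columns))
print(restored.dtypes)
```

The first of each set of identical columns is kept, so B survives and C is dropped here.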

read_csv will usually ensure they have different names...
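
As a small illustration of that renaming, the sketch below reads an in-memory CSV whose header repeats the name B. The CSV text is invented for the example; with default settings pandas appends a numeric suffix to the repeated name, so a name-based check would not catch the duplicate and a content-based one is needed.

```python
import io

import pandas as pd

# Made-up CSV text with a repeated header name.
csv_text = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# With default settings the second "B" is renamed to "B.1".
print(list(df.columns))

# The names are now unique, so compare contents instead of names.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
```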

Answered by kalu

Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442

Answered by Francisco López-Sancho

This is the best I found so far.

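import numpy as np

# Collect the names of columns whose values repeat an earlier column.
# Assumes column names are unique, so df[name] returns a single column.
remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])

# Drop the collected duplicate columns in place.
df.drop(remove, axis=1, inplace=True)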

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code

Answered by Dan Carter

I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):

df.drop(df.columns[i], axis=1)
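
One caveat, sketched below under assumed data: df.columns[i] is a label, so when the duplicate column shares its name with another column, drop removes every column with that name. A positional boolean mask with iloc removes only the column at position i. The frame, the position i = 2, and the names dropped_by_label, keep and dropped_by_position are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame with a repeated column name.
df = pd.DataFrame([["a", 4, 4], ["b", 4, 4]], columns=["A", "B", "B"])
i = 2  # hypothetical position of the unwanted duplicate

# Label-based drop: removes every column named "B", not just position i.
dropped_by_label = df.drop(df.columns[i], axis=1)
print(list(dropped_by_label.columns))

# Position-based alternative: keep every position except i.
keep = np.ones(df.shape[1], dtype=bool)
keep[i] = False
dropped_by_position = df.iloc[:, keep]
print(list(dropped_by_position.columns))
```

If the column names are already unique, the one-line drop above behaves the same as the positional version.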

Answered by yugandhar

This is already answered in "python pandas remove duplicate columns". The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], then it generates [False, False, True]. Invert it and call the result column_selector.

Using this vector with the loc method of df, which selects rows and columns, we can remove the duplicate columns. df.loc[:, column_selector] selects only the columns whose names have not appeared before.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
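
For a quick end-to-end check, here is a small self-contained sketch of the same idea on a made-up frame, using the column names from the example above. Note that this compares names only, so columns with the same name but different values are still treated as duplicates.

```python
import pandas as pd

# Made-up frame whose third column repeats the name "Col1".
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# duplicated() marks names already seen; invert to keep first occurrences.
column_selector = ~df.columns.duplicated()
print(column_selector)

deduped = df.loc[:, column_selector]
print(deduped)
```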