pandas: How to remove duplicate columns from a DataFrame using Python pandas

Question

Asked by Neer

By grouping two columns I made some changes.

I generated a file using python, it resulted in 2 duplicate columns. How to remove duplicate columns from a dataframe?

Answer 1

Answered by Andy Hayden

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4
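
For anyone who wants to run this outside an interactive session, here is a minimal sketch of the same groupby approach. The toy frame mirrors the one shown above; the variable name deduped is only illustrative.

```python
import pandas as pd

# Same toy frame as above: two columns share the name "B".
df = pd.DataFrame(
    [["a", 4, 4], ["b", 4, 4], ["c", 4, 4]],
    columns=["A", "B", "B"],
)

# Transpose so the column names become the index, keep the first row
# for each name, then transpose back.
deduped = df.T.groupby(level=0).first().T

print(list(deduped.columns))
```

Note that the round trip through the transpose can leave the columns with object dtype when the frame mixes strings and numbers, so a dtype conversion afterwards may be needed.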

If they have different names you can drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
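
A self-contained sketch of the drop_duplicates approach, assuming the same toy data with the duplicate column named "C". Here duplicates are detected by value, not by name.

```python
import pandas as pd

# "B" and "C" have different names but identical values.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4]})

# After transposing, each original column is a row, so drop_duplicates
# removes rows (columns) whose values repeat an earlier one.
deduped = df.T.drop_duplicates().T

print(list(deduped.columns))
```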

Note that read_csv will usually ensure the columns have different names...

Answer 2

Answered by kalu

Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442

Answer 3

Answered by Francisco López-Sancho

This is the best I found so far.

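import numpy as np

# Collect the names of columns whose values repeat an earlier column.
remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        # Compare column contents, not names; skip columns already marked.
        if cols[j] not in remove and np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])

# Drop the duplicates in place (this assumes the column names are unique).
df.drop(remove, axis=1, inplace=True)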

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
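
The loop above can also be wrapped in a helper that works by position, so it does not rely on unique column names. This is only a sketch: the function name drop_duplicate_value_columns and the sample frame are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def drop_duplicate_value_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose values repeat an earlier column."""
    n_cols = df.shape[1]
    keep = np.ones(n_cols, dtype=bool)
    for i in range(n_cols - 1):
        if not keep[i]:
            continue  # already marked as a duplicate of an earlier column
        left = df.iloc[:, i].to_numpy()
        for j in range(i + 1, n_cols):
            if keep[j] and np.array_equal(left, df.iloc[:, j].to_numpy()):
                keep[j] = False
    # A boolean mask selects columns by position, so repeated names are fine.
    return df.loc[:, keep].copy()


# Hypothetical sample data: "C" and "D" repeat the values of "B".
sample = pd.DataFrame(
    {"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4], "D": [4, 4, 4]}
)
result = drop_duplicate_value_columns(sample)
print(list(result.columns))
```

np.array_equal treats NaN as unequal to itself by default, so two columns that contain missing values will not be matched by this sketch. The original frame is left unchanged.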

Answer 4

Answered by Dan Carter

I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):

df.drop(df.columns[i], axis=1)
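
One caveat worth checking: df.columns[i] yields a label, and drop removes every column carrying that label, so when the duplicate shares its name with the column you want to keep, both disappear. A minimal sketch of dropping strictly by position with a boolean mask follows; the position i = 2 and the sample frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["a", 4, 4], ["b", 4, 4], ["c", 4, 4]],
    columns=["A", "B", "B"],
)

i = 2  # position of the unwanted column (hypothetical; find yours by inspection)

# Dropping by label removes both "B" columns, because they share a name.
by_label = df.drop(df.columns[i], axis=1)

# Build a boolean mask that is False only at position i, then select with it.
keep = np.ones(df.shape[1], dtype=bool)
keep[i] = False
by_position = df.loc[:, keep]

print(list(by_label.columns))
print(list(by_position.columns))
```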

Answer 5

Answered by yugandhar

This is already answered in "python pandas remove duplicate columns". The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert that vector and call the result column_selector.

Using that vector with the loc method of df, which selects rows and columns, we can remove the duplicate columns: df.loc[:, column_selector] keeps only the first occurrence of each column name.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
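
As a runnable illustration of this idea, here is a short sketch using the column names from the example in the text. The numeric values and the names seen_before and deduped are made up for the demonstration.

```python
import pandas as pd

# "Col1" appears twice; the values are arbitrary.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# True wherever the column name has already appeared to the left.
seen_before = df.columns.duplicated()

# Invert the mask to keep only the first occurrence of each name.
column_selector = ~seen_before
deduped = df.loc[:, column_selector]

print(seen_before.tolist())
print(list(deduped.columns))
```

This compares names only, so two columns with different names but identical values are both kept. It also avoids transposing the frame.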
