How to remove duplicate columns from a DataFrame using Python pandas
Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16938441/
Asked by Neer
By grouping two columns I made some changes.
I generated a file using Python, and it ended up with 2 duplicate columns. How do I remove duplicate columns from a DataFrame?
Answered by Andy Hayden
It's probably easiest to use a groupby (assuming the duplicate columns have duplicate names too):
In [11]: df
Out[11]:
A B B
0 a 4 4
1 b 4 4
2 c 4 4
In [12]: df.T.groupby(level=0).first().T
Out[12]:
A B
0 a 4
1 b 4
2 c 4
If they have different names, you can use drop_duplicates on the transpose:
In [21]: df
Out[21]:
A B C
0 a 4 4
1 b 4 4
2 c 4 4
In [22]: df.T.drop_duplicates().T
Out[22]:
A B
0 a 4
1 b 4
2 c 4
read_csv will usually ensure they have different names...
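As a small illustration of that last point, the sketch below reads CSV text with a repeated header and then applies the transpose approach. The sample data and the names `csv_text` and `deduped` are made up for the example, and the renaming of the second B to B.1 is the default behaviour I would expect from current pandas versions.

```python
import io

import pandas as pd

# Made-up CSV text with a repeated header name "B"
csv_text = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"

df = pd.read_csv(io.StringIO(csv_text))
# read_csv renames the repeated header, so the column names are now distinct
print(list(df.columns))

# Same values under different names: drop duplicate rows of the transpose
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
```

Because the transpose mixes strings and integers, the resulting columns come back with object dtype, which may matter if you rely on numeric dtypes afterwards.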
Answered by kalu
Transposing is a bad idea when working with large DataFrames. See this answer for a memory-efficient alternative: https://stackoverflow.com/a/32961145/759442
Answered by Francisco López-Sancho
This is the best I have found so far.
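import numpy as np

remove = []  # names of columns whose values repeat an earlier column
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        # compare column i against every later column
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)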
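The same pairwise comparison can be wrapped in a function so it is easier to reuse. This is only a sketch: the function name `drop_duplicate_value_columns` and the sample frame are invented here, and it works by column position rather than by name so that it also behaves when column names repeat.

```python
import numpy as np
import pandas as pd


def drop_duplicate_value_columns(df):
    """Return a copy of df without columns whose values repeat an earlier column."""
    n_cols = df.shape[1]
    drop_positions = set()
    for i in range(n_cols - 1):
        if i in drop_positions:
            continue  # already known to duplicate an earlier column
        v = df.iloc[:, i].values
        for j in range(i + 1, n_cols):
            if j not in drop_positions and np.array_equal(v, df.iloc[:, j].values):
                drop_positions.add(j)
    keep = [k for k in range(n_cols) if k not in drop_positions]
    return df.iloc[:, keep].copy()


# Hypothetical sample data: C and D repeat the values of B
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4], "D": [4, 4, 4]})
result = drop_duplicate_value_columns(df)
print(list(result.columns))
```

Returning a new frame instead of dropping in place leaves the original untouched. The cost is still quadratic in the number of columns, so this suits frames with a modest column count.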
Answered by Dan Carter
I understand that this is an old question, but I recently had the same issue and none of these solutions worked for me, and the looping suggestion seemed a bit overkill. In the end, I simply found the index of the unwanted duplicate column and dropped the column at that index. So, provided you know the index of the column (which you could probably find via debugging or print statements), this will work:
df.drop(df.columns[i], axis=1)
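One caveat: drop works by label, so if the duplicate columns also share a name, dropping df.columns[i] removes every column with that label. A positional variant avoids this. The sketch below is an assumption about what is wanted, with a made-up frame and a hypothetical position `i`.

```python
import pandas as pd

# Made-up frame in which the duplicate column shares its name with the original
df = pd.DataFrame([["a", 4, 4], ["b", 4, 4]], columns=["A", "B", "B"])

i = 2  # hypothetical position of the unwanted duplicate column

# Select every column position except i, so only that one column is removed
keep = [k for k in range(df.shape[1]) if k != i]
result = df.iloc[:, keep]
print(list(result.columns))
```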
Answered by yugandhar
This is already answered in the question "python pandas remove duplicate columns".
The idea is that df.columns.duplicated() generates a boolean vector where each value says whether that column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert it and call the result column_selector.
Using this vector with the loc method of df, which selects rows and columns, we can remove the duplicate columns: df.loc[:, column_selector] selects only the columns to keep.
column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
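A self-contained run of this approach, using the column names from the example above. The data values and the names `mask` and `deduped` are made up for illustration. Note that this checks column names only, not column contents.

```python
import pandas as pd

# Made-up data; the third column reuses the name "Col1"
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# True wherever a column name has already appeared earlier
mask = df.columns.duplicated()
print(mask)

# Invert the mask to keep only the first occurrence of each name
column_selector = ~mask
deduped = df.loc[:, column_selector]
print(list(deduped.columns))
```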

