Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/44561649/

Time: 2020-09-14 03:47:23  Source: igfitidea

How to drop duplicated columns based on column name in pandas

pandas

Asked by X.Z

Assume I have a table like the one below:

    A   B   C   B
0   0   1   2   3
1   4   5   6   7

I'd like to drop the duplicated column B. I tried to use drop_duplicates, but it seems to work only on duplicated data, not on the header. Does anyone know how to do this?

Thanks
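The answers below operate on a frame called df that the question never constructs. A minimal sketch of how it could be built, assuming the values shown in the table above (the variable name df matches the answers; nothing else is taken from the original post):

```python
import pandas as pd

# Rebuild the example table; the column label "B" appears twice.
df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=["A", "B", "C", "B"])

# The column index is not unique, so selecting "B" returns both columns.
print(df.columns.is_unique)
print(df["B"].shape)
```

Because the label is ambiguous, df["B"] comes back as a two-column DataFrame rather than a Series, which is why the duplicate has to be removed by position or by a boolean mask rather than by name.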

Answered by jezrael

Use Index.duplicated with loc or iloc and boolean indexing:

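# Boolean mask: False marks a column whose name has already appeared
print(~df.columns.duplicated())
[ True  True  True False]

# Keep only the first occurrence of each column name (label-based indexer)
df = df.loc[:, ~df.columns.duplicated()]
print(df)
   A  B  C
0  0  1  2
1  4  5  6

# The same selection with the position-based indexer
df = df.iloc[:, ~df.columns.duplicated()]
print(df)
   A  B  C
0  0  1  2
1  4  5  6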
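Index.duplicated also takes a keep argument, which controls which occurrence of a repeated name counts as the original. A small sketch, assuming the same example frame as in the question (the names keep_first, keep_last and drop_all are illustrative only):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=["A", "B", "C", "B"])

# keep="first" is the default: later repeats of a name are dropped.
keep_first = df.loc[:, ~df.columns.duplicated()]

# keep="last": earlier repeats are dropped, so the last "B" survives.
keep_last = df.loc[:, ~df.columns.duplicated(keep="last")]

# keep=False: every column whose name occurs more than once is dropped.
drop_all = df.loc[:, ~df.columns.duplicated(keep=False)]

print(keep_first)
print(keep_last)
print(drop_all)
```

Note that with keep="last" the surviving B column stays in its original position, so the column order becomes A, C, B.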

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
cols = ['A','B','C','B']
# [1000 rows x 30 columns], column names drawn at random from cols
df = pd.DataFrame(np.random.randint(10, size=(1000,30)), columns=np.random.choice(cols, 30))
print(df)

In [115]: %timeit (df.groupby(level=0, axis=1).first())
1000 loops, best of 3: 1.48 ms per loop

In [116]: %timeit (df.groupby(level=0, axis=1).mean())
1000 loops, best of 3: 1.58 ms per loop

In [117]: %timeit (df.iloc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 338 μs per loop

In [118]: %timeit (df.loc[:, ~df.columns.duplicated()])
1000 loops, best of 3: 346 μs per loop

Answered by piRSquared

You can use groupby.
We use the axis=1 and level=0 parameters to specify that we are grouping by columns. Then use the first method to grab the first column within each group defined by unique column names.

df.groupby(level=0, axis=1).first()

   A  B  C
0  0  1  2
1  4  5  6

We could also have used last:

df.groupby(level=0, axis=1).last()

   A  B  C
0  0  3  2
1  4  7  6

Or mean:

df.groupby(level=0, axis=1).mean()

   A  B  C
0  0  2  2
1  4  6  6
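Recent pandas releases deprecate the axis argument of groupby, so the calls above may warn or fail depending on the installed version (worth checking against the release notes for your version). A hedged sketch of an equivalent approach that transposes, groups the rows by label, and transposes back; the variable names are illustrative and not from the original answer:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=["A", "B", "C", "B"])

# After transposing, the duplicated column names become duplicated row labels,
# so an ordinary row-wise groupby on index level 0 can combine them.
first = df.T.groupby(level=0).first().T
last = df.T.groupby(level=0).last().T
mean = df.T.groupby(level=0).mean().T

print(first)
print(last)
print(mean)
```

Transposing is only cheap and lossless when all columns share a dtype, as they do here; for mixed-dtype frames the boolean-mask approach with columns.duplicated is likely the safer choice.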