Python (pandas): remove duplicates based on two columns, keeping the row with the max value in another column

Question

Asked by Elsalex

I have a dataframe which contains duplicate values according to two columns (A and B):

I want to remove duplicates keeping the row with max value in column C. This would lead to:

I cannot figure out how to do that. Should I use drop_duplicates(), or something else?

Answer 1

Accepted answer by JoeCondron

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group, but it has the same length and the same index as df. If you haven't used .transform before, printing c_maxes might be a good way to see how it works.
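
As a minimal sketch of what the transform step returns, the snippet below uses a small illustrative frame (the data is an assumption, not the asker's) and prints c_maxes before applying the filter:

```python
import pandas as pd

# Illustrative data (an assumption): the pairs (1, 2) and (3, 4) each appear twice.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# One value per original row: the maximum of C within that row's (A, B) group.
c_maxes = df.groupby(['A', 'B'])['C'].transform('max')
print(c_maxes.tolist())  # [4, 4, 1, 8, 8]

# Keep only the rows whose C equals the maximum of their group.
result = df.loc[df['C'] == c_maxes]
print(result)
```

Note that if several rows in a group share the maximum C, this filter keeps all of them.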

Another approach, using drop_duplicates, would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT: From pandas 0.18 up, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
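
A small check, on made-up data with a deliberate tie, that the two sort-based variants select the same maximum per group, and how they differ from the groupby filter when the maximum is tied:

```python
import pandas as pd

# Illustrative data (an assumption): the (1, 2) group has a tie at C == 4.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [4, 4, 1, 0, 8]})
key = ['A', 'B']

# Variant 1: ascending sort, keep the last row of each (A, B) group.
by_last = df.sort_values('C').drop_duplicates(subset=key, keep='last')

# Variant 2: descending sort, keep the first row (the default) of each group.
by_first = df.sort_values('C', ascending=False).drop_duplicates(subset=key)

# Compare the two results after putting them in the same group order.
left = by_last.sort_values(key).reset_index(drop=True)
right = by_first.sort_values(key).reset_index(drop=True)
same = left.equals(right)
print(same)  # True

# The groupby/transform filter keeps every tied row instead of exactly one.
by_filter = df.loc[df['C'] == df.groupby(key)['C'].transform('max')]
print(len(by_last), len(by_first), len(by_filter))  # 3 3 4
```

So drop_duplicates always yields exactly one row per (A, B) pair, whereas the equality filter can return more than one when the maximum is tied.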

In any case, the groupby solution seems to perform significantly better:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Answer 2

Answer by b10n

I think groupby should work.

df.groupby(['A', 'B']).max()['C']

If you need a DataFrame back, you can chain a reset_index call:

df.groupby(['A', 'B']).max()['C'].reset_index()
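
A sketch of what this returns when the frame has more columns than A, B and C. The extra column D is hypothetical, and the idxmax variant is an alternative that is not part of this answer, shown only as one way to keep whole rows:

```python
import pandas as pd

# Illustrative data (an assumption); D is a hypothetical extra column.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'D': [10, 20, 30, 40, 50]})

# The answer's expression: one row per (A, B) with the maximum C, and nothing else.
maxes = df.groupby(['A', 'B']).max()['C'].reset_index()
print(maxes.columns.tolist())  # ['A', 'B', 'C']

# Alternative (not from the answer): look up the index label of each group's
# maximum C and select those full rows, so D is carried along.
full_rows = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
print(full_rows['D'].tolist())  # [20, 30, 50]
```

The plain groupby/max result is an aggregate, so columns other than the keys and C are not carried over from the original rows.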

Answer 3

Answer by AlexT

You can do it with drop_duplicates, as you wanted:

# initialisation
import pandas as pd
d = pd.DataFrame({'A' : [1,1,2,3,3], 'B' : [2,2,7,4,4],  'C' : [1,4,1,0,8]})

d = d.sort_values("C", ascending=False)
d = d.drop_duplicates(["A","B"])

If it's important to keep the original row order:

d = d.sort_index()

Answer 4

Answer by Sudharsan

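You can do this simply by using the pandas drop_duplicates function. Note that keep='last' keeps the last occurrence of each (A, B) pair, which is the row with the largest C only if the frame is already sorted by C.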

df.drop_duplicates(['A','B'],keep= 'last')
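
A minimal sketch, on made-up data, of why the sort order matters for this call: keep='last' picks the last occurrence of each pair, not the largest C.

```python
import pandas as pd

# Illustrative data (an assumption): the maximum C sits in the first row.
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [4, 1]})

# Without sorting, the last occurrence is kept, which here is the smaller C.
unsorted_result = df.drop_duplicates(['A', 'B'], keep='last')
print(unsorted_result['C'].tolist())  # [1]

# Sorting by C first makes the last occurrence the maximum.
sorted_result = df.sort_values('C').drop_duplicates(['A', 'B'], keep='last')
print(sorted_result['C'].tolist())  # [4]
```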
