Pandas - Remove duplicates across multiple columns

Question

Asked by sylv

I am trying to efficiently remove duplicate rows in Pandas where the duplicates have their values inverted across two columns. For example, in this data frame:

import pandas as pd
key = {'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]}
df = pd.DataFrame(key,columns=['p1','p2','value'])
print(df)

      p1 p2 value
    0  a  b    1
    1  b  a    1
    2  a  c    2
    3  a  d    3
    4  b  c    5
    5  d  a    3
    6  c  b    5

I would want to remove rows 1, 5 and 6, leaving me with just:

      p1 p2 value
    0  a  b    1
    2  a  c    2
    3  a  d    3
    4  b  c    5

Thanks in advance for ideas on how to do this.

Answer 1

Answered by unutbu

Reorder the p1 and p2 values so they appear in a canonical order:

mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])

yields

In [149]: df
Out[149]: 
  p1 p2  value first second
0  a  b      1     a      b
1  b  a      1     a      b
2  a  c      2     a      c
3  a  d      3     a      d
4  b  c      5     b      c
5  d  a      3     a      d
6  c  b      5     b      c

Then you can drop_duplicates:

df = df.drop_duplicates(subset=['value', 'first', 'second'])
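
Before dropping anything, it can help to see which rows would be treated as repeats. The following is a minimal sketch, assuming the same example frame and the first/second columns built above; the names is_repeat and rows_to_drop are illustrative only. It relies on DataFrame.duplicated, which by default flags every occurrence after the first:

```python
import pandas as pd

df = pd.DataFrame({
    'p1': ['a', 'b', 'a', 'a', 'b', 'd', 'c'],
    'p2': ['b', 'a', 'c', 'd', 'c', 'a', 'b'],
    'value': [1, 1, 2, 3, 5, 3, 5],
})

# Same canonical ordering as in the answer: smaller label first.
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])

# True for each row that repeats an earlier (value, first, second) combination.
is_repeat = df.duplicated(subset=['value', 'first', 'second'])

# Index labels of the rows that drop_duplicates would remove.
rows_to_drop = df.index[is_repeat].tolist()
print(rows_to_drop)
```

For the example data this lists rows 1, 5 and 6, the same rows the question asks to remove.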

Putting it all together:

import pandas as pd
key = {'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]}
df = pd.DataFrame(key,columns=['p1','p2','value'])

mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]

yields

In [151]: df
Out[151]: 
  p1 p2  value
0  a  b      1
2  a  c      2
3  a  d      3
4  b  c      5
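
If you would rather not add and then remove helper columns, the same idea can be wrapped in a function. This is only a sketch of one possible variant, not part of the answer above: the function name drop_inverted_duplicates and its parameters are hypothetical, and it swaps the where-based reordering for numpy.sort applied within each row. It assumes exactly two pair columns whose values can be compared with each other.

```python
import numpy as np
import pandas as pd

def drop_inverted_duplicates(df, pair_cols=('p1', 'p2'), other_cols=('value',)):
    """Hypothetical helper: drop rows that repeat an earlier row
    once the two pair columns are put in sorted order."""
    # Sort within each row so ('b', 'a') and ('a', 'b') produce the same key.
    pairs = np.sort(df[list(pair_cols)].to_numpy(), axis=1)
    keys = pd.DataFrame(pairs, index=df.index, columns=['first', 'second'])
    for col in other_cols:
        keys[col] = df[col]
    # keys is a separate frame, so df itself gains no extra columns.
    return df[~keys.duplicated()]

df = pd.DataFrame({
    'p1': ['a', 'b', 'a', 'a', 'b', 'd', 'c'],
    'p2': ['b', 'a', 'c', 'd', 'c', 'a', 'b'],
    'value': [1, 1, 2, 3, 5, 3, 5],
})
result = drop_inverted_duplicates(df)
print(result)
```

On the example data this keeps rows 0, 2, 3 and 4, matching the output above, and leaves the original df unchanged.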
