Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/41042996/

Date: 2020-09-14 02:35:39  Source: igfitidea

How to select duplicate rows with pandas?

Tags: python, pandas, dataframe, subtraction, divide

Asked by Federico Gentile

I have a dataframe like this:

import pandas as pd
dic = {'A':[100,200,250,300],
       'B':['ci','ci','po','pa'],
       'C':['s','t','p','w']}
df = pd.DataFrame(dic)

My goal is to split the rows into two dataframes:

  • df1 = contains all the rows whose value in column B is not repeated (unique rows).
  • df2 = contains only the rows whose value in column B is repeated.

The result should look like this:

df1 =      A  B C         df2 =     A  B C
      0  250 po p               0  100 ci s
      1  300 pa w               1  200 ci t

Note:

  • In general the dataframes could be very big and have many values that repeat in column B, so the answer should be as generic as possible.
    • If there are no duplicates, df2 should be empty and all the rows should end up in df1.

Answered by jezrael

You can use Series.duplicated with the parameter keep=False to create a mask that marks every occurrence of a duplicated value, then use boolean indexing to select those rows, and ~ to invert the mask for the unique rows:

mask = df.B.duplicated(keep=False)
print (mask)
0     True
1     True
2    False
3    False
Name: B, dtype: bool

print (df[mask])
     A   B  C
0  100  ci  s
1  200  ci  t

print (df[~mask])
     A   B  C
2  250  po  p
3  300  pa  w
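
The mask approach above can be wrapped in a small function that returns both dataframes at once. This is only a sketch: the helper name split_by_duplicates is hypothetical (not part of pandas or the original answer), and resetting the index is an assumption made to match the 0, 1 row labels shown in the question's expected result.

```python
import pandas as pd

df = pd.DataFrame({'A': [100, 200, 250, 300],
                   'B': ['ci', 'ci', 'po', 'pa'],
                   'C': ['s', 't', 'p', 'w']})


def split_by_duplicates(frame, column):
    """Hypothetical helper: return (unique_rows, repeated_rows) for `column`."""
    # keep=False marks every occurrence of a repeated value, not just the later ones
    mask = frame[column].duplicated(keep=False)
    unique_rows = frame[~mask].reset_index(drop=True)
    repeated_rows = frame[mask].reset_index(drop=True)
    return unique_rows, repeated_rows


df1, df2 = split_by_duplicates(df, 'B')
print(df1)
print(df2)

# Case from the question's note: with no repeated values in B, df2 is empty
no_dupes = df.drop_duplicates('B')
df1_nd, df2_nd = split_by_duplicates(no_dupes, 'B')
print(df2_nd.empty)
```

If the original row labels are needed later (for example to map rows back to df), drop the two reset_index calls.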