Python: How to analyze all duplicate entries in this Pandas DataFrame?
Original source: http://stackoverflow.com/questions/26244309/
Warning: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Asked by gammapoint
I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by:
import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print(frame)
   key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8
As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame like so:
frame[frame.duplicated(['key1','key2'])]
I get:
   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
(i.e., rows 0 and 1 do not show up because the duplicated method does not flag the first occurrence of each duplicate as True).
That is my first problem. My second problem deals with how to extract the descriptive statistics from this information. Forgetting the missing duplicates for the moment, let's say I want to compute the .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and call these methods on the groupby object like so:
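a = frame[frame.duplicated(['key1','key2'])]  # inferred from the output below: `a` is the filtered frame shown above
a.groupby(['key1','key2']).min()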
which gives:
           key1  key2  data
key1 key2
1    2        1     2     6
2    2        2     2     1
The data I want is obviously here, but what's the best way for me to extract it? How do I index the resulting object to get what I want (which is the key1,key2,data info)?
Accepted answer by Yoel
EDIT for Pandas 0.17 or later:
As the take_last argument of the duplicated() method was deprecated in favour of the new keep argument since Pandas 0.17, please refer to this answer for the correct approach:
- Invoke the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).
Therefore, in order to extract the required data for this specific question, the following suffices:
In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]:
           data
key1 key2
1    2        5
2    2        1

[2 rows x 1 columns]
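Since the question asks for both .min() and .max() in order to get a range, the same filter can feed a single aggregation. The following is a minimal sketch assuming Pandas 0.17 or later; the variable names dups and stats and the column name 'range' are illustrative and do not come from the original answer.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# keep=False flags every row that belongs to a duplicated (key1, key2) pair
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# One pass over the groups computes both statistics for the 'data' column
stats = dups.groupby(['key1', 'key2'])['data'].agg(['min', 'max'])

# 'range' is a hypothetical column name chosen for this sketch
stats['range'] = stats['max'] - stats['min']
print(stats)
```

Note that the list form groupby(['key1', 'key2']) is used here; recent Pandas versions treat a tuple passed to groupby as a single key rather than as a list of keys.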
Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.
For versions preceding Pandas 0.17:
We can play with the take_last argument of the duplicated() method:
take_last: boolean, default False. For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.
If we set take_last's value to True, we flag all but the last duplicate row. Combining this along with its default value of False, which flags all but the first duplicate row, allows us to flag all duplicated rows:
In [76]: frame.duplicated(['key1', 'key2'])
Out[76]:
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool
In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]:
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool
In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]:
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool
In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]:
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
[6 rows x 3 columns]
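On Pandas 0.17 or later, where take_last is deprecated, the same union of masks can be written with the keep argument: keep='last' plays the role of take_last=True and keep='first' is the default. The sketch below only illustrates that combining the two masks gives the same result as keep=False; the variable names are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# Flags every duplicate except the first occurrence (the default behaviour)
all_but_first = frame.duplicated(['key1', 'key2'], keep='first')

# Flags every duplicate except the last occurrence (formerly take_last=True)
all_but_last = frame.duplicated(['key1', 'key2'], keep='last')

# The union of both masks marks every member of a duplicated group
either = all_but_first | all_but_last

# keep=False produces the same mask in a single call
print(either.equals(frame.duplicated(['key1', 'key2'], keep=False)))
```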
Now we just need to use the groupby and min methods, and I believe the output is in the required format:
In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(('key1', 'key2')).min()
Out[81]:
           data
key1 key2
1    2        5
2    2        1
[2 rows x 1 columns]
Answer by Alex Riley
Here's one possible solution to return all duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):
>>> key1_dups = frame.key1[frame.key1.duplicated()].values
>>> key2_dups = frame.key2[frame.key2.duplicated()].values
>>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
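One caveat worth keeping in mind: this approach tests each column for duplicates independently, so it can select rows whose (key1, key2) pair is in fact unique. The small frame below is a made-up example (not from the original question) in which every value repeats within its own column but no pair repeats; the pairwise check assumes Pandas 0.17 or later for keep=False.

```python
import pandas as pd

# Hypothetical data: each key value repeats, but no (key1, key2) pair does
df = pd.DataFrame({'key1': [1, 1, 2, 2],
                   'key2': [3, 4, 3, 4]})

# Per-column approach: values that occur more than once in each column
key1_dups = df.key1[df.key1.duplicated()].values
key2_dups = df.key2[df.key2.duplicated()].values
per_column = df[df.key1.isin(key1_dups) & df.key2.isin(key2_dups)]

# Pairwise approach: rows whose (key1, key2) combination occurs more than once
pairwise = df[df.duplicated(['key1', 'key2'], keep=False)]

print(len(per_column))  # 4: every row is selected
print(len(pairwise))    # 0: no pair is actually duplicated
```

For the frame in the question the two approaches happen to agree, which is why the output above is correct.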
(Edit: actually, the df.duplicated(take_last=True) | df.duplicated() method in @Yoel's answer is neater.)
To query the results of your groupby operation, you can use loc. For example:
>>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
>>> grouped = dups.groupby(['key1','key2']).min()
>>> grouped
           data
key1 key2
1    2        5
2    2        1
>>> grouped.loc[1, 2]
data    5
Name: (1, 2), dtype: int64
Alternatively, turn grouped back into a "normal-looking" DataFrame by resetting both indexes:
>>> grouped.reset_index(level=0).reset_index(level=0)
   key2  key1  data
0     2     1     5
1     2     2     1
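As a possible simplification, calling reset_index() with no arguments moves all index levels back into columns in one step, which also keeps key1 before key2. This is a minimal sketch assuming Pandas 0.17 or later for the keep=False filter; the name flat is illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# All rows whose (key1, key2) pair occurs more than once
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]
grouped = dups.groupby(['key1', 'key2']).min()

# With no arguments, reset_index() turns every index level into a column
flat = grouped.reset_index()
print(flat)
```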
Answer by user666
To get a list of all the duplicated entries with Pandas version 0.17 or later, you can simply set keep=False in the duplicated function.
frame[frame.duplicated(['key1','key2'],keep=False)]
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
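If it is also useful to know how many rows each duplicated key pair has, the filtered frame can be grouped and sized. This is only a sketch of one way to do it, assuming Pandas 0.17 or later; the name counts is illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# Keep every row that shares its (key1, key2) pair with another row
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# size() counts the rows in each group, returning a Series indexed by the key pair
counts = dups.groupby(['key1', 'key2']).size()
print(counts)
```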

