Python: How do I analyze all duplicate entries in this Pandas DataFrame?

Question

Asked by gammapoint

I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by:

import pandas as pd
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2], 'key2': [2, 2, 1, 2, 2, 4, 2, 2], 'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
print(frame)


     key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8

As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame like so:

frame[frame.duplicated(['key1','key2'])]

I get

   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

(i.e., rows 0 and 1 do not show up because the duplicated method does not flag them as True).

That is my first problem. My second problem deals with how to extract the descriptive statistics from this information. Setting aside the missing duplicates for the moment, let's say I want to compute the .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and call these methods on the groupby object like so:

a = frame[frame.duplicated(['key1','key2'])]
a.groupby(['key1','key2']).min()

which gives

           key1  key2  data
key1 key2                  
1    2        1     2     6
2    2        2     2     1

The data I want is obviously here, but what's the best way for me to extract it? How do I index the resulting object to get what I want (which is the key1,key2,data info)?

Answer 1

Accepted answer by Yoel

EDIT for Pandas 0.17 or later:

As the take_last argument of the duplicated() method was deprecated in favour of the new keep argument in Pandas 0.17, please refer to this answer for the correct approach:

Invoke the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).

Therefore, in order to extract the required data for this specific question, the following suffices:

In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
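
Since the question asks for both the minimum and the maximum (to get a range), here is a minimal sketch that computes them in one pass. It assumes Pandas 0.17 or later for keep=False; the column name 'range' is made up for illustration and is not part of the original answer.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# keep=False flags every member of a duplicated (key1, key2) group.
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# Aggregate the min and max of 'data' per key pair in a single call.
stats = dups.groupby(['key1', 'key2'])['data'].agg(['min', 'max'])

# 'range' is a hypothetical column name chosen for this example.
stats['range'] = stats['max'] - stats['min']
print(stats)
```

Selecting the 'data' column before aggregating keeps the key columns out of the result, so only the statistics remain.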

Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.

For versions preceding Pandas 0.17:

We can play with the take_last argument of the duplicated() method:

take_last: boolean, default False
For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.

If we set take_last's value to True, we flag all but the last duplicate row. Combining this along with its default value of False, which flags all but the first duplicate row, allows us to flag all duplicated rows:

In [76]: frame.duplicated(['key1', 'key2'])
Out[76]: 
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]: 
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]: 
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]: 
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

[6 rows x 3 columns]
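
For readers on a newer Pandas, where take_last is no longer available, the same trick can be sketched with the keep argument: keep='first' corresponds to the old default and keep='last' to take_last=True. The variable names below are illustrative only. The check at the end suggests that OR-ing the two masks gives the same result as keep=False.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
keys = ['key1', 'key2']

# Flags all but the first occurrence (old default, take_last=False).
all_but_first = frame.duplicated(keys, keep='first')
# Flags all but the last occurrence (old take_last=True).
all_but_last = frame.duplicated(keys, keep='last')

# A row in a duplicated group is missed by at most one of the two masks,
# so their union flags every member of the group.
combined = all_but_first | all_but_last

# keep=False flags every member directly.
all_flagged = frame.duplicated(keys, keep=False)
print(combined.equals(all_flagged))
```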

Now we just need to use the groupby and min methods, and I believe the output is in the required format:

In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(['key1', 'key2']).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]

Answer 2

Answered by Alex Riley

Here's one possible solution to return all duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):

>>> key1_dups = frame.key1[frame.key1.duplicated()].values
>>> key2_dups = frame.key2[frame.key2.duplicated()].values
>>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

(Edit: actually, the df.duplicated(take_last=True) | df.duplicated() method in @Yoel's answer is neater.)

To query the results of your groupby operation, you can use loc. For example:

>>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
>>> grouped = dups.groupby(['key1','key2']).min()
>>> grouped
           data
key1 key2      
1    2        5
2    2        1

>>> grouped.loc[1, 2]
data    5
Name: (1, 2), dtype: int64

Alternatively, turn grouped back into a "normal-looking" DataFrame by resetting both indexes:

>>> grouped.reset_index(level=0).reset_index(level=0)
   key2  key1  data
0     2     1     5
1     2     2     1
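
As a possible shortcut, calling reset_index() with no arguments moves all index levels back into columns at once and keeps key1 before key2. Passing as_index=False to groupby should give the same flat layout directly. This is a sketch using Pandas 0.17 or later; the names flat and flat_direct are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# Group, then move both index levels back into ordinary columns.
grouped = dups.groupby(['key1', 'key2']).min()
flat = grouped.reset_index()
print(flat)

# Alternative: ask groupby not to build the MultiIndex in the first place.
flat_direct = dups.groupby(['key1', 'key2'], as_index=False).min()
print(flat_direct)
```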

Answer 3

Answered by user666

To get a list of all the duplicated entries with Pandas version 0.17 or later, you can simply set keep=False in the duplicated function.

frame[frame.duplicated(['key1','key2'],keep=False)]

    key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
