Python: How to count duplicate rows in a pandas dataframe?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35584085/
How to count duplicate rows in pandas dataframe?
Asked by jss367
I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:
df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})
I get a df that looks like this:
   one  two
0    1    1
1    1    2
2    1    1
I imagine the first step is to find all the different unique rows, which I do by:
df.drop_duplicates()
This gives me the following df:
   one  two
0    1    1
1    1    2
Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:
Row Count
[1 1] 2
[1 2] 1
How should I go about doing this last step?
Edit:
Here's a larger example to make it more clear:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})
gives me:
     one  three    two
0   True   True   True
1   True  False  False
2   True  False  False
3  False  False   True
I want a result that tells me:
Row Count
[True True True] 1
[True False False] 2
[False False True] 1
Accepted answer by EdChum
You can groupby on all the columns and call size; the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
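As a side note, on newer pandas (1.1 and later, which is my assumption about your environment) DataFrame.value_counts does the same per-row counting in one call, sorted largest-first; a minimal sketch on the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# count each unique row, sorted descending (requires pandas >= 1.1)
counts = df.value_counts()
```

counts is a Series keyed by the unique (one, two) combinations, so counts[(1.0, 1.0)] gives 2.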
Answered by Jarad
df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values,
                   index=df['str_list'].value_counts().index,
                   columns=['Count'])
Produces:
>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1
If the index values must be a list, you could take the above code a step further with:
df1.index = df1.index.str.split()
Produces:
           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
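If the int/str round-trip above feels fragile (it truncates floats), a variant is to build a hashable tuple per row and count those instead; this is a sketch under the same data, with 'row' being just an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})

# turn each row into a hashable tuple, then count with value_counts
df['row'] = df[['one', 'two', 'three']].apply(tuple, axis=1)
row_counts = df['row'].value_counts()
```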
Answered by Denis
df.groupby(df.columns.tolist()).size().reset_index()\
    .rename(columns={0: 'records'})
   one  two  records
0    1    1        2
1    1    2        1
Answered by Arash
If you like to count duplicates on particular column(s):
len(df['one'])-len(df['one'].drop_duplicates())
If you want to count duplicates on entire dataframe:
len(df)-len(df.drop_duplicates())
Or simply you can use DataFrame.duplicated(subset=None, keep='first'):
df.duplicated(subset='one', keep='first').sum()
where
subset: column label or sequence of labels(by default use all of the columns)
keep: {'first', 'last', False}, default 'first'
- first: Mark duplicates as True except for the first occurrence.
- last: Mark duplicates as True except for the last occurrence.
- False: Mark all duplicates as True.
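A quick sketch of how the keep choices change the resulting count, using the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

# keep='first': only the later copy of the [1, 1] row is flagged
n_after_first = df.duplicated(keep='first').sum()   # 1

# keep=False: every row that has a twin is flagged
n_all_twins = df.duplicated(keep=False).sum()       # 2
```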
Answered by olisteadman
None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:
# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
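If this is the number you are after, it should also fall directly out of duplicated(), since each group leaves exactly one first occurrence behind; a small cross-check on the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
n_to_cut = dups['count'].sum() - dups.shape[0]

# same number via duplicated(): flags all but the first occurrence of each row
n_flagged = df.duplicated().sum()
```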
Answered by Sergey Zaitsev
I use:
used_features = [
    'one',
    'two',
    'three',
]
df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()
which gives the count of duplicated rows, and you can then analyse them via the new column. I didn't see such a solution here.
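For example, the flag can serve as a boolean mask to pull out just the duplicated rows for a closer look (a sketch on the question's first example):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

used_features = ['one', 'two']
df['is_duplicated'] = df.duplicated(used_features)

# keep only the rows flagged as duplicates of an earlier row
flagged = df[df['is_duplicated']]
```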
Answered by ThibTrip
Ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively), since pandas will infer the data type again after replacing "" with np.nan.
import pandas as pd
import numpy as np
# create test data
df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})
# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
        .groupby(df.columns.tolist()).apply(len)
        .rename('group_count')
        .reset_index()
        .replace('', np.nan)
        .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
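On pandas 1.1 and later the fillna round-trip can likely be dropped, because groupby accepts dropna=False and keeps group keys that contain NaN; a sketch of that variant on the same data (the version cutoff is my understanding, so verify against your install):

```python
import pandas as pd

df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})

# dropna=False keeps groups whose keys contain NaN (pandas >= 1.1)
counts = (df.groupby(df.columns.tolist(), dropna=False)
            .size()
            .rename('group_count')
            .reset_index()
            .sort_values('group_count', ascending=False))
```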