Python: How to count duplicate rows in a pandas dataframe?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35584085/
How to count duplicate rows in pandas dataframe?
Asked by jss367
I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:
df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})
I get a df that looks like this:
   one  two
0    1    1
1    1    2
2    1    1
I imagine the first step is to find all the different unique rows, which I do by:
df.drop_duplicates()
This gives me the following df:
   one  two
0    1    1
1    1    2
Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:
Row Count
[1 1] 2
[1 2] 1
How should I go about doing this last step?
Edit:
Here's a larger example to make it more clear:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})
gives me:
     one  three    two
0   True   True   True
1   True  False  False
2   True  False  False
3  False  False   True
I want a result that tells me:
Row Count
[True True True] 1
[True False False] 2
[False False True] 1
Accepted answer by EdChum
You can groupby on all the columns and call size; the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
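As a side note, on newer pandas (1.1 and later, which is my assumption about your environment) DataFrame.value_counts does the same per-row counting in one call, sorted largest-first; a minimal sketch on the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# count each unique row, sorted descending (requires pandas >= 1.1)
counts = df.value_counts()
```

counts is a Series keyed by the unique (one, two) combinations, so counts[(1.0, 1.0)] gives 2.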
Answered by Jarad
df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values,
                   index=df['str_list'].value_counts().index,
                   columns=['Count'])
Produces:
>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1
If the index values must be a list, you could take the above code a step further with:
df1.index = df1.index.str.split()
Produces:
           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
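If the int/str round-trip above feels fragile (it truncates floats), a variant is to build a hashable tuple per row and count those instead; this is a sketch under the same data, with 'row' being just an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})

# turn each row into a hashable tuple, then count with value_counts
df['row'] = df[['one', 'two', 'three']].apply(tuple, axis=1)
row_counts = df['row'].value_counts()
```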
Answered by Denis
df.groupby(df.columns.tolist()).size().reset_index()\
    .rename(columns={0: 'records'})
   one  two  records
0    1    1        2
1    1    2        1
Answered by Arash
If you like to count duplicates on particular column(s):
len(df['one'])-len(df['one'].drop_duplicates())
If you want to count duplicates on entire dataframe:
len(df)-len(df.drop_duplicates())
Or simply you can use DataFrame.duplicated(subset=None, keep='first'):
df.duplicated(subset='one', keep='first').sum()
where
subset: column label or sequence of labels(by default use all of the columns)
keep: {'first', 'last', False}, default 'first'
- first: Mark duplicates as True except for the first occurrence.
- last: Mark duplicates as True except for the last occurrence.
- False: Mark all duplicates as True.
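A quick sketch of how the keep choices change the resulting count, using the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

# keep='first': only the later copy of the [1, 1] row is flagged
n_after_first = df.duplicated(keep='first').sum()   # 1

# keep=False: every row that has a twin is flagged
n_all_twins = df.duplicated(keep=False).sum()       # 2
```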
Answered by olisteadman
None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:
# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
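If this is the number you are after, it should also fall directly out of duplicated(), since each group leaves exactly one first occurrence behind; a small cross-check on the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
n_to_cut = dups['count'].sum() - dups.shape[0]

# same number via duplicated(): flags all but the first occurrence of each row
n_flagged = df.duplicated().sum()
```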
Answered by Sergey Zaitsev
I use:
used_features = [
    'one',
    'two',
    'three',
]
df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()
which gives the count of duplicated rows, and you can then analyse them via the new column. I didn't see such a solution here.
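For example, the flag can serve as a boolean mask to pull out just the duplicated rows for a closer look (a sketch on the question's first example):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

used_features = ['one', 'two']
df['is_duplicated'] = df.duplicated(used_features)

# keep only the rows flagged as duplicates of an earlier row
flagged = df[df['is_duplicated']]
```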
Answered by ThibTrip
Ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively), since pandas will infer the data type again after replacing "" with np.nan.
import pandas as pd
import numpy as np
# create test data
df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})
# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
        .groupby(df.columns.tolist()).apply(len)
        .rename('group_count')
        .reset_index()
        .replace('', np.nan)
        .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
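On pandas 1.1 and later the fillna round-trip can likely be dropped, because groupby accepts dropna=False and keeps group keys that contain NaN; a sketch of that variant on the same data (the version cutoff is my understanding, so verify against your install):

```python
import pandas as pd

df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})

# dropna=False keeps groups whose keys contain NaN (pandas >= 1.1)
counts = (df.groupby(df.columns.tolist(), dropna=False)
            .size()
            .rename('group_count')
            .reset_index()
            .sort_values('group_count', ascending=False))
```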