pandas python: count number of duplicate entries in column

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverFlow. Original question: http://stackoverflow.com/questions/48713347/

Time: 2020-09-14 05:09:03  Source: igfitidea

python: count number of duplicate entries in column

python, pandas, count

Asked by freddy888

I have the following df:

date       id 
2000        1
2001        1 
2002        1
2000        2
2001        2
2002        2
2000        1
2001        1
2002        1

For each date, I want to count how many id entries are duplicates. The result should look like this, because id 1 occurs twice on every date:

date       id        count
2000        1          2
2001        1          2
2002        1          2
2000        2          2
2001        2          2
2002        2          2
2000        1          2
2001        1          2
2002        1          2

I tried something like this, but it gives me 1 wherever id is 2.

df["count"] = df.groupby(["date", "id"])["count"].transform("count")
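
For reference, a minimal reproducible sketch of this setup. It assumes the attempt selected the existing id column (the snippet above selects "count", which does not exist yet). Grouping by both date and id counts the rows of each (date, id) pair, which would explain the 1s for id 2.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "date": [2000, 2001, 2002] * 3,
    "id": [1, 1, 1, 2, 2, 2, 1, 1, 1],
})

# Assumption: select the existing "id" column before transforming.
# Each row receives the size of its own (date, id) group.
per_pair = df.groupby(["date", "id"])["id"].transform("count")
print(per_pair.tolist())  # [2, 2, 2, 1, 1, 1, 2, 2, 2]
```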

Answered by BrokenRobot

Your original code needs only a small fix: select an existing column such as id before calling transform, since the count column does not exist yet.

df['count'] = df.groupby(['date', 'id'])['id'].transform('count')

For example, grouping on two columns and assigning the transformed count to a new column:

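import numpy as np
import pandas as pd
# Random integers in [0, 3), so the exact values differ on every run
df = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=['A', 'B', 'C'])
df['count'] = df.groupby(['A', 'B'])['C'].transform('count')
print(df)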

Resulting in:

   A  B  C  count
0  1  2  0      1
1  0  0  0      2
2  2  0  2      4
3  2  0  1      4
4  2  0  2      4
5  2  0  1      4
6  0  0  0      2
7  2  2  0      3
8  2  2  1      3
9  2  2  2      3

Answered by YOBEN_S

You can use duplicated:

df.groupby('date').id.transform(lambda x : x.duplicated(keep=False).sum())
Out[208]: 
0    2
1    2
2    2
3    2
4    2
5    2
6    2
7    2
8    2
Name: id, dtype: int64
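
A self-contained sketch of the same idea on the sample data from the question, assigning the result to a count column. Within each date group, duplicated(keep=False) marks every row whose id occurs more than once, and the sum counts those rows, so every row of a date gets the same value.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "date": [2000, 2001, 2002] * 3,
    "id": [1, 1, 1, 2, 2, 2, 1, 1, 1],
})

# Per date: how many rows carry an id that appears more than once
df["count"] = df.groupby("date")["id"].transform(
    lambda x: x.duplicated(keep=False).sum()
)
print(df["count"].tolist())  # [2, 2, 2, 2, 2, 2, 2, 2, 2]
```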

Answered by Riley J. Graham

Another simple solution: combine the date and ID columns into a third column, "date"+"ID". You can then use count to find the number of duplicates for each entry in that new column.

>>> dateID = [20001,20011,20021,20002,20012,20022,...]
>>> dateID.count(20001)
2
>>> dateID.count(20002)
2

You can count the occurrences of each item in dateID using:

[[x,dateID.count(x)] for x in set(dateID)]

Perhaps even easier is to use Counter:

>>> from collections import Counter
>>> dateID = ['x', 'y', 'z', 'x', 'y', 'z', 'z']
>>> Counter(dateID)
Counter({'z': 3, 'x': 2, 'y': 2})
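
As a rough sketch, the combined-key idea could be applied to the DataFrame from the question as follows. The key format (date and id joined by an underscore) and the column name date_id are illustrative choices, not part of the answer. Note that this counts occurrences of each (date, id) pair, so rows with id 2 get 1 here.

```python
from collections import Counter

import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "date": [2000, 2001, 2002] * 3,
    "id": [1, 1, 1, 2, 2, 2, 1, 1, 1],
})

# Hypothetical combined key column, e.g. "2000_1"
df["date_id"] = df["date"].astype(str) + "_" + df["id"].astype(str)

# Count each key once, then map the counts back onto the rows
counts = Counter(df["date_id"])
df["count"] = df["date_id"].map(counts)
print(df["count"].tolist())  # [2, 2, 2, 1, 1, 1, 2, 2, 2]
```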