Python: counting duplicate values in a Pandas DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33993592/
Counting duplicate values in Pandas DataFrame
Asked by tales
There must be an easy way to do this, but I was unable to find an elegant solution on SO or work it out myself.
I'm trying to count the number of duplicate values based on a set of columns in a DataFrame.
Example:
print df

     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
My workaround:
counts = dict()
for i, row in df.iterrows():
    key = (
        row['Longitude'],
        row['Latitude'],
        row['Crime type']
    )
    if counts.has_key(key):
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1
And I get the counts:
{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
(-0.111497, 51.518226, 'Burglary'): 2,
(-0.111497, 51.518226, 'Other theft'): 1,
(-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}
Aside from the fact that this code could also be improved (feel free to comment how), what would be the way to do it with Pandas?
For those interested, I'm working on a dataset from https://data.police.uk/
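As an aside on the loop itself, the explicit has_key/else branch can be avoided with collections.defaultdict; a minimal sketch under the same column names (illustrative only, the Pandas answers below are the idiomatic route):

from collections import defaultdict

counts = defaultdict(int)  # missing keys start at 0, so no membership test is needed
for _, row in df.iterrows():
    counts[(row['Longitude'], row['Latitude'], row['Crime type'])] += 1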
Accepted answer by jezrael
You can use groupby with size, then reset the index, renaming the resulting count column to count.
print df

     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print df
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1

print df['count']

0    1
1    2
2    1
3    1
Name: count, dtype: int64
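As a side note, on newer pandas (1.1 or later) the same table can be produced in one call with DataFrame.value_counts; a minimal sketch, assuming the column names from the example:

# requires pandas >= 1.1; counts unique (Longitude, Latitude, Crime type) rows directly
counts = df.value_counts(subset=['Longitude', 'Latitude', 'Crime type']).reset_index(name='count')

By default the result is sorted by count in descending order, so the row order can differ from the groupby version.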
Answered by jpp
An O(n) solution is possible via collections.Counter:
from collections import Counter

# the column names contain spaces, so use [] indexing rather than attribute access
c = Counter(zip(df['Longitude'], df['Latitude'], df['Crime type']))
Result:
Counter({(-0.113767, 51.517372, 'Theft from the person'): 1,
         (-0.111497, 51.518226, 'Burglary'): 2,
         (-0.111497, 51.518226, 'Other theft'): 1,
         (-0.106453, 51.518207, 'Bicycle theft'): 1})
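If a DataFrame like the one in the accepted answer is wanted afterwards, the Counter can be unpacked back into one. A minimal sketch; the column names simply mirror the example data:

import pandas as pd

# rebuild a flat DataFrame from the (Longitude, Latitude, Crime type) -> count mapping
counts_df = pd.DataFrame(
    [(lon, lat, crime, n) for (lon, lat, crime), n in c.items()],
    columns=['Longitude', 'Latitude', 'Crime type', 'count'],
)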
Answered by Alexander
You can group on Longitude and Latitude, and then use value_counts on the Crime type column.
df.groupby(['Longitude', 'Latitude'])['Crime type'].value_counts().to_frame('count')
                                           count
Longitude  Latitude   Crime type
-0.113767  51.517372  Theft from the person     1
-0.111497  51.518226  Burglary                  2
                      Other theft               1
-0.106453  51.518207  Bicycle theft             1
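If a flat DataFrame is preferred to the MultiIndex shown above, a reset_index can be chained on; a minimal sketch:

flat = (df.groupby(['Longitude', 'Latitude'])['Crime type']
          .value_counts()
          .to_frame('count')    # name the counts column before resetting the index
          .reset_index())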