Python: counting duplicate values in a Pandas DataFrame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms, link to the original post, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33993592/
Counting duplicate values in Pandas DataFrame
Asked by tales
There must be an easy way to do this, but I was unable to find an elegant solution on SO or work it out myself.
I'm trying to count the number of duplicate values based on a set of columns in a DataFrame.
Example:
print df

     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
My workaround:
counts = dict()
for i, row in df.iterrows():
    key = (
        row['Longitude'],
        row['Latitude'],
        row['Crime type']
    )
    if counts.has_key(key):
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1
And I get the counts:
{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
(-0.111497, 51.518226, 'Burglary'): 2,
(-0.111497, 51.518226, 'Other theft'): 1,
(-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}
Aside from the fact that this code could also be improved (feel free to comment how), what would be the way to do it with Pandas?
For those interested, I'm working on a dataset from https://data.police.uk/
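As an aside on the loop itself, the explicit has_key/else branch can be avoided with collections.defaultdict; a minimal sketch under the same column names (illustrative only, the Pandas answers below are the idiomatic route):

from collections import defaultdict

counts = defaultdict(int)  # missing keys start at 0, so no membership test is needed
for _, row in df.iterrows():
    counts[(row['Longitude'], row['Latitude'], row['Crime type'])] += 1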
Accepted answer by jezrael
You can use groupby with size, then reset the index, renaming the resulting count column to count.
print df

     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person
df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print df
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1

print df['count']

0    1
1    2
2    1
3    1
Name: count, dtype: int64
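As a side note, on newer pandas (1.1 or later) the same table can be produced in one call with DataFrame.value_counts; a minimal sketch, assuming the column names from the example:

# requires pandas >= 1.1; counts unique (Longitude, Latitude, Crime type) rows directly
counts = df.value_counts(subset=['Longitude', 'Latitude', 'Crime type']).reset_index(name='count')

By default the result is sorted by count in descending order, so the row order can differ from the groupby version.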
Answered by jpp
An O(n) solution is possible via collections.Counter:
from collections import Counter

# the column names contain spaces, so use [] indexing rather than attribute access
c = Counter(zip(df['Longitude'], df['Latitude'], df['Crime type']))
Result:
Counter({(-0.113767, 51.517372, 'Theft from the person'): 1,
         (-0.111497, 51.518226, 'Burglary'): 2,
         (-0.111497, 51.518226, 'Other theft'): 1,
         (-0.106453, 51.518207, 'Bicycle theft'): 1})
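If a DataFrame like the one in the accepted answer is wanted afterwards, the Counter can be unpacked back into one. A minimal sketch; the column names simply mirror the example data:

import pandas as pd

# rebuild a flat DataFrame from the (Longitude, Latitude, Crime type) -> count mapping
counts_df = pd.DataFrame(
    [(lon, lat, crime, n) for (lon, lat, crime), n in c.items()],
    columns=['Longitude', 'Latitude', 'Crime type', 'count'],
)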
Answered by Alexander
You can group on Longitude and Latitude, and then use value_counts on the Crime type column.
df.groupby(['Longitude', 'Latitude'])['Crime type'].value_counts().to_frame('count')
                                           count
Longitude  Latitude   Crime type
-0.113767  51.517372  Theft from the person     1
-0.111497  51.518226  Burglary                  2
                      Other theft               1
-0.106453  51.518207  Bicycle theft             1
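If a flat DataFrame is preferred to the MultiIndex shown above, a reset_index can be chained on; a minimal sketch:

flat = (df.groupby(['Longitude', 'Latitude'])['Crime type']
          .value_counts()
          .to_frame('count')    # name the counts column before resetting the index
          .reset_index())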