Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original Stack Overflow post: http://stackoverflow.com/questions/13446480/

Time: 2020-09-13 20:29:54  Source: igfitidea

Python Pandas: remove entries based on the number of occurrences

Tags: python, numpy, python-2.7, pandas

Asked by sashkello

I'm trying to remove entries from a data frame that occur fewer than 100 times. The data frame data looks like this:

pid   tag
1     23    
1     45
1     62
2     24
2     45
3     34
3     25
3     62

Now I count the number of tag occurrences like this:

bytag = data.groupby('tag').aggregate(np.count_nonzero)

But then I can't figure out how to remove the entries that have a low count.

Answered by Andy Hayden

New in pandas 0.12: groupby objects have a filter method, which lets you do this type of operation:

In [11]: g = data.groupby('tag')

In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
Out[12]:
   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62

The function (the first argument of filter) is applied to each group (sub-frame), and the result contains the rows of the original DataFrame that belong to groups for which the function returned True.
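
As a self-contained illustration of that behaviour, here is a minimal sketch that rebuilds the sample data from the question and keeps only the tags that occur often enough. The name min_count and its value of 2 are assumptions made for this small sample; the question's real threshold is 100.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

min_count = 2  # assumed threshold for the toy data; the question uses 100

# The lambda receives each group's sub-frame and returns one boolean per group.
# Rows of groups that return True are kept, the rest are dropped.
kept = data.groupby('tag').filter(lambda group: len(group) >= min_count)
print(kept)
```

Writing the condition as >= min_count matches the wording "remove entries that occur fewer than min_count times".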

Note: in 0.12 the rows come back in a different order than in the original DataFrame; this was fixed in 0.13+:

In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
Out[21]: 
   pid  tag
1    1   45
4    2   45
2    1   62
7    3   62

Answered by unutbu

Edit: Thanks to @WesMcKinney for showing this much more direct way:

data[data.groupby('tag').pid.transform(len) > 1]
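
To make the one-liner above easier to follow, this sketch splits it into steps on the question's sample data. The intermediate names counts and mask are introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

# transform(len) computes each group's length and broadcasts it back,
# so every row carries the size of the tag group it belongs to.
counts = data.groupby('tag').pid.transform(len)

# A boolean mask aligned with the original index selects the frequent rows.
mask = counts > 1
result = data[mask]
print(result)
```

Because the mask shares the original index, the surviving rows keep their original order and labels.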


import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

bytag = data.groupby('tag').aggregate(np.count_nonzero)
tags = bytag[bytag.pid >= 2].index
print(data[data['tag'].isin(tags)])

yields

   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62
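
One caveat worth noting: np.count_nonzero counts non-zero values rather than rows, so a group whose pid values are 0 could be undercounted. A possible variant, sketched here on hypothetical data that contains pid == 0, counts rows with size() instead:

```python
import pandas as pd

# Hypothetical data (not from the question) in which pid can be 0.
data = pd.DataFrame({
    'pid': [0, 0, 1, 1, 2],
    'tag': [45, 45, 62, 62, 70],
})

# size() counts the rows in each group, regardless of the values they hold.
sizes = data.groupby('tag').size()

# Keep the tags that occur at least twice, then select their rows.
tags = sizes[sizes >= 2].index
result = data[data['tag'].isin(tags)]
print(result)
```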

Answered by locojay

import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1), (2, 2)], columns=['col1', 'col2'])

In [36]: df
Out[36]: 
   col1  col2
0     1     2
1     1     3
2     1     4
3     2     1
4     2     2

gp = df.groupby('col1').aggregate(np.count_nonzero)

In [38]: gp
Out[38]: 
      col2
col1      
1        3
2        2

Let's keep the rows whose group count is > 2:

tf = gp[gp.col2 > 2].reset_index()
df[df.col1.isin(tf.col1)]  # isin works for any number of matching groups

Out[41]: 
   col1  col2
0     1     2
1     1     3
2     1     4

Answered by zbinsd

Here are run times for two of the solutions posted here, along with one that was not posted (using value_counts()) and that is much faster than the others:

Create the data:

import pandas as pd
import numpy as np

# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})

# Show that some users occur only once
print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))

Output:

171 users only occur once in dataset

Time a few different ways of removing users with only one entry. These were run in separate cells in a Jupyter Notebook:

%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)

%%timeit
df[df.groupby('uid').uid.transform(len) > 1]

%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()

These gave the following outputs:

10 loops, best of 3: 46.2 ms per loop
10 loops, best of 3: 30.1 ms per loop
1000 loops, best of 3: 1.27 ms per loop
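
For reuse, the value_counts() approach could be wrapped in a small function. This is only a sketch: the helper name drop_rare and its parameters are hypothetical, and the threshold of 2 is chosen to match the timing example rather than the question's 100.

```python
import numpy as np
import pandas as pd


def drop_rare(frame, column, min_count):
    """Hypothetical helper: keep rows whose value in `column`
    occurs at least `min_count` times in `frame`."""
    counts = frame[column].value_counts()
    frequent = counts[counts >= min_count].index
    return frame[frame[column].isin(frequent)]


# Same synthetic data as in the timing example.
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})

filtered = drop_rare(df, 'uid', min_count=2)
print(len(df), len(filtered))
```

Calling drop_rare(data, 'tag', min_count=100) would apply the same idea to the data frame from the question.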