Python Pandas: remove entries based on the number of occurrences

Question

Asked by sashkello

I'm trying to remove entries from a data frame which occur fewer than 100 times. The data frame data looks like this:

Now I count the number of tag occurrences like this:

bytag = data.groupby('tag').aggregate(np.count_nonzero)

But then I can't figure out how to remove those entries which have low count...

Answer 1

Answered by Andy Hayden

New in pandas 0.12: groupby objects have a filter method, which allows you to do this type of operation:

In [11]: g = data.groupby('tag')

In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
Out[12]:
   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62

The function (the first argument of filter) is applied to each group (subframe), and the results include elements of the original DataFrame belonging to groups which evaluated to True.

Note: in 0.12 the ordering of the result differs from that of the original DataFrame; this was fixed in 0.13+:

In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
Out[21]: 
   pid  tag
1    1   45
4    2   45
2    1   62
7    3   62
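
As a minimal sketch of how the filter call extends to an arbitrary cutoff, the version below uses the sample frame from another answer on this page. The name min_count is hypothetical; it is set to 2 so the small sample keeps some rows, whereas the question would use 100.

```python
import pandas as pd

# Sample frame copied from another answer on this page.
data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})

# Hypothetical cutoff name; the question would use 100 here.
min_count = 2

# Keep only the rows whose tag group contains at least min_count rows.
kept = data.groupby("tag").filter(lambda group: len(group) >= min_count)
print(kept)
```

Because the lambda is called once per group, this reads clearly but can be slow when there are many distinct tags.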

Answer 2

Answered by unutbu

Edit: Thanks to @WesMcKinney for showing this much more direct way:

data[data.groupby('tag').pid.transform(len) > 1]

import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

bytag = data.groupby('tag').aggregate(np.count_nonzero)
tags = bytag[bytag.pid >= 2].index
print(data[data['tag'].isin(tags)])

yields

   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62
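
One caveat with np.count_nonzero is that it counts non-zero values rather than rows, so a group whose pid values include 0 would be undercounted. A hedged alternative, assuming a pandas version that accepts the string "size" in transform, counts rows directly. The name min_count is hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})

# Hypothetical cutoff name; the question would use 100 here.
min_count = 2

# transform("size") broadcasts each group's row count back onto its rows,
# so the result lines up with the original index.
group_sizes = data.groupby("tag")["pid"].transform("size")

# Boolean mask keeps the rows that belong to sufficiently large groups.
kept = data[group_sizes >= min_count]
print(kept)
```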

Answer 3

Answered by locojay

df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1),(2,2,)], columns=['col1', 'col2'])

In [36]: df
Out[36]: 
   col1  col2
0     1     2
1     1     3
2     1     4
3     2     1
4     2     2

gp = df.groupby('col1').aggregate(np.count_nonzero)

In [38]: gp
Out[38]: 
      col2
col1      
1        3
2        2

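Let's get the groups where the count is greater than 2:

tf = gp[gp.col2 > 2].reset_index()
# isin matches every qualifying col1 value; comparing the two Series with ==
# fails in recent pandas because their indexes differ
df[df.col1.isin(tf.col1)]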

Out[41]: 
   col1  col2
0     1     2
1     1     3
2     1     4
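
An element-wise == between two Series expects both sides to be identically labeled, so a membership test is the more robust way to select rows once more than one group qualifies. The sketch below assumes an extra illustrative row (3, 5), which is not in the original data, and uses a cutoff of 2 so that two groups pass.

```python
import pandas as pd

# Original rows plus one illustrative row (3, 5) that is not in the answer.
df = pd.DataFrame(
    [(1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 5)],
    columns=["col1", "col2"],
)

# size() counts the rows in each group, regardless of the values they hold.
counts = df.groupby("col1").size()

# Index of the groups that meet the (illustrative) cutoff of 2 rows.
frequent = counts[counts >= 2].index

# isin keeps every row whose col1 value is one of the frequent groups.
result = df[df["col1"].isin(frequent)]
print(result)
```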

Answer 4

Answered by zbinsd

Here are run times for a couple of the solutions posted here, along with one that was not posted (using value_counts()) and that is much faster than the others:

Create the data:

import pandas as pd
import numpy as np

# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})

# Show that some users occur only once
print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))

Output:

171 users only occur once in dataset

Time a few different ways of removing users with only one entry. These were run in separate cells in a Jupyter Notebook:

%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)

%%timeit
df[df.groupby('uid').uid.transform(len) > 1]

%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()

These gave the following outputs:

10 loops, best of 3: 46.2 ms per loop
10 loops, best of 3: 30.1 ms per loop
1000 loops, best of 3: 1.27 ms per loop
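
For reuse, the value_counts() idea can be wrapped in a small helper. This is only a sketch: drop_rare and min_count are hypothetical names, and it maps each row to its count with Series.map rather than using isin, which is an equivalent way of building the mask. No timings are claimed for it.

```python
import numpy as np
import pandas as pd


def drop_rare(frame, column, min_count):
    """Return the rows of frame whose value in column occurs at least min_count times."""
    # value_counts() gives one count per distinct value, indexed by that value.
    counts = frame[column].value_counts()
    # map() looks up each row's value in counts, producing a per-row count.
    return frame[frame[column].map(counts) >= min_count]


# Same synthetic data as in the answer above.
np.random.seed(42)
df = pd.DataFrame({"uid": np.random.randint(0, 500, 500)})

kept = drop_rare(df, "uid", min_count=2)
print(len(df), "rows before,", len(kept), "rows after")
```

Both steps are vectorized, which is a plausible reason this family of approaches avoids the per-group Python calls that filter and transform(len) make.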
