Pandas: How to filter a DataFrame for items that occur multiple times

Question

Asked by Nickpick

I have a Pandas DataFrame that contains duplicate entries. Some items are listed two or three times. I would like to filter it so that it only shows items that are listed at least n times. In the final table, each item should be shown only once. The DataFrame contains 3 columns: [colA, colB, colC]. Only colB should be considered when determining whether an item is listed multiple times. Note: this is not drop_duplicates. It's the opposite: I would like to drop items that appear in the DataFrame fewer than n times.

The end result should list each item only once.

Answer 1

Answered by EdChum

You can use value_counts to get the count of each item, build a boolean mask from those counts, take the index of the entries that pass, and test membership against it using isin:

In [3]:
import pandas as pd
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df

Out[3]:
    a
0   0
1   0
2   0
3   1
4   2
5   2
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]

Out[8]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

So breaking the above down:

In [9]:
df['a'].value_counts() > 2

Out[9]:
3     True
4     True
0     True
2    False
1    False
Name: a, dtype: bool

In [10]:
# apply the boolean mask to keep only the counts greater than 2
df['a'].value_counts()[df['a'].value_counts()>2]

Out[10]:
3    6
4    3
0    3
Name: a, dtype: int64

In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index

Out[11]:
Int64Index([3, 4, 0], dtype='int64')
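
The steps above call value_counts twice. A minimal sketch of the same approach that stores the counts once is shown here; the variable names counts, frequent and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Count the occurrences of each distinct value once and reuse the result.
counts = df['a'].value_counts()

# Keep only the values that occur more than twice; the values live in the index.
frequent = counts[counts > 2].index

# Select the rows whose value is among the frequent ones.
result = df[df['a'].isin(frequent)]
print(result)
```

Storing the counts in a variable also makes the threshold easier to change in one place.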

EDIT

As user @JonClements suggested, a simpler and faster method is to groupby on the column of interest and filter it:

In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)

Out[4]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4
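
A possible alternative, not taken from the original answer, is to compute each row's group size with transform('size') and use it as a boolean mask. This avoids calling a Python lambda once per group, which may matter when there are many distinct values. The names sizes and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# For every row, the number of rows that share its value in column 'a'.
sizes = df.groupby('a')['a'].transform('size')

# Keep the rows whose value occurs more than twice.
result = df[sizes > 2]
print(result)
```

The mask is aligned to the original index, so the selected rows keep their original order and labels.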

EDIT 2

To get just a single entry for each repeated item, call drop_duplicates and pass the param subset='a':

In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')

Out[2]:
    a
0   0
6   3
12  4
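
Applied to the layout described in the question, the same idea might look like the sketch below. Only the column names colA, colB and colC come from the question; the sample values and the choice n = 2 are made up for illustration. The question asks for "at least n times", so the comparison here is >= rather than >.

```python
import pandas as pd

# Hypothetical data: only the column names are taken from the question.
df = pd.DataFrame({
    'colA': [1, 2, 3, 4, 5, 6],
    'colB': ['x', 'x', 'y', 'z', 'z', 'z'],
    'colC': [10, 20, 30, 40, 50, 60],
})
n = 2  # assumed threshold

# Number of rows sharing each row's colB value.
counts = df.groupby('colB')['colB'].transform('size')

# Keep items listed at least n times, then keep one row per item.
result = df[counts >= n].drop_duplicates(subset='colB')
print(result)
```

drop_duplicates keeps the first row of each item by default, so the colA and colC values shown are those of the first occurrence.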

Answer 2

Answered by Alexander

First, some sample data:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a'] * 4 + ['b'] * 3 + ['c'] * 2, 'B': [1] * 9})
>>> df
   A  B
0  a  1
1  a  1
2  a  1
3  a  1
4  b  1
5  b  1
6  b  1
7  c  1
8  c  1

Next, let's construct a list of values where the count exceeds some threshold:

from collections import Counter

threshold_count = 2
c = Counter(df.A)
# Counter.items() replaces the Python 2-only iteritems()
relevant_items = [k for k, count in c.items() if count > threshold_count]

Now, just use .loc to extract the relevant items:

>>> df.loc[df.A.isin(relevant_items), :]
   A  B
0  a  1
1  a  1
2  a  1
3  a  1
4  b  1
5  b  1
6  b  1
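
The result above still contains every repeated row. To match the question's requirement that each item appears only once, one option is to chain drop_duplicates onto the Counter-based selection. This is a sketch under that assumption, written for Python 3, and the name result is illustrative.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'A': ['a'] * 4 + ['b'] * 3 + ['c'] * 2, 'B': [1] * 9})

threshold_count = 2
c = Counter(df.A)

# Values of column A that occur more than threshold_count times.
relevant_items = [k for k, count in c.items() if count > threshold_count]

# Select the matching rows, then keep only the first row for each value of A.
result = df.loc[df.A.isin(relevant_items), :].drop_duplicates(subset='A')
print(result)
```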
