Pandas: How to filter for items that occur more than once in a dataframe
Source: http://stackoverflow.com/questions/32918506/
License: this content is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Nickpick
I have a Pandas DataFrame that contains duplicate entries; some items are listed two or three times. I would like to filter it so that it only shows items that are listed at least n times, and in the final table each item should be shown only once. The DataFrame contains 3 columns: [colA, colB, colC]. Only colB should be considered when determining whether an item is listed multiple times. Note: this is not drop_duplicates. It is the opposite: I would like to drop items that appear in the DataFrame fewer than n times.
The end result should list each item only once.
Answer by EdChum
You can use value_counts to get the item counts, build a boolean mask from them, and then pass the index of the matching counts to isin to test membership:
In [3]:
import pandas as pd
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df
Out[3]:
a
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
So breaking the above down:
In [9]:
df['a'].value_counts() > 2
Out[9]:
3 True
4 True
0 True
2 False
1 False
Name: a, dtype: bool
In [10]:
# construct a boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3 6
4 3
0 3
Name: a, dtype: int64
In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]:
Int64Index([3, 4, 0], dtype='int64')
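The one-liner above calls value_counts twice. A minimal sketch of the same steps with the counts computed once and stored in named variables (the names counts, frequent_values and result are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Count each distinct value once and reuse the result.
counts = df['a'].value_counts()

# The index of the filtered counts holds the values that occur more than twice.
frequent_values = counts[counts > 2].index

# Keep only the rows whose value is one of those frequent values.
result = df[df['a'].isin(frequent_values)]
print(result)
```

This should select the same rows as Out[8], while reading more like the step-by-step breakdown.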
EDIT
As user @JonClements suggested, a simpler and faster method would be to groupby on the column of interest and filter it:
In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
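As a quick sanity check, the groupby/filter version and the value_counts/isin version can be compared on the same sample frame. This is only a sketch; the variable names by_group, by_mask and same are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Approach 1: keep whole groups that contain more than two rows.
by_group = df.groupby('a').filter(lambda x: len(x) > 2)

# Approach 2: keep rows whose value has a count above two.
counts = df['a'].value_counts()
by_mask = df[df['a'].isin(counts[counts > 2].index)]

# DataFrame.equals compares values, index and dtypes.
same = by_group.equals(by_mask)
print(same)
```

If the two results ever differed on your own data, missing values in the grouping column would be the first thing to check, since groupby drops NaN keys by default.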
EDIT 2
To get just a single entry for each repeated item, call drop_duplicates and pass the param subset='a':
In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
a
0 0
6 3
12 4
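Applied to the layout described in the question, the same chain might look like the sketch below. The column names colA, colB and colC come from the question; the sample values and n = 2 are assumptions made up for this example. Because the question asks for "at least n times", the comparison uses >= rather than >.

```python
import pandas as pd

# Hypothetical data using the three column names from the question.
df = pd.DataFrame({
    'colA': [1, 2, 3, 4, 5, 6],
    'colB': ['x', 'x', 'x', 'y', 'y', 'z'],
    'colC': ['p', 'q', 'r', 's', 't', 'u'],
})
n = 2  # assumed threshold: minimum number of occurrences in colB

result = (
    df.groupby('colB')
      .filter(lambda g: len(g) >= n)   # keep items listed at least n times
      .drop_duplicates(subset='colB')  # then keep the first row of each item
)
print(result)
```

Note that drop_duplicates keeps the first row of each item by default, so the colA and colC values shown are those of the first occurrence.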
Answer by Alexander
First, some sample data:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a'] * 4 + ['b'] * 3 + ['c'] * 2, 'B': [1] * 9})
>>> df
A B
0 a 1
1 a 1
2 a 1
3 a 1
4 b 1
5 b 1
6 b 1
7 c 1
8 c 1
Next, let's construct a list of values where the count exceeds some threshold:
from collections import Counter
threshold_count = 2
c = Counter(df.A)
# Counter.iteritems() exists only in Python 2; use items() in Python 3.
relevant_items = [k for k, count in c.items() if count > threshold_count]
Now, just use .loc to extract the relevant items:
>>> df.loc[df.A.isin(relevant_items), :]
A B
0 a 1
1 a 1
2 a 1
3 a 1
4 b 1
5 b 1
6 b 1
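The Counter approach can be put together as one runnable Python 3 script. The final drop_duplicates step is not part of this answer; it is borrowed from the other answer here to satisfy the question's requirement that each item appears only once. The names counts, filtered and unique_rows are illustrative.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'A': ['a'] * 4 + ['b'] * 3 + ['c'] * 2, 'B': [1] * 9})
threshold_count = 2

# Count occurrences of each value in column A.
counts = Counter(df['A'])

# Values whose count exceeds the threshold.
relevant_items = [k for k, count in counts.items() if count > threshold_count]

# Rows belonging to those values.
filtered = df.loc[df['A'].isin(relevant_items), :]

# Optional extra step (assumption): one row per remaining item.
unique_rows = filtered.drop_duplicates(subset='A')
print(unique_rows)
```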

