Original question: http://stackoverflow.com/questions/22320356/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Pandas: Get values from column that appear more than X times
Asked by Robin
I have a data frame in pandas and would like to get all the values of a certain column that appear more than X times. I know this should be easy but somehow I am not getting anywhere with my current attempts.
Here is an example:
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
>>> df2
   mi  uid
0   1    0
1   2    0
2   1    0
3   1    0
Now suppose I want to get all values from column "mi" that appear more than 2 times. The result should be:
>>> <fancy query>
array([1])
I have tried a couple of things with groupby and count, but I always end up with a Series of the values and their respective counts, and I don't know how to extract from it the values whose count is greater than X:
>>> df2.groupby('mi').mi.count() > 2
mi
1     True
2    False
dtype: bool
But how can I now use this to get the values of mi for which the condition is True?
Any hints appreciated :)
Answered by juniper-
Or how about this:
Create the table:
>>> import pandas as pd
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
Get the count of each value:
>>> vc = df2.mi.value_counts()
>>> print(vc)
1 3
2 1
Print the value that occurs more than 2 times (index[0] picks the first match):
>>> print(vc[vc > 2].index[0])
1
Answered by nicolaskruchten
I use this:
df2.mi.value_counts().reset_index(name="count").query("count > 5")["index"]
The part before query() gives me a data frame with two columns: index and count. query() then filters on count, and we pull out the values.
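A version-tolerant variant of the chain above, as a minimal sketch. It assumes that newer pandas releases (2.0 and later, as far as I know) name the value_counts() index after the column, so a column called "index" may not exist after reset_index(). Naming the axis explicitly with rename_axis("mi") sidesteps that. The threshold 2 is taken from the question rather than the 5 used above, and the variable names are only illustrative.

```python
import pandas as pd

# Data from the question
df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})

# Name the index explicitly so the column label does not depend on the pandas version
counts = df2["mi"].value_counts().rename_axis("mi").reset_index(name="count")

# Keep the values whose count exceeds the threshold (2, as in the question)
frequent = counts.query("count > 2")["mi"]
print(frequent.tolist())
```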
Answered by A.Kot
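from collections import Counter

# Count how often each value of "mi" occurs
counts = Counter(df2.mi)
# Keep only the rows whose "mi" value occurs more than 2 times
df2[df2.mi.isin([key for key in counts if counts[key] > 2])]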
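Note that the snippet above returns the matching rows of df2, not the values themselves. Here is a small sketch that produces both, using the question's data; the names frequent_values and rows are only illustrative.

```python
from collections import Counter

import pandas as pd

# Data from the question
df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})

# Count how often each value of "mi" occurs
counts = Counter(df2["mi"])

# The distinct values that occur more than 2 times
frequent_values = [value for value, n in counts.items() if n > 2]

# The rows of df2 whose "mi" is one of those values
rows = df2[df2["mi"].isin(frequent_values)]

print(frequent_values)
print(len(rows))
```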
Answered by branwen85
I found a problem with the solution provided by @juniper-: if more than one value fulfills your condition, only the first one is printed. For example:
>>> check = pd.DataFrame({'YOB': [1991, 1992, 1993, 1991, 1995, 1994, 1992, 1991]})
>>> vc = check.YOB.value_counts()
>>> vc
1991    3
1992    2
1995    1
1994    1
1993    1
Name: YOB, dtype: int64
Let's say we want to find the years that appear more than once:
>>> vc[vc > 1]
1991    3
1992    2
Name: YOB, dtype: int64
If we now want to access the actual values, we need to do:
>>> vc[vc > 1].index.tolist()
[1991, 1992]
rather than indexing with [0], which returns only the first value:
>>> vc[vc > 1].index[0]
1991
Answered by mbh86
Similar to @nicolaskruchten's answer, but slightly shorter:
df2.mi.value_counts().loc[lambda x: x>5].reset_index()['index']
And if you don't need the result as a Series, just do this:
df2.mi.value_counts().loc[lambda x: x>5].index
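The approaches in this thread can be wrapped in a small helper. This is only a sketch: the function name values_appearing_more_than and its parameters are made up for illustration, and it reuses the question's df2 and the check frame from an earlier answer.

```python
import pandas as pd


def values_appearing_more_than(series: pd.Series, x: int) -> list:
    """Return the distinct values of `series` that occur more than `x` times."""
    counts = series.value_counts()
    # A callable passed to .loc receives the Series and returns a boolean mask
    return counts.loc[lambda c: c > x].index.tolist()


df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})
check = pd.DataFrame({"YOB": [1991, 1992, 1993, 1991, 1995, 1994, 1992, 1991]})

print(values_appearing_more_than(df2["mi"], 2))     # [1]
print(values_appearing_more_than(check["YOB"], 1))  # [1991, 1992]
```

Returning a list keeps every matching value, so the result does not silently drop matches the way index[0] does.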

