Python Pandas: Get values from column that appear more than X times

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/22320356/

Time: 2020-08-19 00:45:04  Source: igfitidea

Pandas: Get values from column that appear more than X times

python pandas

Asked by Robin

I have a data frame in pandas and would like to get all the values of a certain column that appear more than X times. I know this should be easy but somehow I am not getting anywhere with my current attempts.

Here is an example:

>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
>>> df2

    mi  uid
0    1   0
1    2   0
2    1   0
3    1   0

Now suppose I want to get all values from column "mi" that appear more than 2 times; the result should be:

>>> <fancy query>
array([1])

I have tried a couple of things with groupby and count, but I always end up with a Series of the values and their respective counts, and I don't know how to extract from it the values whose count is more than X:

>>> df2.groupby('mi').mi.count() > 2
mi
1      True
2     False
dtype: bool

But how can I use this now to get the values of mi that are true?

Any hints appreciated :)

Answered by juniper-

Or how about this:

Create the table:

>>> import pandas as pd
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])

Get the count of each value:

>>> vc = df2.mi.value_counts()
>>> print(vc)
1    3
2    1

Print out those that occur more than 2 times:

>>> print(vc[vc > 2].index[0])
1

Answered by nicolaskruchten

I use this:

 df2.mi.value_counts().reset_index(name="count").query("count > 5")["index"]

The part before query() gives me a data frame with two columns: index and count. query() then filters on count, and we pull out the values.
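
As a hedged aside: on pandas 2.0 and later, value_counts() names its index after the original column, so the column produced by reset_index() is no longer called "index" and the last lookup in the one-liner above can fail. A minimal sketch that avoids depending on that name is to label the index explicitly first. The label "value" is an arbitrary choice made here, and the threshold is set to 2 to match the example frame from the question.

```python
import pandas as pd

# Example frame from the question
df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})

# Name the index explicitly ("value" is a label chosen for this sketch)
# so the resulting column name does not depend on the pandas version.
counts = df2.mi.value_counts().rename_axis("value").reset_index(name="count")

# Keep only the values that occur more than 2 times
frequent = counts.query("count > 2")["value"]
print(frequent.tolist())  # [1]
```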

Answered by A.Kot

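from collections import Counter

# Count how often each value occurs in the "mi" column
counts = Counter(df2.mi)
# Keep the rows of df2 whose "mi" value occurs more than 2 times
df2[df2.mi.isin([key for key in counts if counts[key] > 2])]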
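
Note that the snippet above returns the matching rows of df2 rather than the values themselves. If only the values are wanted, as in the question, one possible sketch stays with Counter and skips the isin() step. The name frequent_values is made up for this illustration, and tolist() is used only so that the keys are plain Python ints.

```python
from collections import Counter

import pandas as pd

# Example frame from the question
df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})

# Count occurrences of each value in "mi" (converted to plain Python ints)
counts = Counter(df2.mi.tolist())

# Collect the values that occur more than 2 times
frequent_values = [value for value, n in counts.items() if n > 2]
print(frequent_values)  # [1]
```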

Answered by branwen85

I found a problem with the solution provided by @juniper-: if more than one value fulfils your condition, only the first one will be printed out. For example:

>>> check=pd.DataFrame({'YOB':[1991,1992,1993,1991,1995,1994,1992,1991]})

>>> vc = check.YOB.value_counts()
>>> vc
1991    3
1992    2
1995    1
1994    1
1993    1
Name: YOB, dtype: int64

Let's say we want to find years which appear more than once:

>>> vc[vc > 1]
1991    3
1992    2
Name: YOB, dtype: int64

If we now want to access the actual value, we need to do:

>>> vc[vc > 1].index.tolist()
[1991, 1992]

Rather than indexing with [0], which will print out the first value only:

>>> vc[vc > 1].index[0]
1991

Answered by mbh86

Similar to @nicolaskruchten, but a slightly shorter version:

 df2.mi.value_counts().loc[lambda x: x>5].reset_index()['index']

And if you don't need the result as a Series, just do this:

df2.mi.value_counts().loc[lambda x: x>5].index
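
With the threshold of 5 used above, the four-row frame from the question yields nothing. A small sketch of the same idea applied to that frame, assuming a threshold of 2 as in the question and converting the index to a NumPy array so the result has the array form the asker expected:

```python
import pandas as pd

# Example frame from the question
df2 = pd.DataFrame({"uid": [0, 0, 0, 0], "mi": [1, 2, 1, 1]})

# Filter the counts with a callable, then take the index labels as an array
values = df2.mi.value_counts().loc[lambda x: x > 2].index.to_numpy()
print(values)  # [1]
```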