Python Pandas: Get values from a column that appear more than X times

Question

Asked by Robin

I have a data frame in pandas and would like to get all the values of a certain column that appear more than X times. I know this should be easy but somehow I am not getting anywhere with my current attempts.

Here is an example:

>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])
>>> df2

    mi  uid
0    1   0
1    2   0
2    1   0
3    1   0

Now suppose I want to get all values from column "mi" that appear more than 2 times. The result should be:

>>> <fancy query>
array([1])

I have tried a couple of things with groupby and count, but I always end up with a Series of the values and their respective counts, and I don't know how to extract from it the values whose count is more than X:

>>> df2.groupby('mi').mi.count() > 2
mi
1      True
2     False
dtype: bool

But how can I now use this to get the values of mi for which the result is True?

Any hints appreciated :)

Answer 1

Answered by juniper-

Or how about this:

Create the table:

>>> import pandas as pd
>>> df2 = pd.DataFrame([{"uid": 0, "mi":1}, {"uid": 0, "mi":2}, {"uid": 0, "mi":1}, {"uid": 0, "mi":1}])

Get the number of occurrences of each value:

>>> vc = df2.mi.value_counts()
>>> print(vc)
1    3
2    1

Print out those that occur more than 2 times:

>>> print(vc[vc > 2].index[0])
1
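
The boolean Series from the question can be finished off in a similar way. The following is a minimal sketch, not part of the original answer: it indexes the mask with itself to keep only the True entries and then reads the surviving index. The variable names mask and result are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame([{"uid": 0, "mi": 1}, {"uid": 0, "mi": 2},
                    {"uid": 0, "mi": 1}, {"uid": 0, "mi": 1}])

# Boolean Series indexed by the distinct values of "mi", as in the question.
mask = df2.groupby("mi")["mi"].count() > 2

# Indexing the mask with itself keeps only the True entries;
# the index of what remains holds the values of "mi" we are after.
result = mask[mask].index.to_numpy()
print(result)  # [1]
```

Unlike index[0], this keeps every value that passes the threshold, as a NumPy array like the one the question asks for.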

Answer 2

Answered by nicolaskruchten

I use this:

 df2.mi.value_counts().reset_index(name="count").query("count > 5")["index"]

The part before query() gives me a data frame with two columns: index and count. The query() call filters on count, and then we pull out the values.
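
One caveat when reusing this: since pandas 2.0, value_counts() names its index after the original column, so after reset_index() the first column is called "mi" rather than "index". The sketch below is one possible way to make the column name explicit so that the same code runs on both older and newer versions. The rename_axis("mi") step and the threshold of 2 (taken from the question) are additions for illustration, not part of the original answer.

```python
import pandas as pd

df2 = pd.DataFrame([{"uid": 0, "mi": 1}, {"uid": 0, "mi": 2},
                    {"uid": 0, "mi": 1}, {"uid": 0, "mi": 1}])

counts = (
    df2["mi"]
    .value_counts()
    .rename_axis("mi")           # name the index explicitly, whatever the pandas version
    .reset_index(name="count")   # data frame with columns "mi" and "count"
)

# Filter on the count column, then pull out the matching values.
frequent = counts.query("count > 2")["mi"]
print(frequent.tolist())  # [1]
```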

Answer 3

Answered by A.Kot

from collections import Counter

counts = Counter(df2.mi)
df2[df2.mi.isin([key for key in counts if counts[key] > 2])]
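
Note that the expression above returns the matching rows of df2 rather than the distinct values. As a rough sketch (the names frequent_values and rows are illustrative), the values themselves can be read straight from the Counter, and the same row filter can be written with groupby/transform if you would rather stay inside pandas:

```python
from collections import Counter

import pandas as pd

df2 = pd.DataFrame([{"uid": 0, "mi": 1}, {"uid": 0, "mi": 2},
                    {"uid": 0, "mi": 1}, {"uid": 0, "mi": 1}])

# Count plain Python ints so the result prints cleanly.
counts = Counter(df2["mi"].tolist())

# Distinct values that occur more than 2 times.
frequent_values = [value for value, n in counts.items() if n > 2]
print(frequent_values)  # [1]

# Pandas-only row filter: transform("size") gives every row
# the size of the group its "mi" value belongs to.
rows = df2[df2.groupby("mi")["mi"].transform("size") > 2]
print(rows.index.tolist())  # [0, 2, 3]
```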

Answer 4

Answered by branwen85

I found a problem with the solution provided by @juniper-: if more than one value fulfils your condition, only the first one is printed. For example:

>>> check = pd.DataFrame({'YOB': [1991, 1992, 1993, 1991, 1995, 1994, 1992, 1991]})

>>> vc = check.YOB.value_counts()
>>> vc
1991    3
1992    2
1995    1
1994    1
1993    1
Name: YOB, dtype: int64

Let's say we want to find years which appear more than once:

>>> vc[vc > 1]
1991    3
1992    2
Name: YOB, dtype: int64

If we now want to access the actual values, we need to do:

>>> vc[vc > 1].index.tolist()
[1991, 1992]

Rather than indexing with [0], which returns the first value only:

>>> vc[vc > 1].index[0]
1991
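
The steps above can be wrapped in a small helper. This is only a sketch: the function name values_appearing_more_than and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd


def values_appearing_more_than(series, x):
    """Return the distinct values of `series` that occur more than `x` times."""
    vc = series.value_counts()          # counts, sorted from most to least frequent
    return vc[vc > x].index.tolist()    # keep every value above the threshold


check = pd.DataFrame({"YOB": [1991, 1992, 1993, 1991, 1995, 1994, 1992, 1991]})

print(values_appearing_more_than(check["YOB"], 1))  # [1991, 1992]
print(values_appearing_more_than(check["YOB"], 2))  # [1991]
```

Returning a list keeps all matching values and gives an empty list, rather than an IndexError, when nothing exceeds the threshold.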

Answer 5

Answered by mbh86

Similar to @nicolaskruchten's answer, but slightly shorter:

 df2.mi.value_counts().loc[lambda x: x>5].reset_index()['index']

And if you don't need the result as a Series, just do this:

df2.mi.value_counts().loc[lambda x: x>5].index
