Python pandas: exclude rows below a certain frequency count

Question

Asked by Wes Field

I have a pandas DataFrame that looks like this:

r vals    positions
1.2       1
1.8       2
2.3       1
1.8       1
2.1       3
2.0       3
1.9       1
...       ...

I would like to filter out all rows whose position does not appear at least 20 times. I have seen something like this:

g=df.groupby('positions')
g.filter(lambda x: len(x) > 20)

But this does not seem to work, and I do not understand how to get the rows of the original DataFrame back from it. Thanks in advance for the help.

Answer 1

Answered by EdChum

On your limited dataset, the following works:

In [125]:
df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)

Out[125]:
0    1.2
2    2.3
3    1.8
6    1.9
Name: r vals, dtype: float64

You can assign the result of this filter and use it with isin to filter your original df:

In [129]:
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
df[df['r vals'].isin(filtered)]

Out[129]:
   r vals  positions
0     1.2          1
1     1.8          2
2     2.3          1
3     1.8          1
6     1.9          1

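You just need to change 3 to 20 in your case. Note that isin matches on the values of r vals, so row 1 (position 2) is kept in the output above because its value 1.8 also occurs in position 1.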
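The value-matching behaviour of isin can be checked on the sample data from the question. The sketch below rebuilds that data and contrasts matching on values with selecting by the index labels that survive the filter. The variable names by_value and by_index are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# "r vals" entries that belong to positions occurring at least 3 times.
filtered = df.groupby("positions")["r vals"].filter(lambda x: len(x) >= 3)

# Matching on values also keeps row 1 (position 2), because 1.8 occurs in position 1 too.
by_value = df[df["r vals"].isin(filtered)]

# Selecting by the surviving index labels keeps only rows from the frequent positions.
by_index = df.loc[filtered.index]
```

Selecting by index relies on the filtered Series keeping the original row labels, which it does in the output shown above.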

Another approach is to use value_counts to create an aggregate Series, which we can then use to filter your df:

In [136]:
counts = df['positions'].value_counts()
counts

Out[136]:
1    4
3    2
2    1
dtype: int64

In [137]:
counts[counts > 3]

Out[137]:
1    4
dtype: int64

In [135]:
df[df['positions'].isin(counts[counts > 3].index)]

Out[135]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
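A variation on the same idea, sketched here as an alternative rather than as part of the original answer, is to map each row's position to its count and build the boolean mask directly. The names row_counts and kept are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# Number of occurrences of each position value.
counts = df["positions"].value_counts()

# Look up the count for every row's position; the result is aligned with df's index.
row_counts = df["positions"].map(counts)

# Keep rows whose position occurs more than 3 times, as in the example above.
kept = df[row_counts > 3]
```

This gives the same rows as the isin version and leaves the per-row counts available if they are needed later.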

EDIT

If you want to filter the groupby object on the DataFrame rather than on a Series, you can call filter on the groupby object directly:

In [139]:
filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
filtered

Out[139]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
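For the threshold in the question, "at least 20 times" corresponds to >= 20 rather than the > 20 used in the question's snippet. The sketch below uses made-up data (the group sizes and the placeholder r vals are assumptions) to show the difference, and that the result keeps the original columns and index labels.

```python
import pandas as pd

# Hypothetical data: position 1 occurs 20 times, 2 occurs 21 times, 3 occurs 5 times.
df = pd.DataFrame({"positions": [1] * 20 + [2] * 21 + [3] * 5})
df["r vals"] = list(range(len(df)))  # placeholder values, not from the question

# ">= 20" keeps positions that occur exactly 20 times; "> 20" drops them.
at_least_20 = df.groupby("positions").filter(lambda g: len(g) >= 20)
more_than_20 = df.groupby("positions").filter(lambda g: len(g) > 20)

print(sorted(at_least_20["positions"].unique().tolist()))
print(sorted(more_than_20["positions"].unique().tolist()))
```

The filtered result is an ordinary DataFrame containing a subset of the original rows, so there is no separate step needed to get the original data back.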

Answer 2

Answered by Piotr Dabkowski

I like the following method:

import pandas as pd


def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
    """Filters the DataFrame based on the value frequency in the specified column.

    :param df: DataFrame to be filtered.
    :param column: Column name that should be frequency filtered.
    :param min_freq: Minimal value frequency for the row to be accepted.
    :return: Frequency filtered DataFrame.
    """
    # Frequencies of each value in the column.
    freq = df[column].value_counts()
    # Select frequent values. Value is in the index.
    frequent_values = freq[freq >= min_freq].index
    # Return only rows whose value frequency is at or above the threshold.
    return df[df[column].isin(frequent_values)]

It is much faster than the filter lambda method in the accepted answer, because value_counts and isin are vectorised and the Python overhead of calling a lambda once per group is avoided.
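One way to check this on your own machine is a small comparison like the sketch below. The synthetic data, its size, and the variable names are assumptions for illustration. Timings depend on hardware and pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def filter_by_freq(df, column, min_freq):
    # Same logic as the function above, repeated so this block runs on its own.
    freq = df[column].value_counts()
    return df[df[column].isin(freq[freq >= min_freq].index)]


# Synthetic data (an assumption, not from the question): 20,000 rows, 1,000 possible positions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "r vals": rng.random(20_000),
    "positions": rng.integers(0, 1_000, size=20_000),
})

vectorised = filter_by_freq(df, "positions", 20)
with_lambda = df.groupby("positions").filter(lambda g: len(g) >= 20)

# Both approaches are expected to select the same rows.
same_rows = vectorised.equals(with_lambda)

# Time three runs of each approach and print the totals.
t_vec = timeit.timeit(lambda: filter_by_freq(df, "positions", 20), number=3)
t_lam = timeit.timeit(
    lambda: df.groupby("positions").filter(lambda g: len(g) >= 20), number=3
)
print(f"same rows: {same_rows}")
print(f"value_counts + isin: {t_vec:.3f}s, groupby.filter: {t_lam:.3f}s")
```

The gap between the two is likely to grow with the number of distinct groups, since the lambda is called once per group.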

Answer 3

Answered by Paul Jtheitroademan

How about selecting all rows whose positions value is >= 20? (Note that this filters on the value itself, not on how often it occurs.)

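# The column in the question is named 'positions'.
mask = df['positions'] >= 20
# .ix was removed in pandas 1.0; .loc is the label-based replacement.
sel = df.loc[mask, :]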
