Original question: http://stackoverflow.com/questions/30485151/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Python pandas: exclude rows below a certain frequency count
Asked by Wes Field
So I have a pandas DataFrame that looks like this:
r vals    positions
1.2       1
1.8       2
2.3       1
1.8       1
2.1       3
2.0       3
1.9       1
...       ...
I would like to filter out all rows whose position does not appear at least 20 times. I have seen something like this:
g=df.groupby('positions')
g.filter(lambda x: len(x) > 20)
but this does not seem to work and I do not understand how to get the original dataframe back from this. Thanks in advance for the help.
Answered by EdChum
On your limited dataset the following works:
In [125]:
df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
Out[125]:
0    1.2
2    2.3
3    1.8
6    1.9
Name: r vals, dtype: float64
You can assign the result of this filter and use it with isin to filter your original df:
In [129]:
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
df[df['r vals'].isin(filtered)]
Out[129]:
   r vals  positions
0     1.2          1
1     1.8          2
2     2.3          1
3     1.8          1
6     1.9          1
You just need to change 3 to 20 in your case.
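One caveat with the isin step: it compares values rather than row labels, which is why row 1 (1.8 at position 2) survives in the output above, since 1.8 also occurs at position 1. A minimal sketch of selecting by index instead, assuming only the seven sample rows shown in the question (by_index is a name introduced here for illustration):

```python
import pandas as pd

# Sample data copied from the question (the seven visible rows only).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# Series filter, as above: keeps the 'r vals' entries from groups with >= 3 rows.
filtered = df.groupby("positions")["r vals"].filter(lambda x: len(x) >= 3)

# Select by row label instead of by value, so a value shared with a
# smaller group (1.8 at position 2) is not picked up by accident.
by_index = df.loc[filtered.index]
print(by_index)
```

This works because the filtered Series keeps the original index labels of the rows it retains.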
Another approach would be to use value_counts to create an aggregate Series, which we can then use to filter your df:
In [136]:
counts = df['positions'].value_counts()
counts
Out[136]:
1    4
3    2
2    1
dtype: int64
In [137]:
counts[counts > 3]
Out[137]:
1    4
dtype: int64
In [135]:
df[df['positions'].isin(counts[counts > 3].index)]
Out[135]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
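The same counts can also be attached to every row with map, which gives a boolean mask directly. This is only a sketch on the sample rows from the question; min_count and row_counts are hypothetical names, and 3 stands in for the 20 you would use on the full data:

```python
import pandas as pd

# Sample data copied from the question (the seven visible rows only).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

min_count = 3  # hypothetical threshold; the question asks for 20

# Look up, for each row, how many times its position occurs in the column.
row_counts = df["positions"].map(df["positions"].value_counts())

# Keep the rows whose position is frequent enough.
result = df[row_counts >= min_count]
print(result)
```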
EDIT
If you want to filter the groupby object on the DataFrame rather than on a Series, then you can call filter on the groupby object directly:
In [139]:
filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
filtered
Out[139]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
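If the lambda turns out to be slow on many groups, one possible alternative (not part of the answer above) is to compute each row's group size with transform and compare it against the threshold. A small sketch on the sample rows, where group_sizes is a name introduced here:

```python
import pandas as pd

# Sample data copied from the question (the seven visible rows only).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# Size of the group each row belongs to, aligned with the original index.
group_sizes = df.groupby("positions")["positions"].transform("size")

# Keep rows from groups with at least 3 members (use 20 on the real data).
result = df[group_sizes >= 3]
print(result)
```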
Answered by Piotr Dabkowski
I like the following method:
import pandas as pd

def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
    """Filters the DataFrame based on the value frequency in the specified column.
    :param df: DataFrame to be filtered.
    :param column: Column name that should be frequency filtered.
    :param min_freq: Minimal value frequency for the row to be accepted.
    :return: Frequency filtered DataFrame.
    """
    # Frequencies of each value in the column.
    freq = df[column].value_counts()
    # Select frequent values. Value is in the index.
    frequent_values = freq[freq >= min_freq].index
    # Return only rows whose value frequency is at or above the threshold.
    return df[df[column].isin(frequent_values)]
It is much faster than the filter lambda method in the accepted answer because Python overhead is minimised.
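To check the speed claim on your own machine, something like the following could be used. It is only a rough sketch: the data is synthetic (random positions, fixed seed), the function names are made up for this comparison, and the timings will vary with pandas version and hardware, so no numbers are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only: 50,000 rows over ~2,000 positions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "r vals": rng.random(50_000),
    "positions": rng.integers(0, 2_000, size=50_000),
})
min_freq = 20


def with_groupby_filter(frame: pd.DataFrame) -> pd.DataFrame:
    # Calls a Python lambda once per group.
    return frame.groupby("positions").filter(lambda g: len(g) >= min_freq)


def with_value_counts(frame: pd.DataFrame) -> pd.DataFrame:
    # Counts once, then filters with a vectorised membership test.
    freq = frame["positions"].value_counts()
    return frame[frame["positions"].isin(freq[freq >= min_freq].index)]


# Both approaches should keep exactly the same rows.
same = with_groupby_filter(df).sort_index().equals(with_value_counts(df).sort_index())
print("same rows:", same)

for fn in (with_groupby_filter, with_value_counts):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds:.3f} s for 3 runs")
```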
Answered by Paul Jtheitroademan
How about selecting all rows whose positions value is >= 20?
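# Boolean mask: True where the positions value itself is at least 20
# (this tests the value, not how often it occurs).
mask = df['positions'] >= 20
sel = df.loc[mask, :]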
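Note that a mask on the column values answers a different question from the one asked. The sketch below contrasts the two on the sample rows from the question, with a hypothetical threshold of 3 in place of 20 so that both selections are non-empty; by_value and by_frequency are names introduced here:

```python
import pandas as pd

# Sample data copied from the question (the seven visible rows only).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

threshold = 3  # hypothetical; the question uses 20

# Value test: keeps rows whose positions value is itself >= threshold.
by_value = df.loc[df["positions"] >= threshold]

# Frequency test: keeps rows whose positions value occurs >= threshold times.
counts = df["positions"].value_counts()
by_frequency = df.loc[df["positions"].isin(counts[counts >= threshold].index)]

print(by_value)
print(by_frequency)
```

On this data the value test keeps the two rows at position 3, while the frequency test keeps the four rows at position 1.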

