Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/27424178/

Time: 2020-08-19 01:47:21  Source: igfitidea

Faster way to remove outliers by group in large pandas DataFrame

python pandas

Asked by ytsaig

I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.

Here's the setup I'm currently using:

def winsorize_series(s):
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    return df.apply(winsorize_series, axis=0)

and then, with my DataFrame called features and indexed by DATE, I can do

grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?
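One possible way past the point where the question got stuck, sketched under assumptions: compute each bound for every column of every group with a single `groupby(...).quantile` call, expand the per-group bounds back to one row per original row, and clip the whole array at once with `np.clip`. The helper name `clip_by_group`, its parameters, and the small random `features` frame are hypothetical; the sketch assumes all columns are numeric and has not been benchmarked against the timings on this page.

```python
import numpy as np
import pandas as pd


def clip_by_group(df, level, lower_q=0.05, upper_q=0.95):
    """Clip every column to its per-group quantile range (hypothetical helper)."""
    grouped = df.groupby(level=level)
    # One quantile call per bound covers all columns of all groups.
    lower = grouped.quantile(lower_q)
    upper = grouped.quantile(upper_q)
    # Expand the per-group bounds so each original row gets its group's bounds.
    keys = df.index.get_level_values(level)
    lo = lower.reindex(keys).to_numpy()
    hi = upper.reindex(keys).to_numpy()
    # Element-wise clip: each cell is thresholded by its own group/column bound.
    clipped = np.clip(df.to_numpy(), lo, hi)
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)


# Small made-up frame shaped like the one in the question.
rng = np.random.default_rng(0)
dates = pd.date_range("2001-01-01", periods=5, freq="D").repeat(20)
features = pd.DataFrame(rng.random((100, 3)), index=dates)
features.index.names = ["DATE"]

result = clip_by_group(features, "DATE")
```

Because the bounds are laid out as arrays with the same shape as the data, no per-group or per-column Python function is called during the clipping step.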

Answered by unutbu

There is a winsorize function in scipy.stats.mstats which you might consider using. Note, however, that it returns slightly different values than winsorize_series:

In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0
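The difference in the two outputs above seems to come from how the bounds are chosen: the quantile-based version clips to an interpolated quantile, while `mstats.winsorize` replaces the extreme values with the nearest remaining observation. A minimal sketch on the same 20 values, using plain NumPy for the quantile side (the variable names are illustrative only):

```python
import numpy as np
from scipy.stats import mstats

x = np.arange(20, dtype=float)

# Quantile-based clipping: bounds are interpolated between observations.
q_lo, q_hi = np.quantile(x, [0.05, 0.95])
clipped = np.clip(x, q_lo, q_hi)

# Winsorizing: the lowest and highest 5% of values (one value each here)
# are replaced by the nearest value that is kept.
wins = mstats.winsorize(x, limits=[0.05, 0.05])

print(clipped.min(), clipped.max())
print(wins.min(), wins.max())
```

So the two approaches agree on which values get changed here, but not on what they are changed to.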


Using mstats.winsorize instead of winsorize_series may be (depending on N, M, P) about 1.5x faster:

import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')


In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop

Answered by mwolverine

I found a rather straightforward way to get this to work, using the transform method in pandas.

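from scipy.stats import mstats

# Fraction of values to winsorize at each tail. 0.05 for both matches the
# 5%/95% range in the question (assumed here; the original answer leaves
# lower_lim and upper_lim undefined).
lower_lim, upper_lim = 0.05, 0.05

def winsorize_series(group):
    return mstats.winsorize(group, limits=[lower_lim, upper_lim])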

grouped = features.groupby(level='DATE')
result = grouped.transform(winsorize_series)
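For readers who want the question's quantile-clipping behaviour rather than scipy's winsorizing, the same `transform` pattern can be sketched with `clip`. The helper `clip_to_quantiles` and the random `features` frame are hypothetical. pandas may hand the function either a whole group frame or a single column, so the sketch handles both cases explicitly.

```python
import numpy as np
import pandas as pd


def clip_to_quantiles(obj, lower_q=0.05, upper_q=0.95):
    """Clip a Series, or each column of a DataFrame, to its own quantiles."""
    lo, hi = obj.quantile(lower_q), obj.quantile(upper_q)
    if isinstance(obj, pd.DataFrame):
        # lo/hi are Series indexed by column label, so align along columns.
        return obj.clip(lower=lo, upper=hi, axis=1)
    # For a single column, lo/hi are scalars.
    return obj.clip(lower=lo, upper=hi)


# Small made-up frame indexed by DATE, as in the question.
rng = np.random.default_rng(0)
dates = pd.date_range("2001-01-01", periods=4, freq="D").repeat(25)
features = pd.DataFrame(rng.random((100, 3)), index=dates, columns=["a", "b", "c"])
features.index.names = ["DATE"]

grouped = features.groupby(level="DATE")
result = grouped.transform(clip_to_quantiles)
```

The result keeps the shape and index of `features`, which is the main convenience of `transform` over `apply` here.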

Answered by HonzaB

A good way to approach this is with vectorization, and for that I like to use np.where.

import pandas as pd
import numpy as np
from scipy.stats import mstats
import timeit

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]

    values = data.values
    out = np.where(values <= q_05, q_05,
                   np.where(values >= q_95, q_95, values))
    return out
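As far as the resulting values go, the nested `np.where` above clamps each element to the interval between the two quantiles, which NumPy also exposes directly as `np.clip`. A minimal sketch on the same 20-element series (variable names are illustrative; whether this is faster is not measured here):

```python
import numpy as np
import pandas as pd

data = pd.Series(range(20), dtype="float")
q_05, q_95 = data.quantile([0.05, 0.95])
values = data.values

# Nested np.where, as in the answer.
nested = np.where(values <= q_05, q_05,
                  np.where(values >= q_95, q_95, values))

# The same clamp written as a single call.
clipped = np.clip(values, q_05, q_95)

print(np.array_equal(nested, clipped))
```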

For comparison, I wrapped the scipy function in a function of its own:

def WinsorizeStats(data):
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out

But as you can see, even though my function is pretty fast, it's still well behind the SciPy implementation:

%timeit WinsorizeCustom(data)
#1000 loops, best of 3: 842 μs per loop

%timeit WinsorizeStats(data)
#1000 loops, best of 3: 212 μs per loop

If you are interested in reading more about speeding up pandas code, I would suggest "Optimization Pandas for speed" and "From Python to Numpy".

Answered by tnf

Here is a solution without using scipy.stats.mstats:

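def clip_series(s, lower, upper):
    # Clip one column to its own quantiles. A Series has only axis 0,
    # so no axis argument is passed (axis=1 is not valid for a Series).
    clipped = s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
    return clipped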

# Manage list of features to be winsorized
feature_list = list(features.columns)

for f in feature_list:
   features[f] = clip_series(features[f], 0.05, 0.95)
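The column loop above can also be written as one call, since `DataFrame.clip` accepts a Series of per-column bounds when `axis=1` is given. This is a sketch on made-up data (the `features` frame here is hypothetical), and like the loop it clips each column over the whole frame rather than within each DATE group.

```python
import numpy as np
import pandas as pd

# Made-up numeric frame standing in for `features`.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((200, 4)), columns=list("abcd"))

# One quantile call per bound returns a Series indexed by column name.
lower = features.quantile(0.05)
upper = features.quantile(0.95)

# axis=1 aligns the bounds with the columns, so each column gets its own range.
clipped = features.clip(lower=lower, upper=upper, axis=1)
```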