Python: Faster way to remove outliers by group in a large pandas DataFrame

Question

Asked by ytsaig

I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.

Here's the setup I'm currently using:

def winsorize_series(s):
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    return df.apply(winsorize_series, axis=0)

and then, with my DataFrame called features and indexed by DATE, I can do

grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?

Answer 1

Answered by unutbu

There is a winsorize function in scipy.stats.mstats which you might consider using. Note, however, that it returns slightly different values than winsorize_series:

In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0
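
The gap comes from how each method picks the replacement value. A minimal sketch, assuming the same 20-element Series as above: `Series.quantile` interpolates linearly between observations, while `mstats.winsorize` replaces each trimmed observation with the nearest observation that is kept. The variable names are only for illustration.

```python
import pandas as pd
from scipy.stats import mstats

s = pd.Series(range(20), dtype='float')

# Interpolated 5% / 95% quantiles, as used by winsorize_series
q = s.quantile([0.05, 0.95])
low_q, high_q = q.iloc[0], q.iloc[1]

# mstats.winsorize replaces the lowest and highest 5% of observations
# (one value at each end for 20 points) with the nearest remaining observation
w = mstats.winsorize(s.to_numpy(), limits=[0.05, 0.05])
low_w, high_w = float(w[0]), float(w[-1])

print(low_q, high_q)  # bounds from interpolated quantiles
print(low_w, high_w)  # bounds from observed values
```

So the two approaches agree only when the quantiles happen to fall exactly on observed values.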

Using mstats.winsorize instead of winsorize_series may be (depending on N, M, P) about 1.5x faster:

import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')

In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop

Answer 2

Answered by mwolverine

I found a rather straightforward way to get this to work, using the transform method in pandas.

from scipy.stats import mstats

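# Fraction of values to winsorize at each end. These are assumed values:
# 0.05 each corresponds to the 5%/95% range in the question.
lower_lim, upper_lim = 0.05, 0.05

def winsorize_series(group):
    return mstats.winsorize(group, limits=[lower_lim, upper_lim])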

grouped = features.groupby(level='DATE')
result = grouped.transform(winsorize_series)
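
The same transform pattern also works with a pandas-only function if the interpolated quantile bounds from the question are wanted instead of the mstats behaviour. A minimal sketch, assuming a small made-up `features` frame indexed by DATE; `clip_to_quantiles` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def clip_to_quantiles(s, lower=0.05, upper=0.95):
    # transform calls this once per column within each group
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# Hypothetical stand-in for `features`: 3 dates x 20 rows, 2 columns
dates = pd.date_range('2001-01-01', periods=3, freq='D').repeat(20)
features = pd.DataFrame(
    {'a': np.tile(np.arange(20.0), 3),
     'b': np.tile(np.arange(20.0) * 10, 3)},
    index=dates,
)
features.index.name = 'DATE'

result = features.groupby(level='DATE').transform(clip_to_quantiles)
```

The result keeps the original index and column layout, so it can replace `features` directly.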

Answer 3

Answered by HonzaB

A good way to approach this is with vectorization, and for that I like to use np.where.

import pandas as pd
import numpy as np
from scipy.stats import mstats
import timeit

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]

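    # Cap values at or below the 5% quantile, then those at or above the 95% quantile
    values = data.values
    out = np.where(values <= q_05, q_05,
                   np.where(values >= q_95, q_95, values))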
    return out

For comparison, I wrapped the scipy function in a function of my own:

def WinsorizeStats(data):
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out

But as you can see, even though my function is pretty fast, it is still about four times slower than the SciPy implementation:

%timeit WinsorizeCustom(data)
#1000 loops, best of 3: 842 μs per loop

%timeit WinsorizeStats(data)
#1000 loops, best of 3: 212 μs per loop
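
The nested np.where calls can also be collapsed into a single np.clip call. This is only a sketch and has not been benchmarked against the timings above; `winsorize_clip` is a hypothetical name, and it assumes the linearly interpolated quantiles that NumPy and pandas use by default.

```python
import numpy as np
import pandas as pd

def winsorize_clip(data, lower=0.05, upper=0.95):
    values = np.asarray(data, dtype=float)
    # Linearly interpolated quantiles (NumPy's default), ignoring NaN
    lo, hi = np.nanquantile(values, [lower, upper])
    # Values below lo become lo, values above hi become hi
    return np.clip(values, lo, hi)

data = pd.Series(range(20), dtype='float')
out = winsorize_clip(data)
```

Like WinsorizeCustom, this returns a NumPy array rather than a Series.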

If you are interested in reading more about speeding up pandas code, I would suggest "Optimization Pandas for speed" and "From Python to Numpy".

Answer 4

Answered by tnf

Here is a solution without using scipy.stats.mstats:

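def clip_series(s, lower, upper):
   # Clip a Series to its own lower/upper quantiles. A Series has no axis=1,
   # so clip is called without an axis argument.
   clipped = s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
   return clipped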

# Manage list of features to be winsorized
feature_list = list(features.columns)

for f in feature_list:
   features[f] = clip_series(features[f], 0.05, 0.95)
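
The loop above clips each column against quantiles of the whole column. For the per-group case in the question, one possible sketch is to compute every group's quantiles for all columns at once, broadcast them back to the rows, and clip with np.clip. This assumes a frame indexed by DATE as in the question; `winsorize_by_group` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def winsorize_by_group(df, level='DATE', lower=0.05, upper=0.95):
    grouped = df.groupby(level=level)
    keys = df.index.get_level_values(level)
    # One row of bounds per group, repeated so each data row gets its group's bounds
    lo = grouped.quantile(lower).reindex(keys).to_numpy()
    hi = grouped.quantile(upper).reindex(keys).to_numpy()
    # Element-wise clip: every cell is compared with its own group/column bound
    clipped = np.clip(df.to_numpy(), lo, hi)
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)

# Small demo frame: 2 dates x 20 rows
dates = pd.date_range('2001-01-01', periods=2, freq='D').repeat(20)
demo = pd.DataFrame(
    {'a': np.concatenate([np.arange(20.0), np.arange(100.0, 120.0)])},
    index=dates,
)
demo.index.name = 'DATE'

demo_result = winsorize_by_group(demo)
```

This avoids calling a Python function per column per group, which is where the nested apply in the question spends most of its time.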
