pandas: how to apply a floor and a cap to remove outliers

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/42207920/

Time: 2020-09-14 02:58:14  Source: igfitidea

how to take floor and capping for removing outliers

python pandas

Asked by Gavin

How can I calculate the 99th and 1st percentiles of each column to use as a cap and a floor? If a value is >= the 99th percentile, it should be redefined as the 99th percentile value; similarly, if a value is <= the 1st percentile, it should be redefined as the 1st percentile value.

import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0])>=0.7, 'learning', 'validation')

If we have hundreds of columns, can we use the apply function instead of a loop?

Accepted answer by lleiou

Based on Abdou's answer, the following might save you some time:

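# Floor and cap every numeric column at its 1st and 99th percentiles.
for col in df.select_dtypes(include='number').columns:
    low, high = df[col].quantile([0.01, 0.99]).values
    # Use .loc rather than chained indexing so the assignment modifies df itself.
    df.loc[df[col] <= low, col] = low
    df.loc[df[col] >= high, col] = high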

Or, using numpy.clip:

import numpy as np
# Skip non-numeric columns such as 'lrnval'; quantiles are undefined for strings.
for col in df.select_dtypes(include='number').columns:
    low, high = df[col].quantile([0.01, 0.99]).values
    df[col] = np.clip(df[col], low, high)
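
The loop can probably be avoided altogether. The following is only a sketch, not part of the original answer: it assumes that DataFrame.clip with axis=1 aligns per-column bounds by column name, and the names num_cols, lower, upper and capped are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

# Only numeric columns have meaningful percentiles.
num_cols = df.select_dtypes(include='number').columns

# One Series of bounds per side, indexed by column name.
lower = df[num_cols].quantile(0.01)
upper = df[num_cols].quantile(0.99)

# Clip all numeric columns at once; axis=1 matches the bounds to columns.
capped = df.copy()
capped[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```

Working on a copy keeps the original df available for comparison; assign back to df if in-place behaviour is wanted.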

Answer by Abdou

You can first define a helper function that takes a series and a value as arguments and changes that value according to the conditions mentioned above:

def scale_val(s, val):
    percentiles = s.quantile([0.01,0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val

Then you can use pd.DataFrame.apply and pd.Series.apply:

df.apply(lambda s: s.apply(lambda v: scale_val(s,v)))

Please note that this may be a somewhat slow solution if you are dealing with a large amount of data, because the percentiles are recomputed for every single value. Still, I would suggest you give it a shot and see if it solves your problem within a reasonable time.
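
If speed becomes a problem, one possible variation is to compute the percentiles once per column and let Series.clip do the rest. This is a sketch under that assumption, not the code from the answer; cap_series and its lower_q/upper_q parameters are hypothetical names, and the sample frame here contains only the two numeric columns.

```python
import numpy as np
import pandas as pd

# Numeric part of the sample data from the question.
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})

def cap_series(s, lower_q=0.01, upper_q=0.99):
    """Clip a numeric Series to its own lower_q and upper_q quantiles."""
    # The quantiles are computed once per column, not once per value.
    low, high = s.quantile([lower_q, upper_q])
    return s.clip(lower=low, upper=high)

# DataFrame.apply calls cap_series once for each column.
capped = df.apply(cap_series)
```

This still uses apply, as the question asks, but only one call per column is made instead of one per cell.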

Edit:

If you only want to get the percentiles for rows of df where the column lrnval is equal to "learning", you can modify the function to calculate the percentiles for only rows where that condition is true:

def scale_val2(s, val):
    percentiles = s[df.lrnval.eq('learning')].quantile([0.01,0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val

Since there is a column that contains strings, I assume that you won't be doing any calculations on it. So, I would change the code as follows:

# Drop the string column by name before applying the function to the numeric columns.
df.drop(columns='lrnval').apply(lambda s: s.apply(lambda v: scale_val2(s,v)))
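
The same idea can be sketched without the element-wise apply: take the percentiles from the "learning" rows only, then clip every row to those bounds. This is an illustrative variation rather than the answer's own code; the names learning, num_cols, lower, upper and capped are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

num_cols = df.select_dtypes(include='number').columns
learning = df['lrnval'].eq('learning')

# Percentiles are estimated from the learning rows only.
lower = df.loc[learning, num_cols].quantile(0.01)
upper = df.loc[learning, num_cols].quantile(0.99)

# The bounds are then applied to all rows, learning and validation alike.
capped = df.copy()
capped[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```

Estimating the bounds on the learning rows and applying them everywhere mirrors what scale_val2 does, since that function also replaces values in every row.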

I hope this proves useful.
