pandas: how to apply a floor and a cap to remove outliers

Question

Asked by Gavin

How can I calculate the 99th and 1st percentiles of each column to use as a cap and a floor? If a value is >= the 99th percentile, it should be replaced with the 99th percentile value; similarly, if a value is <= the 1st percentile, it should be replaced with the 1st percentile value.

import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0])>=0.7, 'learning', 'validation')

If we have hundreds of columns, can we use the apply function instead of a loop?

Answer 1

Accepted answer by lleiou

Based on Abdou's answer, the following might save you some time:

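# Cap and floor only the numeric columns; 'lrnval' holds strings.
for col in df.select_dtypes(include='number').columns:
    lower, upper = df[col].quantile([0.01, 0.99]).values
    # Use .loc to avoid chained assignment, which may not modify df.
    df.loc[df[col] <= lower, col] = lower
    df.loc[df[col] >= upper, col] = upper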

or using numpy.clip:

import numpy as np
# Clip each numeric column to its own 1st and 99th percentiles.
for col in df.select_dtypes(include='number').columns:
    lower, upper = df[col].quantile([0.01, 0.99]).values
    df[col] = np.clip(df[col], lower, upper)
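
Since the question asks about avoiding the loop when there are hundreds of columns, here is a minimal sketch that does the same thing without one. It assumes the sample df from the question and relies on DataFrame.clip accepting per-column thresholds as Series when axis=1 is passed. The names num_cols, lower and upper are only illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

# Work on the numeric columns only; 'lrnval' holds strings.
num_cols = df.select_dtypes(include='number').columns

# One quantile call per bound returns a Series of thresholds indexed by column name.
lower = df[num_cols].quantile(0.01)
upper = df[num_cols].quantile(0.99)

# axis=1 aligns each threshold with the column of the same name.
df[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```

After this runs, every numeric column is bounded by its own 1st and 99th percentiles, and lrnval is left untouched.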

Answer 2

Answer by Abdou

You can first define a helper function that takes a series and a value as arguments and changes that value according to the conditions mentioned above:

def scale_val(s, val):
    percentiles = s.quantile([0.01,0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val

Then you can use pd.DataFrame.apply and pd.Series.apply:

df.apply(lambda s: s.apply(lambda v: scale_val(s,v)))

Please note that this may be a somewhat slow solution if you are dealing with a large amount of data, because the helper recomputes the percentiles for every single value. I would still suggest you give it a shot and see if it solves your problem within a reasonable time.
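
If the per-value approach turns out to be too slow, one possible variation is to compute the percentiles once per column and let Series.clip do the rest. This is only a sketch: cap_floor is a hypothetical helper name, and the example uses just the two numeric columns from the question.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})

def cap_floor(s, low=0.01, high=0.99):
    """Clip a numeric Series to its own low/high quantiles (hypothetical helper)."""
    # The quantiles are computed once per column, not once per value.
    lower, upper = s.quantile([low, high])
    return s.clip(lower=lower, upper=upper)

# DataFrame.apply calls cap_floor once for each column.
capped = df.apply(cap_floor)
```

This keeps the apply style the question asked for while still returning a new DataFrame and leaving df unchanged.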

Edit:

If you only want to get the percentiles from the rows of df where the column lrnval is equal to "learning", you can modify the function to calculate the percentiles using only the rows where that condition is true:

def scale_val2(s, val):
    percentiles = s[df.lrnval.eq('learning')].quantile([0.01,0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val

Since there is a column that contains strings, I assume that you won't be doing any calculations on it. So, I would change the code as follows:

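# Drop the string column by name; the regex '[^lrnval]' is a character class and only works here by coincidence.
df.drop(columns='lrnval').apply(lambda s: s.apply(lambda v: scale_val2(s,v)))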
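
The same idea can probably be written without the per-value helper. The sketch below assumes the sample df from the question: it takes the thresholds from the "learning" rows only and then clips every row with them. The names num_cols, learning and capped are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

# Every column except the string label column.
num_cols = df.columns.drop('lrnval')

# The thresholds come from the learning rows only.
learning = df.loc[df['lrnval'].eq('learning'), num_cols]
lower = learning.quantile(0.01)
upper = learning.quantile(0.99)

# The thresholds are then applied to all rows, learning and validation alike.
capped = df.copy()
capped[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```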

I hope this proves useful.
