pandas: how to apply a floor and a cap to remove outliers
Notice: this page is a bilingual translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/42207920/
how to take floor and capping for removing outliers
Asked by Gavin
How can I calculate the 99th and 1st percentiles of each column and use them as a cap and a floor? If a value is >= the 99th percentile, it should be replaced with the 99th percentile; similarly, if a value is <= the 1st percentile, it should be replaced with the 1st percentile.
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')
If we have hundreds of columns, can we use the apply function instead of a loop?
Accepted answer by lleiou
Based on Abdou's answer, the following might save you some time:
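# Only numeric columns have quantiles; the string column lrnval is skipped
for col in df.select_dtypes(include='number').columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    # Use .loc rather than chained indexing so the assignment reaches df itself
    df.loc[df[col] <= percentiles[0], col] = percentiles[0]
    df.loc[df[col] >= percentiles[1], col] = percentiles[1]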
Or use numpy.clip:
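import numpy as np

# Only numeric columns have quantiles; the string column lrnval is skipped
for col in df.select_dtypes(include='number').columns:
    percentiles = df[col].quantile([0.01, 0.99]).values
    df[col] = np.clip(df[col], percentiles[0], percentiles[1])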
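If the goal is to avoid an explicit loop altogether, one option is DataFrame.clip, which accepts per-column bounds. The following is a minimal sketch and not part of the original answers. It assumes the sample frame from the question and that only numeric columns should be clipped; the helper name cap_floor is made up for illustration.

```python
import numpy as np
import pandas as pd

def cap_floor(frame, lower_q=0.01, upper_q=0.99):
    """Hypothetical helper: clip each numeric column to its own quantiles."""
    out = frame.copy()
    num_cols = out.select_dtypes(include="number").columns
    # One row per quantile, one column per numeric column
    bounds = out[num_cols].quantile([lower_q, upper_q])
    # axis=1 aligns each bound Series with the column labels
    out[num_cols] = out[num_cols].clip(
        lower=bounds.loc[lower_q], upper=bounds.loc[upper_q], axis=1
    )
    return out

# Sample data from the question
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

clipped = cap_floor(df)
```

The string column is returned unchanged, and the input frame is not modified because the helper works on a copy.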
Answer by Abdou
You can first define a helper function that takes a Series s and a value val as arguments and changes that value according to the conditions mentioned above:
def scale_val(s, val):
    percentiles = s.quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Then you can use pd.DataFrame.apply and pd.Series.apply:
df.apply(lambda s: s.apply(lambda v: scale_val(s,v)))
Please note that this may be a somewhat slow solution if you are dealing with a large amount of data, but I would suggest you give it a shot and see whether it solves your problem within a reasonable time.
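Part of the cost comes from scale_val recomputing the quantiles for every single value. A possibly faster variant computes them once per column and lets Series.clip do the rest. This is a rough sketch, not the answer's own code; clip_series is a hypothetical name, and the frame here contains only the numeric columns from the question.

```python
import numpy as np
import pandas as pd

def clip_series(s, lower_q=0.01, upper_q=0.99):
    """Hypothetical helper: clip one numeric Series to its own quantiles."""
    low, high = s.quantile([lower_q, upper_q])
    return s.clip(lower=low, upper=high)

# Numeric columns of the sample data from the question
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})

# DataFrame.apply calls clip_series once per column, not once per value
result = df.apply(clip_series)
```

With a string column present, one could select the numeric columns first, for example with df.select_dtypes(include='number').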
Edit:
If you only want to compute the percentiles from the rows of df where the column lrnval is equal to "learning", you can modify the function so that it calculates the percentiles from only the rows where that condition is true:
def scale_val2(s, val):
    percentiles = s[df.lrnval.eq('learning')].quantile([0.01, 0.99]).values
    if val <= percentiles[0]:
        return percentiles[0]
    elif val >= percentiles[1]:
        return percentiles[1]
    else:
        return val
Since there is a column that contains strings, I assume that you won't be doing any calculations on it. So, I would change the code as follows:
# Drop the string column by name; the character class '[^lrnval]' excluded it only by coincidence
df.drop(columns='lrnval').apply(lambda s: s.apply(lambda v: scale_val2(s, v)))
I hope this proves useful.
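The same idea can also be written without the element-wise helper: take the percentiles from the "learning" rows only, then clip every row with them. The sketch below is an illustration under the question's sample data, not the answerer's code; the names num_cols, bounds, and capped are assumptions.

```python
import numpy as np
import pandas as pd

# Sample data from the question
np.random.seed(2)
df = pd.DataFrame({'value1': np.random.randn(100), 'value2': np.random.randn(100)})
df['lrnval'] = np.where(np.random.random(df.shape[0]) >= 0.7, 'learning', 'validation')

num_cols = ['value1', 'value2']

# Bounds come from the "learning" rows only
bounds = df.loc[df['lrnval'].eq('learning'), num_cols].quantile([0.01, 0.99])

# The bounds are then applied to every row, learning and validation alike
capped = df.copy()
capped[num_cols] = df[num_cols].clip(
    lower=bounds.loc[0.01], upper=bounds.loc[0.99], axis=1
)
```

As with scale_val2, validation rows are clipped with bounds they did not contribute to, which is usually the intent when the validation set must not influence preprocessing.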