Original source: http://stackoverflow.com/questions/35827863/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Remove Outliers in Pandas DataFrame using Percentiles
Asked by Mi Funk
I have a DataFrame df with 40 columns and many records.
df:
User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39
For each column except the user_id column I want to check for outliers and remove the whole record, if an outlier appears.
For outlier detection in each column, I decided to simply use the 5th and 95th percentiles (I know it's not the best statistical approach):
The code I have so far:
P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]
Question: How can I apply this approach to all columns (except User_id) without doing this by hand? My goal is to get a DataFrame without the records that had outliers.
Thank you!
Answered by Romain
The initial dataset.
print(df.head())
Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67
First, remove the User_id column.
filt_df = df.loc[:, df.columns != 'User_id']
Then, computing percentiles.
low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05
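The fractional bounds above (for example 95.05 from integer data) come from interpolation: by default both DataFrame.quantile and np.percentile interpolate linearly between neighbouring values. A minimal sketch on a made-up five-element series (the series s is an illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the interpolated values are easy to check by hand.
s = pd.Series([1, 2, 3, 4, 5])

# pandas takes fractions in [0, 1]; NumPy takes percentages in [0, 100].
q = s.quantile([0.05, 0.95])
p = np.percentile(s, [5, 95])

print(q.tolist())
print(p.tolist())
```

Both calls should agree, so the np.percentile bounds from the question and the quantile bounds used here are interchangeable.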
Next, filter the values based on the computed percentiles. To do that I use a column-wise apply, and that's it!
filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) &
(x < quant_df.loc[high,x.name])], axis=0)
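The reason this step leaves NaN values instead of shortening the frame is index alignment: each filtered column keeps its original row labels, and apply reassembles the columns on the union of those labels, filling the gaps with NaN. A small sketch of that behaviour, assuming a made-up three-row frame (toy and bounds are hypothetical names, not from the original answer):

```python
import pandas as pd

# Hypothetical toy frame and hand-picked bounds, for illustration only.
toy = pd.DataFrame({"a": [1, 50, 60], "b": [10, 20, 100]})
bounds = pd.DataFrame({"a": [5, 95], "b": [5, 25]}, index=["lo", "hi"])

# Same pattern as above: keep only the values strictly inside each column's bounds.
filtered = toy.apply(
    lambda x: x[(x > bounds.loc["lo", x.name]) & (x < bounds.loc["hi", x.name])],
    axis=0,
)

print(filtered)           # rejected cells show up as NaN
print(filtered.dropna())  # only rows that passed in every column remain
```

Here row 0 fails in column a and row 2 fails in column b, so only row 1 survives the dropna call.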
Bring the User_id column back.
filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)
Last, rows with NaN values can be dropped simply like this.
filt_df.dropna(inplace=True)
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40
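The same steps can also be sketched as a single boolean mask, which avoids the intermediate NaN values and the concat. This is an alternative formulation, not the answer's own code; the function name drop_percentile_outliers and the demo frame are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def drop_percentile_outliers(df, id_col="User_id", low=0.05, high=0.95):
    """Keep rows whose non-id values all lie strictly between the
    per-column low and high quantiles."""
    values = df.drop(columns=[id_col])
    bounds = values.quantile([low, high])
    # Comparing a DataFrame with a Series aligns the Series labels on the columns,
    # so each column is tested against its own bounds.
    inside = (values > bounds.loc[low]) & (values < bounds.loc[high])
    return df[inside.all(axis=1)]

# Hypothetical demo data, similar in spirit to the test dataset of this answer.
rng = np.random.RandomState(0)
demo = pd.DataFrame(
    rng.randint(0, 100, size=(100, 4)),
    columns=["User_id", "Col0", "Col1", "Col2"],
)
cleaned = drop_percentile_outliers(demo)
print(len(demo), len(cleaned))
```

Because the comparisons are strict, each column's minimum and maximum are always rejected, so at least one row is removed whenever the frame is non-empty.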
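Checking the result (the output below still contains NaN values and a User_id count of 100, so it appears to have been captured before the dropna call)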
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN
print(filt_df.describe())
User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000
How to generate the test dataset
import numpy as np
from pandas import DataFrame

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
df = DataFrame.from_dict(d)
Answered by E.Zolduoarrati
Use this code and don't waste your time:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
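Note that this snippet applies the interquartile-range rule (values more than 1.5 IQR outside the quartiles) rather than the 5th/95th-percentile rule from the question, and that run on the whole frame it tests User_id as well. A sketch of the same rule restricted to the other columns, assuming the identifier column is named User_id (the function name and demo data are hypothetical):

```python
import pandas as pd

def drop_iqr_outliers(df, id_col="User_id", k=1.5):
    """Drop rows where any non-id value lies more than k * IQR outside the quartiles."""
    values = df.drop(columns=[id_col])
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return df[~outlier.any(axis=1)]

# Hypothetical demo data: 500 in Col0 is the only value outside the fences.
demo_iqr = pd.DataFrame({
    "User_id": [1, 2, 3, 4, 5, 6],
    "Col0": [10, 11, 12, 13, 14, 500],
    "Col1": [5, 6, 7, 8, 9, 10],
})
kept_iqr = drop_iqr_outliers(demo_iqr)
print(kept_iqr)
```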
Answered by mgoldwasser
What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.
Here's an example:
import pandas as pd
from scipy.stats import mstats
%matplotlib inline
test_data = pd.Series(range(30))
test_data.plot()
# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()
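A similar clipping effect can be sketched with pandas alone by clipping a series to its own 5th and 95th percentiles. This only approximates winsorizing: as far as I know, mstats.winsorize replaces the most extreme values with the nearest retained value, whereas the quantile bounds below are interpolated, so the two results can differ slightly.

```python
import pandas as pd

test_data = pd.Series(range(30))

# Interpolated 5th and 95th percentiles of the series itself.
lower = test_data.quantile(0.05)
upper = test_data.quantile(0.95)

# Values below/above the bounds are replaced by the bounds; no rows are removed.
clipped = test_data.clip(lower=lower, upper=upper)
print(clipped.min(), clipped.max(), len(clipped))
```

Unlike the filtering approaches in the other answers, the series keeps its original length.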
Answered by Rishabh Srivastava
Use an inner join. Something like this should work:
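import numpy as np
cols = df.columns.tolist()
cols.remove('User_id')  # remove User_id from the list of columns
P = np.percentile(df[cols[0]], [5, 95])
new_df = df[(df[cols[0]] > P[0]) & (df[cols[0]] < P[1])]
for col in cols[1:]:
    P = np.percentile(df[col], [5, 95])
    # keep only the index of the rows inside the bounds (no columns),
    # so the inner join filters rows without duplicating column names
    kept = df.loc[(df[col] > P[0]) & (df[col] < P[1]), []]
    new_df = new_df.join(kept, how='inner')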