Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): http://stackoverflow.com/questions/35827863/

Remove Outliers in Pandas DataFrame using Percentiles

Tags: python, pandas, outliers

Asked by Mi Funk

I have a DataFrame df with 40 columns and many records.


df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39

For each column except the user_id column I want to check for outliers and remove the whole record, if an outlier appears.


For outlier detection on each row I decided to simply use 5th and 95th percentile (I know it's not the best statistical way):


Code I have so far:

import numpy as np

P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]

Question: How can I apply this approach to all columns (except User_id) without doing this by hand? My goal is to get a dataframe without records that had outliers.


Thank you!


Answered by Romain

The initial dataset.


print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First, remove the User_id column:

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.


low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filter values based on the computed percentiles. To do that, I use an apply by columns and that's it!

filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) & 
                                    (x < quant_df.loc[high,x.name])], axis=0)

Bring the User_id column back:

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this:

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

Checking the intermediate result (before dropping the NaN rows):

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

How to generate the test dataset


import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)
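Putting the steps above together, a self-contained version of this answer (using the same random test data) might look like the following sketch:

```python
import numpy as np
import pandas as pd

# Generate the test dataset, as in the answer above
np.random.seed(0)
nb_sample = 100
d = {'User_id': np.random.randint(0, 100, nb_sample)}
for i in range(5):
    d['Col' + str(i)] = np.random.randint(0, 100, nb_sample)
df = pd.DataFrame.from_dict(d)

# Compute the 5th/95th percentiles for every column except User_id
filt_df = df.loc[:, df.columns != 'User_id']
low, high = .05, .95
quant_df = filt_df.quantile([low, high])

# Mask out-of-range values with NaN (column by column), then drop those rows
filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
filt_df = pd.concat([df.loc[:, 'User_id'], filt_df], axis=1)
filt_df.dropna(inplace=True)
```

After `dropna`, `filt_df` contains only the rows where every column fell strictly inside its own 5th–95th percentile band.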

Answered by E.Zolduoarrati

Use this code and don't waste your time:


Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
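To see the IQR rule in action, here is a small illustrative example (the data and column names are made up for demonstration):

```python
import pandas as pd

# 100 is an obvious outlier in column 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 100],
                   'b': [10, 11, 12, 13, 14, 15]})

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Keep only rows where every column lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
clean = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
```

The row containing 100 is dropped; the other five rows survive because both of their values sit inside the fences.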

Answered by mgoldwasser

What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.


Here's an example:


import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

Original data


# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()

Winsorized data

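A quick sanity check on winsorization: it pulls the extreme values inward but never removes or adds rows, which is the key difference from the percentile-filtering answers above. A minimal sketch:

```python
import pandas as pd
from scipy.stats import mstats

test_data = pd.Series(range(30))
clipped = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))

# Same length; extremes clipped toward the interior, nothing dropped
print(len(clipped), clipped.min(), clipped.max())
```

Use winsorizing when you want to keep every record but limit the influence of extreme values; use row filtering when the outlying records themselves should be discarded.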

Answered by Rishabh Srivastava

Use an inner join. Something like this should work:

cols = df.columns.tolist()
cols.remove('User_id')  # remove User_id from the list of columns

P = np.percentile(df[cols[0]], [5, 95])
new_df = df[(df[cols[0]] > P[0]) & (df[cols[0]] < P[1])]
for col in cols[1:]:
    P = np.percentile(df[col], [5, 95])
    # inner-join on the index: keep only rows that also pass this column's filter
    keep = df.index[(df[col] > P[0]) & (df[col] < P[1])]
    new_df = new_df.loc[new_df.index.intersection(keep)]