Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/45461608/

Time: 2020-09-14 04:09:39  Source: igfitidea

Remove outliers from pandas dataframe python

Tags: python, pandas, outliers

Asked by eliza.b

I have code that creates a DataFrame using pandas:

import pandas as pd
import numpy as np

# g is defined elsewhere in the asker's code and is not shown here
x = g[0].time[:111673]
y = g[0].data.f[:111673]
df = pd.DataFrame({'Time': x, 'Data': y})
#df

This prints out:

          Data          Time
0        -0.704239      7.304021
1        -0.704239      7.352021
2        -0.704239      7.400021
3        -0.704239      7.448021
4        -0.825279      7.496021

That works, but I know the data contains outliers that I want removed, so I created the DataFrame below to flag them:

newdf = df.copy()
Data = newdf.groupby('Data')
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())]
newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
#newdf

This prints out:

             Data          Time  Outlier
0        -0.704239      7.304021    False
1        -0.704239      7.352021    False
2        -0.704239      7.400021    False
3        -0.704239      7.448021    False
4        -0.825279      7.496021    False

You can't see it in this sample of my data, but there are maybe 300 outliers. I want to remove them without modifying the original DataFrame, and then plot the original and the cleaned data together as a comparison. My question is this: instead of printing out False/True, how can I just eliminate the rows where Outlier is True, so that I can eventually plot both in the same graph?

Code I have already tried:

newdf[np.abs(newdf.Data-newdf.Data.mean())<=(1.96*newdf.Data.std())]

newdf = df.copy()
def replace_outliers_with_nan(df, stdvs):
    newdf=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        df = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(df-df.mean())<=(stdvs*df.std())] 
        df[idx==False]=np.nan  
        newdf[col] = df
    return newdf

Neither of these works: both return the same number of data points as my original DataFrame, but I know that if the outliers had been removed there would be fewer points than in the original.
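
Editor's note: one possible reason for the unchanged row count (an assumption, since the question does not show how the result was counted) is that a filter expression such as newdf[...] returns a new DataFrame and leaves newdf itself as it was. The minimal sketch below uses hypothetical toy data, not the asker's, to show the difference between evaluating the expression and binding its result to a name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: nineteen ordinary readings and one extreme value.
newdf = pd.DataFrame({"Data": [0.0] * 19 + [100.0]})

# True for rows within 1.96 standard deviations of the column mean.
within = np.abs(newdf.Data - newdf.Data.mean()) <= 1.96 * newdf.Data.std()

# On its own line this expression builds a filtered copy and then discards it.
newdf[within]
print(len(newdf))      # newdf still holds every row

# Binding the result to a name keeps the filtered rows.
filtered = newdf[within]
print(len(filtered))   # one row fewer than newdf
```

If the counts were taken from newdf after a line like the first one, they would match the original regardless of how many rows the mask excluded.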

Accepted answer by jezrael

It seems you need boolean indexing with ~ to invert the condition, because you need to keep only the non-outlier rows (and drop the outliers):

df1 = df[~df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std()).values]
print(df1)
       Data      Time
0 -0.704239  7.304021
1 -0.704239  7.352021
2 -0.704239  7.400021
3 -0.704239  7.448021
4 -0.825279  7.496021
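
Editor's note: the same idea, a boolean mask inverted with ~, can be sketched on synthetic data so that the drop in row count is visible. This is an illustration under stated assumptions, not the answerer's code: the data is randomly generated with three planted outliers, and the mean and standard deviation are taken over the whole Data column (as in the asker's first attempt) rather than per group.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data (assumption: the real values come from g[0]).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time": np.arange(1000) * 0.048,
    "Data": rng.normal(loc=-0.7, scale=0.05, size=1000),
})
# Plant a few obvious outliers so there is something to remove.
df.loc[[10, 500, 900], "Data"] = [5.0, -6.0, 4.0]

# True where a value lies more than 1.96 standard deviations from the column mean.
is_outlier = (df["Data"] - df["Data"].mean()).abs() > 1.96 * df["Data"].std()

# ~ inverts the mask, so only non-outlier rows are kept; df itself is untouched.
cleaned = df[~is_outlier]

print(len(df), len(cleaned))
```

Because df is never modified, df and cleaned can both be drawn on the same axes afterwards for the comparison the question asks about.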