Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), including the original address. Original StackOverflow question: http://stackoverflow.com/questions/23199796/

Time: 2020-08-19 02:28:49  Source: igfitidea

Detect and exclude outliers in Pandas data frame

Tags: python, pandas, filtering, dataframe, outliers

Asked by AMM

I have a pandas data frame with a few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance, column 'Vol' has all values around 12xx and one value is 4000 (an outlier).

Now I would like to exclude the rows whose 'Vol' value is an outlier like this.

So, essentially, I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.

What is an elegant way to achieve this?

Answer by CT Zhu

Use boolean indexing as you would do in numpy.array

import numpy as np
import pandas as pd

# Example dataset of normally distributed data.
df = pd.DataFrame({'Data': np.random.normal(size=200)})

# Keep only the rows that are within +3 to -3 standard deviations in the column 'Data'.
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]

# Or, if you prefer the other way around:
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
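
To connect this back to the question, here is a minimal sketch that applies the same kind of boolean mask to a made-up 'Vol' column. The data (50 values around 12xx plus a single 4000) is invented for illustration. One caveat worth keeping in mind: in very small samples a single extreme value inflates the standard deviation so much that it can itself stay within 3 standard deviations, so this rule works best with a reasonable number of rows.

```python
import numpy as np
import pandas as pd

# Invented data: 50 values around 12xx plus one obvious outlier (4000).
df = pd.DataFrame({'Vol': list(range(1200, 1250)) + [4000]})

# Distance of each value from the column mean, in standard deviations.
z = (df['Vol'] - df['Vol'].mean()).abs() / df['Vol'].std()

# Keep only the rows within 3 standard deviations of the mean.
filtered = df[z <= 3]
print(len(df), len(filtered))
```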

Answer by tanemaki

If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Description:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
  • all(axis=1) ensures that, for each row, all columns satisfy the constraint.
  • Finally, the result of this condition is used to index the dataframe.

Answer by Alexander

This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]

See the related question "Rolling Z-score applied to pandas dataframe" for how to apply this z-score on a rolling basis.
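
The linked question is not reproduced here, but as a rough sketch, a rolling z-score can be built from pandas' rolling mean and standard deviation. The helper name rolling_zscore and the window size of 20 are illustrative assumptions, not taken from the linked answer.

```python
import numpy as np
import pandas as pd

def rolling_zscore(series, window):
    # Hypothetical helper: z-score of each value relative to the trailing
    # window of `window` observations (the current value is included).
    rolling = series.rolling(window=window)
    return (series - rolling.mean()) / rolling.std()

s = pd.Series(np.random.normal(size=200))
z = rolling_zscore(s, window=20)

# The first window-1 entries are NaN because the window is not yet full;
# NaN < 3 evaluates to False, so those rows are dropped by the mask as well.
filtered = s[z.abs() < 3]
```

Whether the leading rows with an incomplete window should be dropped or kept is a design choice; this sketch simply lets the NaN comparison drop them.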

Answer by Oleg N. Osychenko

scipy.stats has the functions trim1() and trimboth() to cut out the outliers in a single line, according to the ranking and a given proportion of values to remove.
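
As a minimal sketch of how these two functions behave (the sample values are made up): both take the proportion of values to cut, and both operate on an array of values, so they return the trimmed values rather than a filtered dataframe. The order of the returned values is not guaranteed, hence the explicit sort.

```python
import numpy as np
from scipy import stats

# Invented sample: nine values around 12xx plus one outlier.
vol = np.array([1201, 1205, 1210, 1215, 1220, 1225, 1230, 1235, 1240, 4000])

# Cut 10% of the values from each end (here: one value per side).
both = np.sort(stats.trimboth(vol, 0.1))

# Cut 10% of the values from the upper end only.
upper = np.sort(stats.trim1(vol, 0.1, tail='right'))

print(both)
print(upper)
```

Note that trimboth() also discards the smallest value even when it is not an outlier, because the cut is based purely on rank.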

Answer by user6903745

For each column of your dataframe, you can get a quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If you need to remove both lower and upper outliers, combine the two conditions with an AND (&) operator:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
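
The two steps above can be wrapped in a small helper. The function name filter_by_quantiles and its default cut-offs are illustrative assumptions; also note that this sketch keeps the boundary values themselves (Series.between is inclusive by default), unlike the strict comparisons above.

```python
import pandas as pd

def filter_by_quantiles(df, col, low=0.01, high=0.99):
    # Hypothetical helper: keep rows whose value in `col` lies between
    # the `low` and `high` quantiles of that column (bounds included).
    q_low = df[col].quantile(low)
    q_hi = df[col].quantile(high)
    return df[df[col].between(q_low, q_hi)]

# Invented data: the integers 1..99 plus one outlier.
df = pd.DataFrame({"col": list(range(1, 100)) + [4000]})
df_filtered = filter_by_quantiles(df, "col")
print(len(df), len(df_filtered))
```

A quantile filter always removes a fixed share of the rows, whether or not they are really outliers, so the cut-offs should be chosen with the data in mind.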

Answer by mgoldwasser

Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.

import pandas as pd
from scipy.stats import mstats
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

[Figure: Original data]

# Winsorize: replace the lowest and highest 5% of values with the nearest remaining value
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()

[Figure: Winsorized data]
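
Since the plots are not shown here, a small numeric sketch may help. With 30 values and 5% limits, one value at each end is replaced by its nearest remaining neighbour, while the length of the data stays the same. This assumes scipy's default handling of the limits; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import mstats

data = np.arange(30)

# Winsorize 5% at each end: int(0.05 * 30) = 1 value per side is replaced.
winsorized = np.asarray(mstats.winsorize(data, limits=[0.05, 0.05]))

# Compare the two ends and confirm that no rows were dropped.
print(winsorized[:3], winsorized[-3:])
print(len(data), len(winsorized))
```

Unlike the filtering approaches in the other answers, no rows are removed, which can matter when the rows must stay aligned with other columns.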

Answer by user2708149

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
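
A usage sketch for the function above, on made-up data. The function is repeated so that the block runs on its own; the 'Vol' column name is borrowed from the question.

```python
import pandas as pd

def remove_outlier(df_in, col_name):
    # Same logic as above: keep rows strictly inside the 1.5 * IQR fences.
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    return df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

# Invented data: nine values around 12xx plus one outlier.
df = pd.DataFrame({'Vol': [1201, 1205, 1210, 1215, 1220, 1225, 1230, 1235, 1240, 4000]})
cleaned = remove_outlier(df, 'Vol')
print(cleaned['Vol'].tolist())
```

Because the fences are built from quartiles, they are much less affected by the outlier itself than a mean and standard deviation would be, which helps on small samples like this one.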

Answer by Ted Petrou

If you like method chaining, you can get your boolean condition for all numeric columns like this:

df.sub(df.mean()).div(df.std()).abs().lt(3)

Each value of each column will be converted to True/False based on whether it is less than three standard deviations away from the mean or not.
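
The chained expression only builds the boolean table; to actually filter the rows, it can be reduced per row with all(axis=1), as in the earlier answers. A minimal sketch on random data (the column names are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

# True where a value is less than 3 standard deviations from its column mean.
mask = df.sub(df.mean()).div(df.std()).abs().lt(3)

# Keep only the rows where every column passes the test.
df_filtered = df[mask.all(axis=1)]
```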

Answer by Wagner Cipriano

A full example with data and 2 groups follows:

Imports:

from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

Example data with 2 grouping columns, G1 (group 1) and G2 (group 2):

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

Read text data to pandas dataframe:

df = pd.read_csv(TESTDATA, sep=";")

Define the outliers using standard deviations:

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

Define filtered data values and the outliers:

dfv = df[~outliers.Value]
dfo = df[outliers.Value]

Print the result:

print('\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.')
print('\nDef DATA:\n%s\n\nFiltered Values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))

Answer by luminousmen

My function for dropping outliers:

import numpy as np

def drop_outliers(df, field_name):
    # Compute the quartiles once, up front, so both fences are based on the same (unmodified) data.
    q1, q3 = np.percentile(df[field_name], [25, 75])
    distance = 1.5 * (q3 - q1)
    # Drop, in place, the rows that fall outside the 1.5 * IQR fences.
    df.drop(df[(df[field_name] > q3 + distance) | (df[field_name] < q1 - distance)].index, inplace=True)