Python: detect and exclude outliers in a Pandas data frame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original source: http://stackoverflow.com/questions/23199796/
Detect and exclude outliers in Pandas data frame
Asked by AMM
I have a pandas data frame with a few columns.
Now I know that certain rows are outliers based on a certain column value.
For instance, column 'Vol' has all values around 12xx and one value is 4000 (an outlier).
Now I would like to exclude those rows whose 'Vol' column has a value like this.
So, essentially, I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from the mean.
What is an elegant way to achieve this?
Answered by CT Zhu
Use boolean indexing, as you would with a numpy.array:
import numpy as np
import pandas as pd

# Example dataset of normally distributed data.
df = pd.DataFrame({'Data': np.random.normal(size=200)})
# Keep only the rows that are within +3 to -3 standard deviations in the column 'Data'.
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# Or, if you prefer the other way around, negate the outlier condition.
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]
For a Series it is similar:
S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
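To make the filter concrete on the situation described in the question, here is a minimal sketch. The 'Vol' data below is made up for illustration (19 values around 12xx plus a single 4000) and is not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 values around 12xx and one obvious outlier.
vol = list(range(1200, 1219)) + [4000]
df = pd.DataFrame({'Vol': vol})

# Absolute distance of each value from the column mean.
deviation = np.abs(df['Vol'] - df['Vol'].mean())

# Keep only rows within 3 sample standard deviations of the mean.
filtered = df[deviation <= 3 * df['Vol'].std()]
```

Note that a single extreme value also inflates the standard deviation itself, so with very few rows it may not exceed the 3-standard-deviation threshold at all.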
Answered by tanemaki
If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
Description:
- For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
- Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
- all(axis=1) ensures that, for each row, all columns satisfy the constraint.
- Finally, the result of this condition is used to index the dataframe.
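The four steps described above can be spelled out one at a time. This is only an illustrative sketch: the column names and the planted value of 500 are assumptions, chosen so the result is deterministic.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: two well-behaved columns, with one extreme value planted in row 5.
df = pd.DataFrame({
    'a': np.linspace(-1.0, 1.0, 100),
    'b': np.linspace(0.0, 10.0, 100),
})
df.loc[5, 'b'] = 500.0

# Step 1: Z-score of every value relative to its own column
# (scipy's default uses the population standard deviation, ddof=0).
z = stats.zscore(df)

# Step 2: the sign does not matter, so take absolute values.
abs_z = np.abs(z)

# Step 3: a row is kept only if every column is below the threshold.
keep = (abs_z < 3).all(axis=1)

# Step 4: use the boolean mask to index the dataframe.
cleaned = df[keep]
```

Here only row 5 is dropped, because its 'b' value is the only one more than 3 standard deviations from its column mean.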
Answered by Alexander
This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
To filter the DataFrame so that only ONE column (e.g. 'B') has to be within three standard deviations:
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe
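The linked answer is not reproduced here, so the following is only one possible sketch of a rolling z-score, not necessarily the approach used there. The sample series, the window length, the threshold, and the choice to compute the statistics on the preceding window only (via shift(1)) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a steady upward trend with one spike at position 30.
s = pd.Series(np.arange(50, dtype=float))
s.iloc[30] = 500.0

window = 10      # assumed window length
threshold = 3.0  # assumed z-score threshold

# Statistics of the PRECEDING window only: shift(1) excludes the current value,
# so a spike does not inflate the standard deviation it is compared against.
previous = s.shift(1).rolling(window)
rolling_z = (s - previous.mean()) / previous.std()

# Drop values at or above the threshold. Entries without a full preceding
# window have a NaN z-score and are kept rather than treated as outliers.
filtered = s[~(rolling_z.abs() >= threshold)]
```

With this made-up series, only the spike at position 30 is removed; the values right after it are kept because the spike widens their reference window instead of making them look extreme.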
Answered by Oleg N. Osychenko
scipy.stats has the functions trim1() and trimboth(), which cut the outliers out in a single line, according to the ranking and a given proportion of values to remove.
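A small sketch of how these two functions behave, assuming a made-up array of eight values (not from the original answer):

```python
import numpy as np
from scipy import stats

# Hypothetical data: six typical values, one very low and one very high value.
values = np.array([1210, 1225, 5, 1240, 4000, 1218, 1232, 1205])

# trimboth removes the given proportion of values from EACH end of the ranking:
# 0.25 * 8 = 2 values are cut from the bottom and 2 from the top.
both_trimmed = stats.trimboth(values, 0.25)

# trim1 removes the proportion from ONE end only (here the largest values).
right_trimmed = stats.trim1(values, 0.25, tail='right')

# The order of the returned values is not guaranteed, so sort before inspecting.
print(np.sort(both_trimmed))
print(np.sort(right_trimmed))
```

Because trimming is purely rank-based, it removes a fixed share of the values whether or not they are extreme; in this example trimboth also discards 1205 and 1240.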
Answered by user6903745
For each column of your dataframe, you can get a quantile with:
q = df["col"].quantile(0.99)
and then filter with:
df[df["col"] < q]
If you need to remove both lower and upper outliers, combine the conditions with an AND (&):
q_low = df["col"].quantile(0.01)
q_hi = df["col"].quantile(0.99)
df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
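As a quick check of what this filter keeps, here is a sketch on a made-up column holding the integers 1 to 100 (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical column holding the integers 1..100.
df = pd.DataFrame({"col": range(1, 101)})

# 1st and 99th percentiles (pandas interpolates linearly by default).
q_low = df["col"].quantile(0.01)
q_hi = df["col"].quantile(0.99)

# Strict inequalities drop everything at or beyond the two cut-offs.
df_filtered = df[(df["col"] > q_low) & (df["col"] < q_hi)]
```

Here the smallest and largest values (1 and 100) are dropped even though nothing in this column is really extreme: a quantile-based filter always removes roughly the chosen share of rows.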
Answered by mgoldwasser
Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.
import pandas as pd
from scipy.stats import mstats
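# The next line is an IPython/Jupyter magic; omit it when running as a plain Python script.
%matplotlib inline
test_data = pd.Series(range(30))
test_data.plot()
# Winsorize: replace the lowest 5% and highest 5% of values with the nearest remaining values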
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()
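If scipy is not available, a closely related effect can be sketched with pandas alone by clipping at percentiles. This is an assumption-laden alternative, not the same operation: mstats.winsorize replaces extreme values with the nearest retained observation, while the sketch below caps them at interpolated percentile values.

```python
import pandas as pd

test_data = pd.Series(range(30), dtype=float)

# Assumed cut-offs: the 5th and 95th percentiles of the data itself.
lower = test_data.quantile(0.05)
upper = test_data.quantile(0.95)

# clip() replaces values below `lower` with `lower` and values above `upper`
# with `upper`; all rows are kept, unlike filtering.
clipped = test_data.clip(lower=lower, upper=upper)
```

As with winsorizing, the number of rows is unchanged; only the most extreme values are pulled in towards the rest of the data.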
Answered by user2708149
#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
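The same interquartile-range fences can also be applied to several numeric columns at once. The following is a hedged sketch of that extension, not part of the original answer; the column names and data are made up.

```python
import pandas as pd

# Hypothetical data: two numeric columns, each with one extreme value.
df = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "y": [-300, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Quartiles and interquartile range, computed per column.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Fences per column; the comparisons below align on column labels.
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Keep a row only if every column lies strictly inside its fences.
inside = (df > fence_low) & (df < fence_high)
df_out = df[inside.all(axis=1)]
```

In this example the row with x = 500 and the row with y = -300 are both removed, leaving the eight rows in between.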
Answered by Ted Petrou
If you like method chaining, you can get your boolean condition for all numeric columns like this:
df.sub(df.mean()).div(df.std()).abs().lt(3)
Each value of each column will be converted to True/False based on whether it is less than three standard deviations away from the mean or not.
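To go from this boolean frame to an actual row filter, one option is to reduce it with all(axis=1). The sketch below also restricts the computation to numeric columns with select_dtypes, which is an addition of this example rather than part of the answer; the data and column names are made up.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df = pd.DataFrame({
    "name": [f"row{i}" for i in range(20)],
    "a": list(range(19)) + [1000],
    "b": [float(i % 5) for i in range(20)],
})

# Restrict the z-score computation to numeric columns.
numeric = df.select_dtypes(include="number")

# True where a value is less than 3 standard deviations from its column mean.
within = numeric.sub(numeric.mean()).div(numeric.std()).abs().lt(3)

# Keep rows where every numeric column passes the test.
df_clean = df[within.all(axis=1)]
```

The text column is left out of the statistics but is still present in the filtered result, since the boolean mask is applied to the original frame.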
Answered by Wagner Cipriano
A full example with data and 2 groups follows:
Imports:
from io import StringIO
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)
Data example with 2 groups (G1: Group 1, G2: Group 2):
TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1
1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6
2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6
2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")
Read text data to pandas dataframe:
df = pd.read_csv(TESTDATA, sep=";")
Define the outliers using standard deviations:
stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1', 'G2']).transform(
    lambda group: (group - group.mean()).abs().div(group.std())) > stds
Define filtered data values and the outliers:
dfv = df[~outliers.Value]
dfo = df[outliers.Value]
Print the result:
print('\n'*5, 'All values with a decimal part of .1 are non-outliers. On the other hand, all values with a decimal part of .6 are outliers.')
print('\nDef DATA:\n%s\n\nFiltered Values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))
Answered by luminousmen
My function for dropping outliers:
import numpy as np

def drop_outliers(df, field_name):
    # Compute the quartiles once, so both fences are based on the unmodified data.
    q1, q3 = np.percentile(df[field_name], [25, 75])
    distance = 1.5 * (q3 - q1)
    # Drop rows above the upper fence or below the lower fence, modifying df in place.
    df.drop(df[(df[field_name] > q3 + distance) | (df[field_name] < q1 - distance)].index, inplace=True)