Pandas dataframe - remove outliers
Original question: http://stackoverflow.com/questions/46245035/
This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by bayerb
Given a pandas dataframe, I want to exclude the rows that are outliers (absolute Z-score of 3 or more) in one of the columns.
The dataframe looks like this:
df.dtypes
_id object
_index object
_score object
_source.address object
_source.district object
_source.price float64
_source.roomCount float64
_source.size float64
_type object
sort object
priceSquareMeter float64
dtype: object
For the line:
dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
The following exception is raised:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
2239 """
2240 a = np.asanyarray(a)
-> 2241 mns = a.mean(axis=axis)
2242 sstd = a.std(axis=axis, ddof=ddof)
2243 if axis and mns.ndim < a.ndim:
/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
---> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
And the return value of
np.isreal(df['_source.price']).all()
is
True
Why do I get the above exception, and how can I exclude the outliers?
Answered by elf
Use a boolean mask like this whenever you have this sort of issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example data
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]  # keep only the rows within 3 standard deviations of the mean of column 'Data'
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # or the other way around
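As a minimal sketch of the same mask applied to a column named like the one in the question: the data below is hypothetical (200 ordinary prices plus one injected extreme value), and `mask` and `df_clean` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 well-behaved prices plus one extreme value.
rng = np.random.default_rng(0)
prices = rng.normal(loc=100.0, scale=10.0, size=200)
df = pd.DataFrame({"_source.price": np.append(prices, 10_000.0)})

col = df["_source.price"]
# True for rows within 3 standard deviations of the column mean.
mask = (col - col.mean()).abs() <= 3 * col.std()
df_clean = df[mask]

print(len(df), len(df_clean))
```

Note that pandas' `std()` uses `ddof=1` by default, while `scipy.stats.zscore` uses `ddof=0`, so the two approaches can disagree slightly on borderline rows.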
Answered by Herpes Free Engineer
If one wants to use the interquartile range (IQR) of a given dataset instead:
def Remove_Outlier_Indices(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    trueList = ~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
    return trueList
Based on the above function, the subset of non-outliers can be obtained from the dataset's own statistics:
# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Index List of Non-Outliers
nonOutlierList = Remove_Outlier_Indices(df)
# Non-Outlier Subset of the Given Dataset
dfSubset = df[nonOutlierList]
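One caveat: indexing a DataFrame with a boolean DataFrame, as in `df[nonOutlierList]`, keeps the original shape and replaces the flagged cells with NaN rather than dropping rows. If the goal is to drop those rows, the mask can be reduced to one boolean per row first. The sketch below assumes that goal; `iqr_inlier_rows` and its `k` parameter are hypothetical names, and the data is made up.

```python
import pandas as pd

def iqr_inlier_rows(df, k=1.5):
    """Hypothetical helper: True for rows where no column falls outside the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Same test as in the answer above, evaluated per cell.
    inside = ~((df < (q1 - k * iqr)) | (df > (q3 + k * iqr)))
    # Collapse to one boolean per row so indexing drops whole rows.
    return inside.all(axis=1)

# Hypothetical data: the values 0..19 plus one extreme value.
df = pd.DataFrame({"Data": list(range(20)) + [1000.0]})
df_subset = df[iqr_inlier_rows(df)]

print(len(df), len(df_subset))
```

For a frame with non-numeric columns, the helper would need to be applied to the numeric columns only, for example `df.select_dtypes('number')`.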
Answered by Bruno F Souza
I believe you could create a boolean filter for the outliers and then select the opposite of it.
import numpy as np
from scipy import stats
outliers = np.abs(stats.zscore(df['_source.price'])) >= 3  # True where |z| is 3 or more
df_without_outliers = df[~outliers]
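Restricting the z-score to the numeric column also sidesteps the TypeError from the question. A likely cause, assuming the object columns contain None (which the `'NoneType' and 'NoneType'` message suggests), is that `stats.zscore(df)` converts the whole frame to an object array and then tries to sum it. Separately, `axis` in `.all()` expects 0 or 1, not a column name. The sketch below reproduces the failure on a hypothetical frame and computes the z-score by hand with `ddof=0`, the default of `scipy.stats.zscore`, so it needs only numpy and pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: an object column holding None plus a numeric price column
# with 49 ordinary values and one extreme value.
n = 50
df = pd.DataFrame({
    "_source.address": [None] * n,
    "_source.price": [100.0 + (i % 5) for i in range(n - 1)] + [10_000.0],
})

# Converting the whole frame gives an object array; its mean adds None to None.
try:
    np.asanyarray(df).mean(axis=0)
    whole_frame_failed = False
except TypeError:
    whole_frame_failed = True

# Z-score of the numeric column only (population standard deviation, ddof=0).
price = df["_source.price"]
z = (price - price.mean()) / price.std(ddof=0)
df_without_outliers = df[z.abs() < 3]

print(whole_frame_failed, len(df_without_outliers))
```

If the price column itself contains NaN, the comparison is False for those rows and they are dropped here, so they may need separate handling.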