Checking a Pandas Dataframe for Outliers

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/48087534/

Time: 2020-09-14 05:00:42

Checking a Pandas Dataframe for Outliers

python, pandas

Asked by Adi

Plot of sensor

I have an experiment on a sensor that contains 8 electrodes. The image above is a plot of the electrode output vs time. As you can see on the plot, one of the 8 electrodes is clearly an outlier (probably due to some electrical failure). The plot is generated from a Pandas DataFrame, which essentially has 10 columns (1 for time, 8 for the electrodes, and 1 averaging the 8 electrodes).

What is the best way to statistically detect that one of the columns is an outlier? I imagine the outlier column can then just be dropped from the dataframe.

Thanks!

Answered by Shaz

Scatter plots or distribution plots are good for spotting outliers. But in the context of this question about pandas DataFrames, here's how I would do it.

df.describe()

This will give you a summary table with the count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles of each column. Compare each column's max with its 75th percentile: a max that sits far above the 75th percentile points to an outlier.
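As a minimal sketch of that comparison, the snippet below runs describe() on a made-up DataFrame and subtracts the 75th percentile from the max for each column. The column name 'Sensor Value' is taken from the answer; the readings themselves are invented for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical readings (assumption): seven ordinary values and one extreme one.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 95]})

# Summary table: count, mean, std, min, 25%, 50%, 75%, max for each numeric column.
summary = df.describe()
print(summary)

# Distance between each column's maximum and its 75th percentile.
# A large gap relative to the spread of the data hints at an outlier.
gap = summary.loc["max"] - summary.loc["75%"]
print(gap)
```

How large a gap counts as suspicious is a judgment call that depends on the scale of the data.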

Then df['Sensor Value'].value_counts() should give you the frequency of each value. The outliers show up here as the larger values that occur with low frequency.

Get their indexes and just drop them using df.drop(indexes_list, inplace=True).
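The two steps above (count the values, then drop the offending rows) might look like the sketch below. The data and the cutoff of 50 are assumptions made up for the example; in practice the cutoff would come from inspecting the value_counts() output.

```python
import pandas as pd

# Hypothetical readings (assumption): one value is far larger than the rest.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 95]})

# Frequency of each distinct value, most frequent first.
counts = df["Sensor Value"].value_counts()
print(counts)

# Assumed cutoff, picked by eye from the counts above.
threshold = 50

# Index labels of the rows whose value exceeds the cutoff.
indexes_list = df.index[df["Sensor Value"] > threshold]

# Remove those rows from the DataFrame in place.
df.drop(indexes_list, inplace=True)
print(df)
```

Passing inplace=True modifies df directly; leaving it out returns a new DataFrame and keeps the original intact.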

EDIT: You could also check for outliers using mean +/- 3 * standard deviation.

Example code:

# Rows where col lies more than 3 standard deviations from its mean, on either side
outliers = df[(df[col] - df[col].mean()).abs() > 3 * df[col].std()]
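For a self-contained run of the mean +/- 3 * standard deviation check, here is a sketch on invented data. The 20 readings and the column name are assumptions for illustration. The sample is deliberately not tiny: with only a handful of rows, a single extreme value inflates the standard deviation so much that it may never cross the 3-sigma line.

```python
import pandas as pd

# Hypothetical readings (assumption): 19 values near 10 and one extreme value.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 9, 10, 11,
                                    10, 12, 9, 10, 11, 10, 9, 11, 10, 95]})
col = "Sensor Value"

mean = df[col].mean()
std = df[col].std()  # sample standard deviation (ddof=1), the pandas default

# Bounds of the mean +/- 3 * standard deviation band.
lower, upper = mean - 3 * std, mean + 3 * std

# Rows falling outside the band on either side.
outliers = df[(df[col] < lower) | (df[col] > upper)]
print(outliers)

# The same frame with those rows removed.
cleaned = df.drop(outliers.index)
print(len(cleaned))
```

Note that this flags individual rows within one column; deciding that an entire electrode column is faulty would need a comparison across columns instead.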