Checking a Pandas DataFrame for outliers

Question

Asked by Adi

I have an experiment on a sensor that contains 8 electrodes. The image above is a plot of the electrode output vs time. As you can see on the plot, one of the 8 electrodes is clearly an outlier (probably due to some electrical failure). The plot is generated from a Pandas DataFrame, which essentially has 10 columns (1 for time, 8 for the electrodes, and 1 averaging the 8 electrodes).

What is the best way to statistically detect that one of the columns is an outlier? I imagine the outlier column can then just be dropped from the dataframe.

Thanks!

Answer 1

Answered by Shaz

Scatter plots or distribution plots are good for spotting outliers. In the context of pandas DataFrames, though, here is how I would do it.

df.describe()

This gives you a summary table with the count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles of each numeric column. Compare each column's max with its 75th percentile: a max far above the 75th percentile points to an outlier.
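As a minimal sketch of that comparison, assuming a made-up column named 'Sensor Value' with a single spike (the numbers are illustrative, not from the question's data):

```python
import pandas as pd

# Hypothetical readings: seven ordinary values and one large spike.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 95]})

summary = df.describe()

# describe() labels its percentile rows '25%', '50%' and '75%'.
q3 = summary.loc["75%", "Sensor Value"]
col_max = summary.loc["max", "Sensor Value"]

print(summary)
print("max minus 75th percentile:", col_max - q3)
```

A large gap between the max and the 75th percentile is only a hint; how large counts as suspicious depends on the data.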

Then df['Sensor Value'].value_counts() should give you the frequency of each value. The outliers show up there as unusually large values that occur with low frequency.
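A small sketch of the frequency check, again on made-up data. Sorting the counts by value (an addition here, not part of the answer) puts the extremes at the ends of the listing:

```python
import pandas as pd

# Hypothetical readings: seven ordinary values and one large spike.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 95]})

# Frequency of each distinct value, most frequent first.
counts = df["Sensor Value"].value_counts()

# Re-sort by the value itself so rare extreme values sit at the ends.
counts_by_value = counts.sort_index()

print(counts_by_value)
```

This works best for discrete or rounded readings. With continuous floats nearly every value is unique, so it may help to pass bins= to value_counts() to count values per interval instead.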

Get the index labels of those rows and drop them with df.drop(indexes_list, inplace=True).
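One way to build indexes_list and drop the rows might look like this. The threshold of 50 is an arbitrary assumption for the made-up data, and the sketch returns a new frame rather than using inplace=True so the original stays available for comparison:

```python
import pandas as pd

# Hypothetical readings: seven ordinary values and one large spike.
df = pd.DataFrame({"Sensor Value": [10, 11, 9, 10, 12, 11, 10, 95]})

# Assumed cutoff, chosen by eye for this toy data.
threshold = 50

# Index labels of the rows whose value exceeds the cutoff.
indexes_list = df.index[df["Sensor Value"] > threshold].tolist()

# drop() removes rows by index label and returns a new DataFrame.
cleaned = df.drop(indexes_list)

print(indexes_list)
print(cleaned)
```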

EDIT: You could also flag as outliers any values that fall outside mean +/- 3 * standard deviation.

Example code:

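# col is the name of the column to check, e.g. 'Sensor Value'; this covers both sides of the mean
outliers = df[(df[col] - df[col].mean()).abs() > 3 * df[col].std()]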
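The checks above flag individual rows, whereas the question asks about a whole column. A 3-standard-deviation rule applied across only 8 electrodes cannot trigger, because with 8 values the largest possible z-score (using the sample standard deviation) is 7/sqrt(8), about 2.47. The following is a rough sketch of a different, median-based heuristic, not the method from this answer: it measures how far each electrode strays on average from the row-wise median of all electrodes. The column names, the synthetic data, and the factor of 10 are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: a time column plus eight
# electrode columns, where electrode "e5" carries a constant fault offset.
time = [0.1 * i for i in range(50)]
offsets = [0.00, 0.01, 0.02, 0.03, 5.00, 0.05, 0.06, 0.07]
data = {"time": time}
for i, off in enumerate(offsets, start=1):
    data[f"e{i}"] = [t + off for t in time]
df = pd.DataFrame(data)

# Only the electrode columns take part in the comparison.
electrode_cols = [c for c in df.columns if c != "time"]
electrodes = df[electrode_cols]

# Row-wise median across electrodes: a reference that one bad column barely moves.
reference = electrodes.median(axis=1)

# Mean absolute deviation of each electrode column from that reference.
deviation = electrodes.sub(reference, axis=0).abs().mean()

# Assumed cutoff: ten times the typical (median) column deviation.
suspect_cols = deviation[deviation > 10 * deviation.median()].index.tolist()

# Drop the flagged columns from the frame.
cleaned = df.drop(columns=suspect_cols)

print(deviation)
print("suspect columns:", suspect_cols)
```

If the frame also holds a column averaging the electrodes, as in the question, it should be left out of electrode_cols and recomputed after the drop, since the faulty electrode has already contaminated it.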
