pandas: How to handle missing NaNs for machine learning in Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27824954/

Date: 2020-09-13 22:49:13  Source: igfitidea

How to handle missing NaNs for machine learning in python

Tags: python, pandas, machine-learning, missing-data

Asked by pbu

How should missing values in a data set be handled before applying a machine-learning algorithm?

I have noticed that simply dropping rows with missing NaN values is not a smart thing to do. I usually interpolate (e.g., fill with the column mean) using pandas, which more or less works and improves classification accuracy, but it may not be the best approach.

So here is a very important question: what is the best way to handle missing values in a data set?

For example, in the dataset summary below, only about 30% of the rows have complete, original data.

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object
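
The pattern above (a few nearly complete columns plus many sparse ones) means that dropping incomplete rows would discard most of the data. A small pandas sketch that measures this, on a hypothetical toy frame mimicking the structure (not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: one fully populated column, one sparse column (hypothetical data).
df = pd.DataFrame({
    "nose_tip_x": np.arange(10.0),
    "left_eye_inner_corner_x": [1.0, np.nan, np.nan, 4.0, np.nan,
                                np.nan, 7.0, np.nan, np.nan, np.nan],
})

# Fraction of rows with no NaN at all -- what dropna() would keep.
complete = len(df.dropna()) / len(df)
print(complete)   # 0.3 -- dropping rows with NaNs keeps only 30% of the data
```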

Answered by Paul Lo

What is the best way to handle missing values in data set?

There is NO single best way; each solution/algorithm has its own pros and cons. You can even mix several of them together to create your own strategy, tuning the related parameters until one best fits your data (there is a great deal of research and many papers on this topic).

For example, Mean Imputation is quick and simple, but it underestimates the variance, and the distribution shape is distorted by replacing every NaN with the mean value. KNN Imputation, on the other hand, may not be ideal for a large data set in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value; it also assumes that the NaN attribute is correlated with the other attributes.

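The variance shrinkage of mean imputation is easy to see in a minimal pandas sketch (toy data, for illustration only):

```python
import numpy as np
import pandas as pd

# Toy column with missing values (hypothetical data).
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan])

filled = s.fillna(s.mean())   # mean imputation: every NaN becomes the column mean
print(filled.isna().sum())    # no missing values remain
print(s.std(), filled.std())  # the standard deviation shrinks after imputation
```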
How to handle missing values in datasets before applying a machine learning algorithm?

In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to the powerful Imputer class in scikit-learn to check the existing APIs you can use.

KNN Imputation

Calculate the mean of the k nearest neighbors of this NaN point.

Regression Imputation

A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.

Here is a link to scikit-learn's 'Imputation of missing values' section. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.

Answered by Alex Rubinsteyn

There's no single best way to deal with missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework like PyMC. This way you'll get a distribution over possible values, instead of just a single answer. Here's an example of dealing with missing data using PyMC: http://stronginference.com/missing-data-imputation.html

If you really want to plug those holes with point estimates, then you're looking to perform "imputation". I'd steer away from simple imputation methods like mean-filling, since they really butcher the joint distribution of your features. Instead, try something like softImpute (which tries to infer the missing values via low-rank approximation). The original version of softImpute was written for R, but I've made a Python version (along with other methods like kNN imputation) here: https://github.com/hammerlab/fancyimpute
