How to handle missing NaNs for machine learning in Python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/27824954/
Asked by pbu
How do you handle missing values in a dataset before applying a machine learning algorithm?
I noticed that it is not a smart thing to simply drop the missing NaN values. I usually interpolate (compute the mean) using pandas and fill in the data, which kind of works and improves classification accuracy, but may not be the best thing to do; a minimal sketch of this approach follows.
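For concreteness, here is a minimal sketch of that pandas mean-fill approach; the DataFrame and its values are hypothetical, invented purely for illustration:

import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical data).
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 4.0]})

# Replace each NaN with its column mean (simple mean imputation,
# as described above).
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)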
Here is a very important question: what is the best way to handle missing values in a data set?
For example, if you look at this dataset, only about 30% of it has original data (see the sketch after the listing below):
Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x 7039 non-null float64
left_eye_center_y 7039 non-null float64
right_eye_center_x 7036 non-null float64
right_eye_center_y 7036 non-null float64
left_eye_inner_corner_x 2271 non-null float64
left_eye_inner_corner_y 2271 non-null float64
left_eye_outer_corner_x 2267 non-null float64
left_eye_outer_corner_y 2267 non-null float64
right_eye_inner_corner_x 2268 non-null float64
right_eye_inner_corner_y 2268 non-null float64
right_eye_outer_corner_x 2268 non-null float64
right_eye_outer_corner_y 2268 non-null float64
left_eyebrow_inner_end_x 2270 non-null float64
left_eyebrow_inner_end_y 2270 non-null float64
left_eyebrow_outer_end_x 2225 non-null float64
left_eyebrow_outer_end_y 2225 non-null float64
right_eyebrow_inner_end_x 2270 non-null float64
right_eyebrow_inner_end_y 2270 non-null float64
right_eyebrow_outer_end_x 2236 non-null float64
right_eyebrow_outer_end_y 2236 non-null float64
nose_tip_x 7049 non-null float64
nose_tip_y 7049 non-null float64
mouth_left_corner_x 2269 non-null float64
mouth_left_corner_y 2269 non-null float64
mouth_right_corner_x 2270 non-null float64
mouth_right_corner_y 2270 non-null float64
mouth_center_top_lip_x 2275 non-null float64
mouth_center_top_lip_y 2275 non-null float64
mouth_center_bottom_lip_x 7016 non-null float64
mouth_center_bottom_lip_y 7016 non-null float64
Image 7049 non-null object
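The listing above is DataFrame.info() output. One quick way to quantify the sparsity is to compute each column's non-null fraction; the file path below is hypothetical and stands in for wherever this dataset lives:

import pandas as pd

# Hypothetical path to the dataset described above.
df = pd.read_csv("training.csv")

# Fraction of non-null values per column; the sparse keypoint columns
# come out around 2270/7049, roughly 0.32, matching the "only 30%"
# observation above.
print(df.notna().mean().sort_values())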
Answered by Paul Lo
What is the best way to handle missing values in a data set?
There is NO best way; each solution/algorithm has its own pros and cons. You can even mix several of them together to create your own strategy, tuning the related parameters to come up with the one that best satisfies your data (there is a lot of research and many papers on this topic).
For example, Mean Imputation is quick and simple, but it underestimates the variance, and the distribution shape is distorted by replacing NaN with the mean value. KNN Imputation, on the other hand, might not be ideal for a large data set in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value; it also assumes that the NaN attribute is correlated with other attributes.
How do you handle missing values in datasets before applying a machine learning algorithm?
In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to the powerful Imputer class in scikit-learn to check the existing APIs to use.
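As a sketch of the scikit-learn API: note that the Imputer class referenced in this answer has since been removed and replaced by SimpleImputer in sklearn.impute, which is what the snippet below assumes; the data is hypothetical:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Mean imputation: each NaN is replaced by its column's observed mean.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))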
KNN Imputation
Calculate the mean of the k nearest neighbors of the NaN point.
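A minimal sketch, assuming a modern scikit-learn (KNNImputer was added to sklearn.impute in version 0.22) and a small hypothetical matrix:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, 8.0]])

# Each NaN is filled with the mean of that feature over the k nearest
# rows, where distances use the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))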
Regression Imputation
A regression model is estimated to predict the observed values of a variable based on the other variables, and that model is then used to impute values in the cases where the variable is missing.
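One way to sketch regression imputation today is scikit-learn's IterativeImputer, which regresses each feature with missing values on the other features; it is still marked experimental, hence the extra enable import, and the data below is hypothetical:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0]])

# Fits a regressor per feature on the remaining features and uses its
# predictions to fill the missing entries.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))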
Here is a link to scikit-learn's 'Imputation of missing values' section. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.
Answered by Alex Rubinsteyn
There's no single best way to deal with missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework like PyMC. This way you'll get a distribution over possible values, instead of just a single answer. Here's an example of dealing with missing data using PyMC: http://stronginference.com/missing-data-imputation.html
If you really want to plug those holes with point estimates, then you're looking to perform "imputation". I'd steer away from simple imputation methods like mean-filling, since they really butcher the joint distribution of your features. Instead, try something like softImpute (which tries to infer the missing values via low-rank approximation). The original version of softImpute is written for R, but I've made a Python version (along with other methods like kNN imputation) here: https://github.com/hammerlab/fancyimpute
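A minimal sketch of the fancyimpute usage, assuming the package is installed (pip install fancyimpute) and using a small hypothetical matrix:

import numpy as np
from fancyimpute import SoftImpute

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [4.0, 8.0, 12.0]])

# SoftImpute fills NaNs via iterative soft-thresholded SVD, i.e. a
# low-rank approximation of the completed matrix.
print(SoftImpute().fit_transform(X))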

