Classifiers in scikit-learn that handle nan/null
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/30317119/
Asked by anthonybell
I was wondering if there are classifiers in scikit-learn that handle nan/null values. I thought the random forest regressor handled this, but I got an error when I called predict.
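import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor()
clf.fit(X_train, y_train)
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2D array of samples
y_pred = clf.predict(X_test)  # Fails!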
Can I not call predict with any scikit-learn algorithm with missing values?
Edit: Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
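The k-NN idea in the edit above can be sketched by hand. This is only a minimal illustration, not how scikit-learn's own neighbor estimators work: it uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`, which compares only the coordinates present in both rows and rescales the result. The helper name `knn_predict_with_nans` is made up for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def knn_predict_with_nans(X_train, y_train, X_query, k=1):
    """Average the targets of the k nearest training rows (hypothetical helper).

    Distances are computed only over coordinates present in both rows.
    """
    distances = nan_euclidean_distances(X_query, X_train)
    # argsort places NaN distances (no shared coordinates) last
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1.0, 2.0])
X_test = np.array([[7, 8, np.nan]])

y_pred = knn_predict_with_nans(X_train, y_train, X_test, k=1)
print(y_pred)
```

Here the query shares only the first column with the first training row and only the second column with the second one, so the comparison rests on a single coordinate each time. Whether that is trustworthy depends on your data.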
Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: 2 children for the yes/no decision and 1 child for the missing decision. sklearn uses a binary tree.
Accepted answer by bakkal
I made an example that contains missing values in both the training and the test sets.
I just picked a strategy to replace missing data with the mean, using the SimpleImputer class. There are other strategies.
from __future__ import print_function
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create our imputer to replace missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item (wrapped in a list, since transform expects 2D input), then predict
    X_test_imp = imp.transform([X_test])
    print(X_test, '->', clf.predict(X_test_imp))

# Results
# [0, 0, nan] -> [0]
# [0, nan, nan] -> [0]
# [nan, 1, 1] -> [1]
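As one of the "other strategies", here is a sketch that swaps the mean for the median and wraps the imputer and the classifier in a pipeline, so the same imputation is applied at fit and predict time. The choice of `strategy='median'` and of `random_state=0` is mine, for illustration only; it is not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test = [[0, 0, np.nan], [0, np.nan, np.nan], [np.nan, 1, 1]]

# The pipeline learns the column medians on the training data only,
# then reuses them whenever predict is called.
model = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy='median'),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
print(predictions)
```

With only two training rows the forest's votes depend heavily on the bootstrap samples, so treat the predictions as a smoke test and not as a meaningful result.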
Answer by Foreever
If you are using a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.
df.fillna(df.mean(), inplace=True)
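If you have separate training and test frames, one option is to compute the column means on the training data and reuse them for the test data. A small sketch with made-up frames (the column names `a` and `b` and all values are invented for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical data, only to show the mechanics
train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 7.0]})
test = pd.DataFrame({'a': [np.nan, 8.0], 'b': [9.0, np.nan]})

train_means = train.mean()  # NaN values are skipped by default
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)  # reuse the training means

print(test_filled)
```

Reusing the training means keeps information from the test set out of the preprocessing step.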
Answer by DannyDannyDanny
Short answer
Sometimes missing values are simply not applicable. Imputing them is meaningless. In these cases you should use a model that can handle missing values. Scikit-learn's models cannot handle missing values. XGBoost can.
More on scikit-learn and XGBoost
As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.
Consider situations when imputation doesn't make sense.
Keep in mind this is a made-up example.
Consider a dataset with rows of cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and columns with their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).
Electric cars do not produce exhaust fumes, so the sulfur dioxide emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0, but electric cars cannot produce sulfur dioxide. Imputing the value will ruin your predictions.
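To make the made-up example concrete, here is a sketch with invented emission numbers (the column name `so2_emission` and all values are hypothetical). It shows what mean imputation does to the electric car's row.

```python
import numpy as np
import pandas as pd

# Invented numbers; only the structure matters here
cars = pd.DataFrame(
    {'so2_emission': [0.8, np.nan, 0.4]},
    index=['Danho Diesel', 'Estal Electric', 'Hesproc Hybrid'],
)

# Mean imputation assigns the electric car the average of the other two
imputed = cars.fillna(cars.mean())
electric_value = imputed.loc['Estal Electric', 'so2_emission']
print(electric_value)

# An alternative that keeps the information: record that the value is absent
cars['so2_missing'] = cars['so2_emission'].isna()
```

After imputation the electric car looks like a mid-range emitter, which is exactly the kind of distortion described above.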