Classifiers in Python scikit-learn that handle nan/null

Question

Asked by anthonybell

I was wondering if there are classifiers that handle nan/null values in scikit-learn. I thought random forest regressor handles this but I got an error when I call predict.

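import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor().fit(X_train, y_train)  # data goes to fit(), not the constructor
X_test = np.array([[7, 8, np.nan]])  # predict() expects a 2-D array
y_pred = clf.predict(X_test) # Fails!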

Can I not call predict with any scikit-learn algorithm with missing values?

Edit: Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
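The k-NN idea above can be sketched with scikit-learn's `nan_euclidean_distances`, which skips coordinates that are missing in either point and rescales the distance by the number of coordinates actually compared. This is only a minimal illustration under assumptions: the toy arrays and the hand-rolled 1-nearest-neighbour lookup are made up for this sketch and are not part of the original question.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data (hypothetical): NaN marks a missing feature value
X_train = np.array([[1.0, np.nan, 3.0],
                    [np.nan, 5.0, 6.0],
                    [7.0, 8.0, 9.0]])
y_train = np.array([0, 0, 1])
X_test = np.array([[7.0, 8.0, np.nan]])

# Distances use only the coordinates present in both points
dist = nan_euclidean_distances(X_test, X_train)

# 1-nearest-neighbour prediction: take the label of the closest training row
nearest = dist.argmin(axis=1)
y_pred = y_train[nearest]
print(y_pred)  # [1], because the third training row matches on both shared features
```

Averaging over several neighbours instead of taking only the closest one would follow the same pattern with `np.argsort`.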

Edit 2 (older and wiser me): Some gbm libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: 2 children for the yes/no decision and 1 child for the missing decision. sklearn uses a binary tree.

Answer 1

Accepted answer by bakkal

I made an example that contains missing values in both the training and the test sets.

I just picked one strategy, replacing missing data with the mean, using the SimpleImputer class. There are other strategies.

from __future__ import print_function

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer


X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create our imputer to replace missing values with the mean e.g.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item, then predict; transform() expects a 2-D
    # input, so wrap the single sample in a list
    X_test_imp = imp.transform([X_test])
    print(X_test, '->', clf.predict(X_test_imp))

# Results
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
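The impute-then-predict steps above can also be bundled into a single estimator, so the imputer fitted on the training data is reused automatically at prediction time. The following is a minimal sketch, not the answer's own code: the extra training rows, the `median` strategy (one of the "other strategies" `SimpleImputer` offers), and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): NaN marks a missing value
X_train = [[0, 0, np.nan], [np.nan, 1, 1], [0, 0, 0], [1, 1, 1]]
y_train = [0, 1, 0, 1]

# The pipeline fits the imputer on the training data only, then the forest
model = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="median"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)

# At prediction time the same fitted imputer fills the gaps before the forest runs
X_test = [[0, 0, np.nan], [np.nan, 1, 1]]
pred = model.predict(X_test)
print(pred)
```

Keeping both steps in one object avoids accidentally fitting the imputer on test data.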

Answer 2

Answer by Foreever

If you are using a pandas DataFrame, you can use fillna. Here I replaced the missing data with the mean of each column.

df.fillna(df.mean(), inplace=True)
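To make the effect of the one-liner concrete, here is a small sketch on a made-up DataFrame (the column names `a` and `b` and the values are assumptions for illustration). It returns a new frame instead of using `inplace=True`, which is purely a style choice.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one missing value per column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 4.0, 6.0]})

# df.mean() skips NaN by default, so the column means are a=2.0 and b=5.0
filled = df.fillna(df.mean())
print(filled)
```

When the data is later split into training and test sets, computing the means on the training rows only and reusing them for the test rows keeps test information out of the model.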

Answer 3

Answer by DannyDannyDanny

Short answer

Sometimes missing values are simply not applicable, and imputing them is meaningless. In these cases you should use a model that can handle missing values. Most of scikit-learn's models cannot handle missing values natively. XGBoost can.

More on scikit-learn and XGBoost

Consider situations where imputation doesn't make sense.

Keep in mind that this is a made-up example.

Consider a dataset whose rows are cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and whose columns are their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).

Electric cars do not produce exhaust fumes, so the sulfur dioxide emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0, but electric cars cannot produce sulfur dioxide. Imputing the value will ruin your predictions.

As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.
