pandas: predicting NA (missing values) with machine learning

Question

Asked by mayyer

I have a huge data set and want to predict (not replace) missing values in Python with a machine learning algorithm such as an SVM or a random forest.

My data set looks like this:

ID    i0     i1     i2     i3     i4     i5     j0     j1     j2     j3     j4     j5
0   0.19  -0.02  -0.20   0.07  -0.06  -0.06  -0.06   1.48   0.33  -0.46  -0.37  -0.11
1  -0.61  -0.19  -0.10  -0.1   -0.21   0.63   NA     NA     NA     NA     NA     NA
2  -0.31  -0.14  -0.64  -0.5   -0.20  -0.30  -0.08   1.56  -0.2   -0.33   0.81  -0.03
.
.

What I want to do:
Using the complete rows (IDs 0 and 2), I want to train a model that maps i0 to i5 onto j0 to j5. The model should then predict the NA values in j0 to j5 for ID 1.

Question:
As the data is not continuous (the time steps end at i5 and start again at j0), is it possible to use some kind of regression?

What should X and y look like for the .fit(X, y) and .predict(X) calls in this example?

Answer 1

Answered by Julien Marrec

In your case, you're looking at a multi-output regression problem:

A regression problem, as opposed to classification, since you are trying to predict a numeric value and not a class, state variable, or category
Multi-output, since you are trying to predict 6 values for each data point

You can read more in the sklearn documentation on multiclass and multioutput algorithms.
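To make the shapes of X and y concrete, here is a minimal sketch built from the three rows shown in the question. The names df_example, complete, X_fit, y_fit and X_missing are illustrative only and do not come from the original post.

```python
import numpy as np
import pandas as pd

icols = ["i0", "i1", "i2", "i3", "i4", "i5"]
jcols = ["j0", "j1", "j2", "j3", "j4", "j5"]

# The three rows from the question; ID 1 has no j values
rows = [
    [0.19, -0.02, -0.20, 0.07, -0.06, -0.06, -0.06, 1.48, 0.33, -0.46, -0.37, -0.11],
    [-0.61, -0.19, -0.10, -0.1, -0.21, 0.63] + [np.nan] * 6,
    [-0.31, -0.14, -0.64, -0.5, -0.20, -0.30, -0.08, 1.56, -0.2, -0.33, 0.81, -0.03],
]
df_example = pd.DataFrame(rows, columns=icols + jcols)

# Rows where every j column is present can be used for fitting
complete = df_example[jcols].notna().all(axis=1)

X_fit = df_example.loc[complete, icols]       # inputs for .fit(X, y)
y_fit = df_example.loc[complete, jcols]       # targets for .fit(X, y)
X_missing = df_example.loc[~complete, icols]  # inputs for .predict(X)

print(X_fit.shape, y_fit.shape, X_missing.shape)  # (2, 6) (2, 6) (1, 6)
```

Both X and y are two-dimensional with one row per sample: X holds the six i columns and y the six j columns. Two rows are of course far too few to train anything useful; the sketch only illustrates the layout.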

Here I'm going to show you how you can use sklearn.multioutput.MultiOutputRegressor with a sklearn.ensemble.RandomForestRegressor to predict your values.
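As a side note, MultiOutputRegressor fits one independent regressor per target column, while RandomForestRegressor can also be fitted on a two-dimensional y directly. The sketch below contrasts the two on small synthetic data; the variable names (X_demo, y_demo, wrapped, native) and the reduced n_estimators=20 are assumptions chosen for illustration, not settings from the original answer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Small synthetic problem: 6 features, 6 targets
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 n_informative=3, n_targets=6,
                                 noise=0.02, random_state=0)

# Wrapper approach: one forest is trained per target column
wrapped = MultiOutputRegressor(RandomForestRegressor(n_estimators=20,
                                                     random_state=0))
wrapped.fit(X_demo, y_demo)

# Native approach: a single forest predicts all targets jointly
native = RandomForestRegressor(n_estimators=20, random_state=0)
native.fit(X_demo, y_demo)

print(len(wrapped.estimators_))           # 6, one fitted forest per target
print(wrapped.predict(X_demo[:3]).shape)  # (3, 6)
print(native.predict(X_demo[:3]).shape)   # (3, 6)
```

Which variant scores better depends on the data, so it may be worth validating both on a held-out split.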

Construct some dummy data

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Generate 1000 samples with 6 features and 6 regression targets
X, y = make_regression(n_samples=1000, n_features=6,
                       n_informative=3, n_targets=6,
                       tail_strength=0.5, noise=0.02,
                       shuffle=True, coef=False, random_state=0)

# Convert to a pandas dataframe like in your example
icols = ['i0', 'i1', 'i2', 'i3', 'i4', 'i5']
jcols = ['j0', 'j1', 'j2', 'j3', 'j4', 'j5']
df = pd.concat([pd.DataFrame(X, columns=icols),
                pd.DataFrame(y, columns=jcols)], axis=1)

# Introduce a few np.nans in there
df.loc[0, jcols] = np.nan
df.loc[10, jcols] = np.nan
df.loc[100, jcols] = np.nan

df.head()

Out:
     i0    i1    i2    i3    i4    i5     j0     j1     j2     j3     j4  \
0 -0.21 -0.18 -0.06  0.27 -0.32  0.00    NaN    NaN    NaN    NaN    NaN   
1  0.65 -2.16  0.46  1.82  0.22 -0.13  33.08  39.85   9.63  13.52  16.72   
2 -0.75 -0.52 -1.08  0.14  1.12 -1.05  -0.96 -96.02  14.37  25.19 -44.90   
3  0.01  0.62  0.20  0.53  0.35 -0.73   6.09 -12.07 -28.88  10.49   0.96   
4  0.39 -0.70 -0.55  0.10  1.65 -0.69  83.15  -3.16  93.61  57.44 -17.33   

      j5  
0    NaN  
1  17.79  
2 -77.48  
3 -35.61  
4  -2.47

Exclude the NaN rows first, then split into 75% train and 25% test

The split lets us validate the model on data it was not trained on.

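from sklearn.model_selection import train_test_split

# Boolean mask: True for rows where every j column has a value
notnans = df[jcols].notnull().all(axis=1)
df_notnans = df[notnans]

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(df_notnans[icols], df_notnans[jcols],
                                                    train_size=0.75,
                                                    random_state=4)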

Use a multi-output regression based on a random forest regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split

regr_multirf = MultiOutputRegressor(RandomForestRegressor(max_depth=30,
                                                          random_state=0))

# Fit on the train data
regr_multirf.fit(X_train, y_train)

# Check the prediction score (R^2, averaged over the six targets)
score = regr_multirf.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))

Out: The prediction score on the test data is 96.76%
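The score reported above is a coefficient of determination (R^2) rather than an accuracy percentage. To see how well each individual target is predicted, one option is sklearn.metrics.r2_score with multioutput="raw_values". The following is a self-contained sketch that regenerates the same kind of dummy data; the names model, y_pred, per_target and overall are illustrative, and no particular numbers are implied.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Same style of dummy data as in the answer
X, y = make_regression(n_samples=1000, n_features=6,
                       n_informative=3, n_targets=6,
                       tail_strength=0.5, noise=0.02,
                       shuffle=True, coef=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=4)

model = MultiOutputRegressor(RandomForestRegressor(max_depth=30,
                                                   random_state=0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# One R^2 value per target column (array of length 6)
per_target = r2_score(y_test, y_pred, multioutput="raw_values")
# Unweighted mean of the per-target values
overall = r2_score(y_test, y_pred, multioutput="uniform_average")

for name, value in zip(["j0", "j1", "j2", "j3", "j4", "j5"], per_target):
    print("{}: R^2 = {:.3f}".format(name, value))
print("mean R^2 = {:.3f}".format(overall))
```

A low value for one target next to a high average suggests that target is poorly explained by the i columns and that its predictions should be treated with more caution.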

Predict the NaN rows

df_nans = df.loc[~notnans].copy()
df_nans[jcols] = regr_multirf.predict(df_nans[icols])
df_nans

Out:

           i0        i1        i2        i3        i4        i5         j0  \
0   -0.211620 -0.177927 -0.062205  0.267484 -0.317349  0.000341 -41.254983   
10   1.138974 -1.326378  0.123960  0.982841  0.273958  0.414307  46.406351   
100 -0.682390 -1.431414 -0.328235 -0.886463  1.212363 -0.577676  94.971966   

            j1         j2         j3         j4         j5  
0   -18.197513 -31.029952 -14.749244  -5.990595  -9.296744  
10   67.915628  59.750032  15.612843  10.177314  38.226387  
100  -3.724223  65.630692  44.636895 -14.372414  11.947185
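If the predictions should end up back in the full table, one possible approach is to write them into a copy of the original frame and flag which rows were predicted, so that predicted values stay distinguishable from observed ones. This is a sketch under assumptions: the df_filled frame and the j_predicted flag column are hypothetical names, and the smaller data set and n_estimators=50 are chosen only to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

icols = ["i0", "i1", "i2", "i3", "i4", "i5"]
jcols = ["j0", "j1", "j2", "j3", "j4", "j5"]

# Dummy data with three rows whose j values are missing
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       n_targets=6, noise=0.02, random_state=0)
df = pd.concat([pd.DataFrame(X, columns=icols),
                pd.DataFrame(y, columns=jcols)], axis=1)
df.loc[[0, 10, 100], jcols] = np.nan

# Fit on the complete rows only
notnans = df[jcols].notna().all(axis=1)
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50,
                                                   random_state=0))
model.fit(df.loc[notnans, icols], df.loc[notnans, jcols])

# Work on a copy so the original frame keeps its NaNs
df_filled = df.copy()
df_filled["j_predicted"] = ~notnans  # True where j values come from the model
df_filled.loc[~notnans, jcols] = model.predict(df.loc[~notnans, icols])

print(df_filled.loc[[0, 10, 100], jcols + ["j_predicted"]])
```

Keeping the flag column makes it easy to exclude predicted rows from later statistics if needed.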
