pandas: Using scikit-learn (sklearn), how to handle missing data for linear regression?

Question

Asked by O.rka

I tried the approach from "Use Scikit Learn to do linear regression on a time series pandas data frame" but couldn't get it to work for my data.

My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

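from sklearn.linear_model import LinearRegression

X = DataFrame_1
# Fit one model per column of DataFrame_2, using that column as the target
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').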

I used the format above so that the shapes of the matrices match up.

If posting DataFrame_2 would help, please comment below and I'll add it.

Answer 1

Accepted answer by maxymoo

You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

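# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")
# fit_transform expects 2-D input, so pass the Series y as a one-column frame
y_imputed = imputer.fit_transform(y.to_frame()).ravel()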
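
As a rough sketch of how imputation could slot into the loop from the question, the example below imputes every target column at once and then fits one model per column. The frames are random stand-ins (smaller than the real ones), and the names Y_imputed and models are made up for illustration; mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's frames (smaller, for illustration)
DataFrame_1 = pd.DataFrame(rng.normal(size=(40, 50)))
DataFrame_2 = pd.DataFrame(rng.normal(size=(40, 5)))
# Blank out roughly 20% of the target values to simulate missing data
DataFrame_2 = DataFrame_2.mask(rng.random(DataFrame_2.shape) < 0.2)

# Replace each NaN with the mean of its own column
imputer = SimpleImputer(strategy="mean")
Y_imputed = pd.DataFrame(
    imputer.fit_transform(DataFrame_2),
    index=DataFrame_2.index,
    columns=DataFrame_2.columns,
)

# Fit one regression per (imputed) target column, as in the question
X = DataFrame_1
models = {}
for col in Y_imputed.columns:
    models[col] = LinearRegression().fit(X, Y_imputed[col])
```

Wrapping the result back into a DataFrame keeps the original index and column labels, which fit_transform would otherwise discard by returning a plain NumPy array.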

Otherwise, you might want to build your model using only a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
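
One possible reading of this suggestion, sketched under assumptions: measure how much of each column is missing, keep only the columns below some cutoff, and for each kept column fit on just the rows where it is present. The frames, the column names, and the 0.25 cutoff are all hypothetical, and dropping rows per column is an extra step that goes beyond what the answer states.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's frames
DataFrame_1 = pd.DataFrame(rng.normal(size=(40, 50)))
DataFrame_2 = pd.DataFrame(rng.normal(size=(40, 5)), columns=list("abcde"))
DataFrame_2.loc[:29, "a"] = np.nan  # column "a" is mostly missing
DataFrame_2.loc[:4, "b"] = np.nan   # column "b" has a few gaps

# Fraction of missing values in each column
missing_fraction = DataFrame_2.isna().mean()

# Keep columns below a cutoff; 0.25 is an arbitrary illustrative threshold
usable_cols = missing_fraction[missing_fraction < 0.25].index

models = {}
for col in usable_cols:
    y = DataFrame_2[col]
    rows = y.notna()  # use only the rows where this column is present
    models[col] = LinearRegression().fit(DataFrame_1[rows], y[rows])
```

Compared with dropna(how="any") on the whole frame, this discards rows separately for each column, so one sparse column does not shrink the data available to the others.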

Answer 2

Answer by Foreever

If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.

df.fillna(df.mean(), inplace=True)
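
For a concrete picture of what that line does, here is a small sketch on a made-up two-column frame. It uses the non-inplace form so the original frame stays available for comparison; the values and the name filled are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# df.mean() skips NaN by default and returns one mean per column;
# fillna then matches those means to columns by label
filled = df.fillna(df.mean())

print(filled)
```

Here the gap in "a" becomes 2.0 (the mean of 1.0 and 3.0) and the gap in "b" becomes 6.0 (the mean of 4.0 and 8.0).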
