Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33113947/

Time: 2020-09-14 00:01:57  Source: igfitidea

Using scikit-learn (sklearn), how to handle missing data for linear regression?

python pandas machine-learning scikit-learn linear-regression

Asked by O.rka

I tried the approach from "Use Scikit Learn to do linear regression on a time series pandas data frame", but couldn't get it to work for my data.

My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).
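To see why so few rows survive, it can help to count missing values per row and per column before dropping anything. The sketch below uses a small made-up frame; the name df and its values are illustrative and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DataFrame_2: 4 rows, 3 columns, scattered NaNs.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# dropna(how="any") removes every row that has at least one NaN.
complete_rows = df.dropna(how="any")

# Counting NaNs shows where the loss comes from.
nan_per_column = df.isna().sum()
nan_per_row = df.isna().sum(axis=1)
```

Here each column is missing only one value, yet three of the four rows are dropped, because the NaNs fall in different rows.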

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]  # one target column; may contain NaN
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the format above to get the shapes of the matrices to match up.

If posting DataFrame_2 would help, please comment below and I'll add it.

Accepted answer by maxymoo

You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

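# Imputer was deprecated in scikit-learn 0.20 and removed in 0.22;
# SimpleImputer in sklearn.impute is its replacement.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()  # default strategy="mean": fill NaN with the column mean
y_imputed = imputer.fit_transform(y)  # y must be 2-D, e.g. DataFrame_2 or DataFrame_2[[col]]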
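As a rough end-to-end sketch of this approach, the following imputes every target column with its column mean and then fits one regression per column, as in the question's loop. The random data, the smaller shapes, and the names X, Y, and models are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up stand-ins for DataFrame_1 (predictors) and DataFrame_2 (targets).
X = pd.DataFrame(rng.normal(size=(40, 10)), columns=[f"x{i}" for i in range(10)])
Y = pd.DataFrame(rng.normal(size=(40, 5)), columns=[f"y{i}" for i in range(5)])
Y.iloc[::7, 2] = np.nan  # knock out a few values in one target column

# SimpleImputer expects 2-D input and returns a NumPy array,
# so wrap the result back into a DataFrame with the original labels.
imputer = SimpleImputer(strategy="mean")
Y_imputed = pd.DataFrame(imputer.fit_transform(Y), index=Y.index, columns=Y.columns)

# Fit one model per target column.
models = {}
for col in Y_imputed.columns:
    models[col] = LinearRegression().fit(X, Y_imputed[col])
```

Each entry in models can then be used for prediction, for example models["y2"].predict(X).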

Otherwise, you might want to build your model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values?
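One way to pick such a subset is to rank columns by their fraction of missing values and keep those under a cutoff. In the sketch below, the frame, the 0.25 threshold, and all names are hypothetical; a sensible cutoff depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DataFrame_2.
df = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],
    "half_empty": [np.nan, np.nan, 3.0, 4.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of NaN per column (mean of a boolean mask).
nan_fraction = df.isna().mean()

# Keep columns whose NaN fraction does not exceed an arbitrary cutoff.
max_nan_fraction = 0.25
kept = df.loc[:, nan_fraction <= max_nan_fraction]

# With the sparsest columns gone, dropping incomplete rows discards less data.
usable = kept.dropna(how="any")
```

In this toy frame, dropping "half_empty" first leaves three complete rows instead of one.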

Answer by Foreever

If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.

df.fillna(df.mean(), inplace=True)
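A small made-up frame shows the effect; the column names and values here are illustrative only. Each NaN is replaced by the mean of the non-missing values in its own column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# df.mean() computes per-column means, skipping NaN by default.
# Assigning the result back is equivalent to inplace=True here.
df = df.fillna(df.mean())
```

After this, column "a" holds 1.0, 2.0, 3.0 and column "b" holds 10.0, 20.0, 15.0.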