Python: Turning a Pandas DataFrame into an Array and Evaluating a Multiple Linear Regression Model

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/28334091/

Time: 2020-08-19 03:07:17  Source: igfitidea

Turning a Pandas DataFrame into an array and evaluating a Multiple Linear Regression model

Tags: python, numpy, pandas, machine-learning

Asked by Batuhan B

I am trying to evaluate a multiple linear regression model. I have a data set like this:

[image: the data set]

This data set has 157 rows and 54 columns.

I need to predict the ground_truth value from the articles. I will add to my multiple linear model the 7 article columns from en_Amantadine to en_Common.

I have code for multiple linear regression:

from sklearn.linear_model import LinearRegression
X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]  # need to modify for my problem
y = [[7], [9], [13], [17.5], [18]]  # need to modify
model = LinearRegression()
model.fit(X, y)

My problem is that I cannot extract the data for the X and y variables from my DataFrame. In my code, X should be:

X = [[4984, 94, 2837, 857, 356, 1678, 29901],
     [4428, 101, 4245, 906, 477, 2313, 34176],
      ....
     ]
y = [[3.135999], [2.53356] ....]

I cannot convert the DataFrame to this type of structure. How can I do this?

Any help is appreciated.

Accepted answer by JAB

You can turn the DataFrame into a matrix by calling the as_matrix method directly on the DataFrame object. You might need to specify the columns you are interested in, for example X = df[['x1','x2','X3']].as_matrix(), where the different x's are the column names.

For the y variable, you can use y = df['ground_truth'].values to get an array.
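
For the case in the question, where the predictors are a contiguous run of columns, a label-based slice with .loc may save listing every name. The sketch below is only an illustration: the DataFrame contents and the column names other than en_Amantadine, en_Common and ground_truth are made up, and reshape(-1, 1) is used to get the column-vector layout for y that the question asks for.

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame; the values and the
# middle column names (en_B, en_C) are invented for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'en_Amantadine': [4984, 4428, 5001],
    'en_B': [94, 101, 87],
    'en_C': [2837, 4245, 3310],
    'en_Common': [29901, 34176, 31000],
    'ground_truth': [3.135999, 2.53356, 2.9],
})

# .loc slices columns by label and includes both endpoints
X = df.loc[:, 'en_Amantadine':'en_Common'].values

# reshape(-1, 1) turns the 1-D array into one value per row: [[3.135999], [2.53356], ...]
y = df['ground_truth'].values.reshape(-1, 1)

print(X.shape)  # (3, 4)
print(y.shape)  # (3, 1)
```

scikit-learn's LinearRegression also accepts a 1-D y, so the reshape is optional unless the nested-list layout is specifically wanted.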

Here is an example with some randomly generated data:

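import numpy as np
import pandas as pd
# create a 5x5 DataFrame of random integers in [0, 100]
# (np.random.randint replaces the deprecated np.random.random_integers; its upper bound is exclusive)
df = pd.DataFrame(np.random.randint(0, 101, (5, 5)), columns=['X1', 'X2', 'X3', 'X4', 'y'])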

Calling as_matrix() on df returns a numpy.ndarray object:

X = df[['X1','X2','X3','X4']].as_matrix()

Calling values returns a numpy.ndarray from a pandas Series:

y = df['y'].values

Note: you might get a warning saying: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. (as_matrix was indeed removed in pandas 1.0.)

To fix it, use values instead of as_matrix, as shown below:

X = df[['X1','X2','X3','X4']].values
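
Putting the pieces together, a minimal end-to-end sketch might look like the following. It is not part of the original answer: the data is synthetic, generated with a fixed seed, and the target is built as an exact linear combination of the features (coefficients chosen arbitrarily) so the fit can be checked by eye. It uses DataFrame.to_numpy(), which the pandas documentation recommends over .values in recent versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 20 rows of random integers in [0, 100]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 101, size=(20, 4)),
                  columns=['X1', 'X2', 'X3', 'X4'])

# Made-up target: an exact linear function of the four features
df['y'] = 2 * df['X1'] - df['X2'] + 0.5 * df['X3'] + 3 * df['X4'] + 1

# Convert the selected columns and the target to NumPy arrays
X = df[['X1', 'X2', 'X3', 'X4']].to_numpy()
y = df['y'].to_numpy()

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # close to [2, -1, 0.5, 3]
print(model.intercept_)  # close to 1
```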

Answer by Tanmoy

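from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# broken_df is the DataFrame holding the feature columns and the ground_truth column
y = broken_df.ground_truth.values
X = broken_df.drop('ground_truth', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print(linreg.score(X_test, y_test))  # R^2 on the held-out set
# classification_report only supports classification targets, so a regression metric is used here instead
print(mean_squared_error(y_test, y_pred))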
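
Because the target here is continuous, regression metrics are the usual way to judge the fit. The sketch below is an illustration under stated assumptions: broken_df is not shown in the answer, so synthetic data with the question's shape (157 rows, 7 predictors) and made-up coefficients stands in for it. It reports R², mean squared error and mean absolute error on a held-out split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 157 rows, 7 predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(157, 7))
true_coef = np.array([1.5, -2.0, 0.0, 0.7, 3.0, -1.0, 0.25])  # made-up coefficients
y = X @ true_coef + rng.normal(scale=0.1, size=157)           # linear signal plus small noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

r2 = r2_score(y_test, y_pred)              # same value as linreg.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)   # average squared error
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
print(r2, mse, mae)
```

With only 157 rows, a single train/test split can be noisy; sklearn.model_selection.cross_val_score is one option for a more stable estimate.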