Python 使用 Pandas 数据帧查找均方根误差

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/33458959/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 13:23:43  来源:igfitidea点击:

Finding Root Mean Squared Error with Pandas dataframe

pythonpandas

提问by Zaynaib Giwa

I am trying to calculate the root mean squared error in from a pandas data frame. I have checked out previous links on stacked overflow such as Root mean square error in pythonand the scikit learn documentation http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.htmlI was hoping someone out there would shed some light on what I am doing wrong. Here is the dataset. Here is my code.

我正在尝试从 Pandas 数据框中计算均方根误差。我已经检查了以前关于堆栈溢出的链接,例如python 中的均方根误差和 scikit 学习文档http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html我希望有人出来这会说明我做错了什么。这是数据集。这是我的代码。

import pandas as pd
import numpy as np
sales = pd.read_csv("home_data.csv")

from sklearn.cross_validation import train_test_split
train_data,test_data = train_test_split(sales,train_size=0.8)

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)
#print the y intercept
print(lm.intercept_)
#print the coefficents
print(lm.coef_)

lm.predict(300)



from math import sqrt
from sklearn.metrics import mean_squared_error
y_true=train_data.price.loc[0:5,]
test_data=test_data[['price']].reset_index()
y_pred=test_data.price.loc[0:5,]
predicted =y_pred.as_matrix()
actual= y_true.as_matrix()
mean_squared_error(actual, predicted)

EDIT

编辑

So this is what worked for me. I had to transform the test dataset values for sqft living from row to column.

所以这对我有用。我不得不将 sqft living 的测试数据集值从行转换为列。

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)

New code

新代码

test_X = test_data.sqft_living.values
print(test_X)
print(np.shape(test_X))
print(len(test_X))
test_X = np.reshape(test_X, [4323, 1])
print(test_X)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
MSE = mean_squared_error(y_true = test_data.price.values, y_pred = lm.predict(test_X))
MSE
MSE**(0.5)

回答by jakevdp

You're comparing test-set labels to training-set labels. I believe that what you actually want to do is compare test-set labels to predictedtest-set labels.

您正在将测试集标签与训练集标签进行比较。我相信您真正想做的是将测试集标签与预测的测试集标签进行比较。

For example:

例如:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

sales = pd.read_csv("home_data.csv")
train_data, test_data = train_test_split(sales,train_size=0.8)

# Train the model
X = train_data[['sqft_living']]
y = train_data.price
lm = LinearRegression()
lm.fit(X, y)

# Predict on the test data
X_test = test_data[['sqft_living']]
y_test = test_data.price
y_pred = lm.predict(X_test)

# Compute the root-mean-square
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)
# 260435.511036

Note that scikit-learn can in general handle Pandas DataFrames and Series inputs without explicit conversion to numpy arrays. The error in the code snippet in your question has to do with the fact that the two arrays passed to mean_squared_error()are different sizes.

请注意,scikit-learn 通常可以处理 Pandas DataFrames 和 Series 输入,而无需显式转换为 numpy 数组。您问题中代码片段中的错误与传递给的两个数组的mean_squared_error()大小不同这一事实有关。