Python: Finding the root mean squared error using a Pandas DataFrame

Question

Asked by Zaynaib Giwa

I am trying to calculate the root mean squared error from a pandas DataFrame. I have checked previous questions on Stack Overflow, such as "Root mean square error in python", and the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html). I was hoping someone could shed some light on what I am doing wrong. Here is the dataset. Here is my code.

import pandas as pd
import numpy as np
sales = pd.read_csv("home_data.csv")

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
train_data,test_data = train_test_split(sales,train_size=0.8)

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)
#print the y intercept
print(lm.intercept_)
#print the coefficients
print(lm.coef_)

#predict expects a 2D array of shape (n_samples, n_features)
lm.predict([[300]])

from math import sqrt
from sklearn.metrics import mean_squared_error
y_true=train_data.price.loc[0:5,]
test_data=test_data[['price']].reset_index()
y_pred=test_data.price.loc[0:5,]
predicted = y_pred.to_numpy()  # as_matrix() was removed in pandas 1.0
actual = y_true.to_numpy()
mean_squared_error(actual, predicted)

EDIT

So this is what worked for me. I had to reshape the sqft_living values of the test dataset from a row (a 1D array) into a column.

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)

New code

test_X = test_data.sqft_living.values
print(test_X)
print(np.shape(test_X))
print(len(test_X))
test_X = np.reshape(test_X, (-1, 1))  # column vector; -1 infers the row count (4323 here)
print(test_X)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
MSE = mean_squared_error(y_true = test_data.price.values, y_pred = lm.predict(test_X))
print(MSE)
print(MSE**0.5)  # RMSE is the square root of the MSE
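
The row-to-column step described in the edit can be sketched as follows. This is only an illustration: the small DataFrame and its values are made up and stand in for the real test split. It shows two ways to get the 2D shape (n_samples, n_features) that scikit-learn estimators expect, without hard-coding the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test split; the values are made up.
test_data = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})

# A single column pulled out as an array is 1D: shape (n,)
flat = test_data.sqft_living.values
print(flat.shape)    # (4,)

# Option 1: reshape to a column; -1 lets NumPy infer the number of rows
column = flat.reshape(-1, 1)
print(column.shape)  # (4, 1)

# Option 2: select with a list of column names to keep a 2D DataFrame
frame = test_data[["sqft_living"]]
print(frame.shape)   # (4, 1)
```

Option 2 avoids the conversion entirely, since the double brackets return a one-column DataFrame rather than a Series.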

Answer 1

Answered by jakevdp

You're comparing test-set labels to training-set labels. I believe that what you actually want to do is compare test-set labels to predicted test-set labels.

For example:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

sales = pd.read_csv("home_data.csv")
train_data, test_data = train_test_split(sales, train_size=0.8)

# Train the model
X = train_data[['sqft_living']]
y = train_data.price
lm = LinearRegression()
lm.fit(X, y)

# Predict on the test data
X_test = test_data[['sqft_living']]
y_test = test_data.price
y_pred = lm.predict(X_test)

# Compute the root-mean-square error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)
# 260435.511036
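
As a rough sketch of what the last step computes, the root-mean-square error can also be written out by hand with NumPy. The helper name rmse and the sample values below are made up for illustration; on the same inputs it should agree with np.sqrt(mean_squared_error(...)).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values, chosen so the arithmetic is easy to check by hand:
# errors are 6, -8, 0, 0 -> squares 36, 64, 0, 0 -> mean 25 -> root 5
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [4.0, 28.0, 30.0, 40.0]
print(rmse(actual, predicted))  # 5.0
```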

Note that scikit-learn can, in general, handle pandas DataFrame and Series inputs without explicit conversion to NumPy arrays. The error in the code snippet in your question comes from the fact that the two arrays passed to mean_squared_error() are different sizes.
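
Both points can be checked with a small sketch. The Series below hold made-up prices, and the shuffled index labels are an assumption meant to mimic rows left over after a random split. As far as I can tell, mean_squared_error compares values by position rather than by index label, and it raises a ValueError when the lengths differ.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Made-up prices; the index labels mimic rows left over after a random split.
y_test = pd.Series([200000.0, 350000.0, 500000.0], index=[7, 2, 9])
y_pred = pd.Series([210000.0, 340000.0, 510000.0])

# Series are accepted as-is; values are compared by position, not by index label
mse = mean_squared_error(y_test, y_pred)
print(mse)  # each error is +/-10000, so the mean squared error is 1e8

# Inputs of different lengths are rejected rather than silently truncated
try:
    mean_squared_error(y_test, y_pred.iloc[:2])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```

Because labels are ignored, slicing with .loc[0:5] on a shuffled split (as in the question) can return fewer rows than expected, which is one way the two inputs end up with different sizes.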
