Original question: http://stackoverflow.com/questions/29934083/
License: this content is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)
Asked by Dinosaur
I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error:
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know how to:
- visualize the result
- make predictions based on the result
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions get down-voted on Stack Overflow. Kindly take into account that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow, lest you harm the vibrancy of this discussion community.
Answered by Tommy
You really should have a look at the scikit-learn docs for the fit method.
For how to visualize a linear regression, play with the linear regression example in the scikit-learn docs. I'm guessing you haven't used IPython (now called Jupyter) much either, so you should definitely invest some time in learning it. It's a great tool for exploring data and machine learning. You can literally copy/paste the scikit-learn linear regression example into a notebook and run it.
For your specific problem with the fit method, the docs show that the format of the data you are passing in for your X values is wrong.
Per the docs, "X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code with this:
X = [[x] for x in data['c1'].values]
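To make the [n_samples, n_features] requirement concrete, here is a minimal sketch using a made-up array c1 as a stand-in for data['c1'].values. It only illustrates the shapes involved; reshape(-1, 1) is shown as an equivalent NumPy way to get the same layout, not as part of the original answer.

```python
import numpy as np

# Hypothetical stand-in for data['c1'].values: a 1-D array of 5 samples.
c1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# The list comprehension wraps each value in its own one-element row,
# giving n_samples rows with n_features = 1.
X_list = [[x] for x in c1]
X_from_list = np.array(X_list)

# reshape(-1, 1) builds the same column layout directly in NumPy.
X_reshaped = c1.reshape(-1, 1)

print(c1.shape)            # (5,)   -> 1-D, no feature axis
print(X_from_list.shape)   # (5, 1) -> 5 samples, 1 feature
print(np.array_equal(X_from_list, X_reshaped))  # True
```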
Answered by Scott
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data like this:
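import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
length = 10
# x is 0..9 as a column; y is x plus uniform noise in [0, 10)
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
# write both columns to test.csv with a 'c1,c2' header line
np.savetxt('test.csv', np.hstack([x, y]), delimiter=',', header='c1,c2', comments='', fmt='%.6f')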
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
import pandas as pd
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x)  # prints: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,), but fit() needs X to have shape (10, 1); see the sklearn docs. y is accepted as 1-D, but reshaping it the same way keeps the two arrays consistent. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
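If the number of rows is not known ahead of time, the reshape does not have to spell it out. A small sketch, assuming a made-up 1-D array in place of data.c1.values, of three equivalent ways to get a single-column 2-D array:

```python
import numpy as np

# Hypothetical 1-D input, shaped like data.c1.values in the answer.
x = np.arange(10, dtype=float)        # shape (10,)

col_explicit = x.reshape(len(x), 1)   # row count spelled out
col_inferred = x.reshape(-1, 1)       # -1 lets NumPy infer the row count
col_newaxis = x[:, np.newaxis]        # indexing form that adds a second axis

print(col_explicit.shape, col_inferred.shape, col_newaxis.shape)
```

All three produce the same (10, 1) array, so the choice is mostly a matter of taste; reshape(-1, 1) avoids carrying a separate length variable around.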
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See the sklearn linear regression example.
Answered by serv-inc
make predictions based on the result?
To predict:
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
These show the slope and intercept.
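As a rough illustration of how coef_ and intercept_ relate to predict, the sketch below fits made-up, noise-free data on the line y = 2x + 1 (the data and the X_new values are assumptions for the example, not from the question) and checks that predictions for unseen inputs match slope * x + intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data on the line y = 2x + 1.
X = np.arange(5, dtype=float).reshape(-1, 1)   # shape (5, 1)
Y = 2.0 * X.ravel() + 1.0                      # shape (5,)

lr = LinearRegression().fit(X, Y)

slope = lr.coef_[0]        # one coefficient per feature; here a single feature
intercept = lr.intercept_  # scalar because Y is 1-D

# Predict for inputs the model has not seen; they must also be 2-D.
X_new = np.array([[10.0], [20.0]])
predicted = lr.predict(X_new)

# predict() is equivalent to applying the fitted line by hand.
manual = slope * X_new.ravel() + intercept
print(np.allclose(predicted, manual))  # True
```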
Answered by Samrat Kishore
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
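The code above avoids the shape problem because of the double brackets in dataset[["mark1"]]. A minimal sketch of the difference, using a small made-up DataFrame in place of 1.csv (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for the contents of 1.csv.
dataset = pd.DataFrame({"mark1": [10, 20, 30], "mark2": [15, 25, 35]})

as_series = dataset["mark1"]    # single brackets: 1-D Series
as_frame = dataset[["mark1"]]   # double brackets: 2-D DataFrame with one column

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

Selecting with a list of column names keeps the result two-dimensional, so it can be passed to fit() without an explicit reshape.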
Answered by seralouk
I'm posting an answer that addresses exactly the error you got:
IndexError: tuple index out of range
Scikit-learn expects X to be 2D. Reshape X; reshaping Y the same way is optional but harmless.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
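The before/after pair above can be checked side by side. This is a sketch on made-up arrays rather than the asker's CSV; note that the exact exception depends on the scikit-learn version (the question shows IndexError, while recent releases raise a ValueError asking for a 2D array), so the example catches both.

```python
import numpy as np
from sklearn import linear_model

# Made-up 1-D arrays standing in for data['c1'].values and data['c2'].values.
X_1d = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])

# Passing a 1-D X fails; the exception type varies by scikit-learn version.
try:
    linear_model.LinearRegression().fit(X_1d, Y)
    failed = False
except (IndexError, ValueError) as exc:
    failed = True
    print(type(exc).__name__)

# Reshaping X into a single column makes the same call succeed.
model = linear_model.LinearRegression().fit(X_1d.reshape(-1, 1), Y)
print(model.coef_.shape)
```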