Original question: http://stackoverflow.com/questions/29748717/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Use Scikit Learn to do linear regression on a time series pandas data frame
Asked by Ivan
I'm trying to do a simple linear regression on a pandas data frame using scikit learn linear regressor. My data is a time series, and the pandas data frame has a datetime index:
value
2007-01-01 0.771305
2007-02-01 0.256628
2008-01-01 0.670920
2008-02-01 0.098047
Doing something as simple as
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(data.index, data['value'])
didn't work:
float() argument must be a string or a number
So I tried to create a new column with the dates to try to transform it:
data['date'] = data.index
data['date'] = pd.to_datetime(data['date'])
lr.fit(data['date'], data['value'])
but now I get:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
So the regressor can't handle datetime. I saw a bunch of ways to convert integer data to datetime, but couldn't find a way to convert from datetime to integer, for example.
What is the proper way to do this?
PS: I'm interested in using scikit because I'm planning on doing more stuff with it later, so no statsmodels for now.
Answered by TomAugspurger
You probably want something like the number of days since the start to be your predictor here. Assuming everything is sorted:
In [36]: X = (df.index - df.index[0]).days.to_numpy().reshape(-1, 1)
In [37]: y = df['value'].values
In [38]: linear_model.LinearRegression().fit(X, y)
Out[38]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
The exact units you use for the predictor don't really matter; they could be days or months. The coefficients and their interpretation will change so that everything works out to the same result. Also, notice that we needed reshape(-1, 1) so that X is in the expected format: a 2-D array with one row per sample and one column per feature.
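To make the point about units concrete, here is a minimal, self-contained sketch of the approach above. It rebuilds the four sample rows from the question, fits once with the predictor in days and once in weeks, and forecasts a value for a later date. The weeks rescaling, the variable names, and the forecast date are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data from the question: one 'value' column with a datetime index.
df = pd.DataFrame(
    {"value": [0.771305, 0.256628, 0.670920, 0.098047]},
    index=pd.to_datetime(["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"]),
)

# Predictor: whole days elapsed since the first observation,
# reshaped into a 2-D column of shape (n_samples, 1).
days = np.asarray((df.index - df.index[0]).days, dtype=float).reshape(-1, 1)
y = df["value"].to_numpy()

model_days = LinearRegression().fit(days, y)

# Same predictor expressed in weeks (an illustrative change of units).
weeks = days / 7.0
model_weeks = LinearRegression().fit(weeks, y)

# The slope scales with the unit, but the fitted values do not change.
pred_days = model_days.predict(days)
pred_weeks = model_weeks.predict(weeks)

# Forecasting for a new date applies the same transformation to that date.
# The date below is hypothetical, chosen only for illustration.
new_date = pd.Timestamp("2008-03-01")
new_x = np.array([[(new_date - df.index[0]).days]], dtype=float)
forecast = model_days.predict(new_x)

print(model_days.coef_, model_weeks.coef_)
print(forecast)
```

The weekly slope comes out as seven times the daily slope, while the two sets of fitted values agree, which is what "everything works out to the same result" means here.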

