pandas: linear regression for time series in Python (numpy or pandas)
Original source: http://stackoverflow.com/questions/32327471/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
linear regression for timeseries python (numpy or pandas)
Asked by I. F.
I am new to Python and programming in general, so forgive any simple mistakes or things that should be obvious.
What I am trying to do is quite simple: I just want to fit a linear trend (a first-degree polynomial) to a bunch of time series to see whether the slopes are positive or negative. Right now I am just trying to get it to work for one time series.
The problem: it seems that neither pandas nor numpy can do regressions on datetimes. My datetimes are not regularly spaced (generally one day per month, but not the same day), so I can't use the suggestion posed in "Linear Regression from Time Series Pandas".
My time series csv looks like:
StationName, year, month, day, depth, NO3-N, PO4-P, TotP, TotN,
Kvarnbacken (Savaran), 2003, 2, 25, 0.5, 46, 9, 14, 451
Kvarnbacken (Savaran), 2003, 3, 18, 0.5, 64, 15, 17, 310
Kvarnbacken (Savaran), 2003, 3, 31, 0.5, 76, 7, 19, 566
So far what I have is:
import datetime as dt
import numpy as np
import pandas as pd  # needed for pd.read_csv and pd.to_datetime below
from scipy import stats
# read in station csv file
data = pd.read_csv('Kvarnbacken (Savaran)_2003.csv')
data.head()
# set up dates to something python can recognize
data['date'] = pd.to_datetime(data.year*10000 + data.month*100 + data.day,
                              format='%Y%m%d')
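As an aside, pandas can also assemble the date column directly from the separate year, month and day columns. This is only a minimal sketch: it assumes the columns are named exactly `year`, `month` and `day`, and it re-types the three sample rows by hand instead of reading the CSV file.

```python
import pandas as pd

# The three sample rows from the question, typed in by hand
# (assumption: the real CSV yields the same column names).
data = pd.DataFrame({
    'year': [2003, 2003, 2003],
    'month': [2, 3, 3],
    'day': [25, 18, 31],
    'TotP': [14, 17, 19],
})

# pd.to_datetime accepts a frame that has year/month/day columns.
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])
print(data[['date', 'TotP']])
```

The result is an ordinary datetime64 column, so it runs into the same regression error described next; it only replaces the integer arithmetic used to build the date.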
I tried
slope, intercept, r_value, p_value, std_err = stats.linregress(data.date,
data.TotP)
and got the error TypeError: ufunc add cannot use operands with types dtype('
I also tried
coefP = np.polyfit(data.date, data.TotP, 1)
polyP = np.poly1d(coefP)
ys = polyP(data.date)
print('For P: coef, poly')
print(coefP)
print(polyP)
and got the same error.
I am guessing the easiest way around this is to count the days since my first measurement and then regress the total phosphorus concentration (TotP) on days_since, but I am not sure of the easiest way to do that, or whether there is another trick.
Answered by JohnE
You could convert the datetime to days in the following way.
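# .dt.days gives whole days as integers; older pandas also accepted
# .astype('timedelta64[D]'), which recent pandas versions reject
data['days_since'] = (data.date - pd.to_datetime('2003-02-25')).dt.days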
date days_since
0 2003-02-25 0
1 2003-03-18 21
2 2003-03-31 34
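The start date does not have to be hard-coded. As a minimal sketch, assuming the first measurement is simply the earliest date in the column, the same day counts can be derived from the data itself (the three dates below are the sample rows from the question):

```python
import pandas as pd

# Dates of the three sample rows from the question.
dates = pd.Series(pd.to_datetime(['2003-02-25', '2003-03-18', '2003-03-31']))

# Subtracting the earliest date gives a timedelta Series;
# .dt.days turns it into plain integer day counts.
days_since = (dates - dates.min()).dt.days
print(days_since.tolist())  # [0, 21, 34]
```

Because the day counts are ordinary integers, irregular spacing between measurements is not a problem for the regression.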
Now you should be able to regress as you did above.
slope, intercept, r_value, p_value, std_err = stats.linregress(data.days_since,
data.TotP)
slope, intercept
(0.1466591166477916, 13.977916194790488)
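The np.polyfit attempt from the question should work the same way once the x values are numeric. A small self-contained sketch using only the three sample rows shown in the question (the array names are illustrative, not from the original code):

```python
import numpy as np
from scipy import stats

days_since = np.array([0, 21, 34])  # days after 2003-02-25
tot_p = np.array([14, 17, 19])      # TotP for the same three rows

slope, intercept, r_value, p_value, std_err = stats.linregress(days_since, tot_p)

# np.polyfit returns coefficients highest degree first: [slope, intercept].
coef = np.polyfit(days_since, tot_p, 1)
trend = np.poly1d(coef)  # callable that evaluates the fitted line

print(slope, intercept)
print(coef)
print('increasing' if slope > 0 else 'decreasing or flat')
```

Both routes fit the same least-squares line, so the sign of `slope` (or of `coef[0]`) answers the positive-or-negative question.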
You might also want to consider other regression options such as the statsmodels package, especially if you'll be doing this sort of thing very often. (Note that in the formula y comes before x, the reverse of the argument order in linregress.)
import statsmodels.formula.api as smf
smf.ols( 'TotP ~ days_since', data=data ).fit().params
Intercept 13.977916
days_since 0.146659
That's just a fraction of the statsmodels output, by the way (use summary() instead of params to get the extra output).
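Since the question mentions wanting slopes for a bunch of time series, here is one possible way to repeat the fit per station. It is only a sketch: `trend_slope` is a hypothetical helper, the frame is assumed to hold `StationName`, `date` and `TotP` columns, and only the one station from the sample rows is included.

```python
import pandas as pd
from scipy import stats


def trend_slope(group):
    """Slope of TotP against days since the group's first measurement."""
    days = (group['date'] - group['date'].min()).dt.days
    return stats.linregress(days, group['TotP']).slope


# Sample rows from the question; more stations would simply add more rows.
data = pd.DataFrame({
    'StationName': ['Kvarnbacken (Savaran)'] * 3,
    'date': pd.to_datetime(['2003-02-25', '2003-03-18', '2003-03-31']),
    'TotP': [14, 17, 19],
})

# One slope per station, keyed by station name.
slopes = {name: trend_slope(group) for name, group in data.groupby('StationName')}
for name, slope in slopes.items():
    print(name, slope, 'positive' if slope > 0 else 'not positive')
```

Each station gets its own day-zero here, which keeps the slopes comparable even when stations started sampling on different dates.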

