Original question: http://stackoverflow.com/questions/41511945/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python pandas linear regression groupby
Asked by jeangelj
I am trying to run a linear regression on each group of a grouped pandas DataFrame:
This is the dataframe df:
group date value
A 01-02-2016 16
A 01-03-2016 15
A 01-04-2016 14
A 01-05-2016 17
A 01-06-2016 19
A 01-07-2016 20
B 01-02-2016 16
B 01-03-2016 13
B 01-04-2016 13
C 01-02-2016 16
C 01-03-2016 16
#import standard packages
import pandas as pd
import numpy as np
#import ML packages
from sklearn.linear_model import LinearRegression
#First, let's group the data by group
df_group = df.groupby('group')
#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1,'D')
Now I want to predict the value for each group for 01-10-2016.
I want to get to a new dataframe like this:
group 01-10-2016
A predicted value
B predicted value
C predicted value
The approach from "How to apply OLS from statsmodels to groupby" doesn't work:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df['date_delta']
    y = df['value']
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
I get the following error:
ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
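The warning above is about the shape of X: scikit-learn estimators expect a 2-D feature array of shape (n_samples, n_features), while selecting a single column gives a 1-D Series. The following is a minimal sketch of the difference, using a made-up three-row frame (the variable names X_1d, X_2d and X_np are illustrative only):

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as in the question.
df = pd.DataFrame({"date_delta": [0.0, 1.0, 2.0], "value": [16, 13, 13]})

X_1d = df["date_delta"]      # single brackets: a 1-D Series
X_2d = df[["date_delta"]]    # double brackets: a 2-D DataFrame with one column
X_np = df["date_delta"].to_numpy().reshape(-1, 1)  # same data as a 2-D NumPy array

print(X_1d.shape, X_2d.shape, X_np.shape)
```

Either of the two 2-D forms should be acceptable as the feature argument to fit.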
UPDATE:
I changed it to:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df[['date_delta']]
    y = df.value
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
and now I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Accepted answer by piRSquared
New Answer
def model(df, delta):
    y = df[['value']].values
    X = df[['date_delta']].values
    return np.squeeze(LinearRegression().fit(X, y).predict(delta))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)
    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)
    dd = (date - mn) / day
    return df.groupby('group').apply(model, delta=dd)
demo
group_predictions(df, '01-10-2016')
group
A 22.333333333333332
B 3.500000000000007
C 16.0
dtype: object
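To reproduce these numbers without scikit-learn, here is a self-contained sketch that swaps in np.polyfit (a degree-1 least-squares fit, not the LinearRegression call used in the answer). It assumes the dates are in month-day-year order; the function name predict_per_group is made up for illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
rows = [
    ("A", "01-02-2016", 16), ("A", "01-03-2016", 15), ("A", "01-04-2016", 14),
    ("A", "01-05-2016", 17), ("A", "01-06-2016", 19), ("A", "01-07-2016", 20),
    ("B", "01-02-2016", 16), ("B", "01-03-2016", 13), ("B", "01-04-2016", 13),
    ("C", "01-02-2016", 16), ("C", "01-03-2016", 16),
]
df = pd.DataFrame(rows, columns=["group", "date", "value"])
# Assumption: the date strings are month-day-year.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")


def predict_per_group(df, target_date):
    """Fit value ~ days-since-first-date per group and evaluate at target_date."""
    target = pd.to_datetime(target_date, format="%m-%d-%Y")
    day = pd.Timedelta(days=1)
    start = df["date"].min()
    work = df.assign(date_delta=(df["date"] - start) / day)
    target_delta = (target - start) / day

    out = {}
    for name, g in work.groupby("group"):
        # np.polyfit with degree 1 returns [slope, intercept].
        slope, intercept = np.polyfit(g["date_delta"], g["value"], 1)
        out[name] = slope * target_delta + intercept
    return pd.Series(out, name=target_date).rename_axis("group")


predictions = predict_per_group(df, "01-10-2016")
print(predictions)
```

The explicit loop is used instead of groupby(...).apply only to keep each step visible; a group with a single row would not have enough points for a line fit and would need separate handling.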
Old Answer
You're using LinearRegression incorrectly.
- You don't pass the data when you construct it and then again when you fit it. Just construct the class with no data, like this:
model = LinearRegression()
- Then fit it with model.fit(X, y).
But all that does is set values on the object stored in model. LinearRegression has no nice summary method. There is probably one somewhere, but the one I know is in statsmodels, so see below.
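To see what fit actually stores, here is a minimal sketch assuming scikit-learn is installed. The three-row frame reproduces group B from the question, and the variable names g, slope and intercept are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Group B from the question, with date_delta already computed in days.
g = pd.DataFrame({"date_delta": [0.0, 1.0, 2.0], "value": [16, 13, 13]})

model = LinearRegression()                 # construct with no data
model.fit(g[["date_delta"]], g["value"])   # 2-D features, 1-D target

# The fitted parameters live on the estimator as attributes.
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

These two attributes correspond to the date_delta and Intercept coefficients that statsmodels reports for the same group.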
Option 1: use statsmodels instead
from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: 0.652
Model: OLS Adj. R-squared: 0.565
Method: Least Squares F-statistic: 7.500
Date: Fri, 06 Jan 2017 Prob (F-statistic): 0.0520
Time: 10:48:17 Log-Likelihood: -9.8391
No. Observations: 6 AIC: 23.68
Df Residuals: 4 BIC: 23.26
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 14.3333 1.106 12.965 0.000 11.264 17.403
date_delta 1.0000 0.365 2.739 0.052 -0.014 2.014
==============================================================================
Omnibus: nan Durbin-Watson: 1.393
Prob(Omnibus): nan Jarque-Bera (JB): 0.461
Skew: -0.649 Prob(JB): 0.794
Kurtosis: 2.602 Cond. No. 5.78
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: 0.750
Model: OLS Adj. R-squared: 0.500
Method: Least Squares F-statistic: 3.000
Date: Fri, 06 Jan 2017 Prob (F-statistic): 0.333
Time: 10:48:17 Log-Likelihood: -3.2171
No. Observations: 3 AIC: 10.43
Df Residuals: 1 BIC: 8.631
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 15.5000 1.118 13.864 0.046 1.294 29.706
date_delta -1.5000 0.866 -1.732 0.333 -12.504 9.504
==============================================================================
Omnibus: nan Durbin-Watson: 3.000
Prob(Omnibus): nan Jarque-Bera (JB): 0.531
Skew: -0.707 Prob(JB): 0.767
Kurtosis: 1.500 Cond. No. 2.92
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: value R-squared: -inf
Model: OLS Adj. R-squared: -inf
Method: Least Squares F-statistic: -0.000
Date: Fri, 06 Jan 2017 Prob (F-statistic): nan
Time: 10:48:17 Log-Likelihood: 63.481
No. Observations: 2 AIC: -123.0
Df Residuals: 0 BIC: -125.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 16.0000 inf 0 nan nan nan
date_delta -3.553e-15 inf -0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 0.400
Prob(Omnibus): nan Jarque-Bera (JB): 0.333
Skew: 0.000 Prob(JB): 0.846
Kurtosis: 1.000 Cond. No. 2.62
==============================================================================
Answer by Wizytor
As a newbie I cannot comment, so I am writing this as a new answer. To solve this error:
Runtime Error: ValueError : Expected 2D array, got scalar array instead
you need to reshape the delta value in the return line of the model function:
return np.squeeze(LinearRegression().fit(X, y).predict(np.array(delta).reshape(1, -1)))
Credit goes to piRSquared.
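For context, here is a minimal sketch of how the reshaped call fits into a per-group prediction function, assuming scikit-learn is installed. The function name predict_at and the small frame (group A from the question, with date_delta precomputed) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_at(group_df, delta):
    """Fit value ~ date_delta on one group and predict at a scalar delta."""
    X = group_df[["date_delta"]].to_numpy()   # shape (n_samples, 1)
    y = group_df["value"].to_numpy()          # shape (n_samples,)
    query = np.array(delta).reshape(1, -1)    # scalar -> shape (1, 1)
    return float(LinearRegression().fit(X, y).predict(query)[0])


# Group A from the question, with date_delta in days since 01-02-2016.
group_a = pd.DataFrame({
    "date_delta": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "value": [16, 15, 14, 17, 19, 20],
})
prediction = predict_at(group_a, 8.0)  # 8 days after the first date is 01-10-2016
print(prediction)
```

The reshape turns the scalar into one sample with one feature, which is the 2-D input that predict expects.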