Python Pandas linear regression groupby

Question

Asked by jeangelj

I am trying to fit a linear regression for each group of a grouped pandas DataFrame:

This is the dataframe df:

  group      date      value
    A     01-02-2016     16 
    A     01-03-2016     15 
    A     01-04-2016     14 
    A     01-05-2016     17 
    A     01-06-2016     19 
    A     01-07-2016     20 
    B     01-02-2016     16 
    B     01-03-2016     13 
    B     01-04-2016     13 
    C     01-02-2016     16 
    C     01-03-2016     16 

#import standard packages
import pandas as pd
import numpy as np

#import ML packages
from sklearn.linear_model import LinearRegression

#First, let's group the data by group
df_group = df.groupby('group')

#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])  
df['date_delta'] = (df['date'] - df['date'].min())  / np.timedelta64(1,'D')

Now I want to predict the value for each group for 01-10-2016.

I want to get to a new dataframe like this:

group      01-10-2016
  A      predicted value
  B      predicted value
  C      predicted value

The approach from "How to apply OLS from statsmodels to groupby" doesn't work:

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df['date_delta'] 
      y = df['value']
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

I get the following error

ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

UPDATE:

I changed it to

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df[['date_delta']]
      y = df.value
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

and now I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Answer 1

Accepted answer by piRSquared

New Answer

def model(df, delta):
    y = df[['value']].values
    X = df[['date_delta']].values
    return np.squeeze(LinearRegression().fit(X, y).predict(delta))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)

    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)

    dd = (date - mn) / day

    return df.groupby('group').apply(model, delta=dd)

Demo

group_predictions(df, '01-10-2016')

group
A    22.333333333333332
B     3.500000000000007
C                  16.0
dtype: object

Old Answer

You're using LinearRegression wrong.

You don't call it with the data and then fit with the data. Just call the class like this
- model = LinearRegression()
then fit with
- model.fit(X, y)

But all that does is set values in the object stored in model. There is no nice summary method. There probably is one somewhere, but I know the one in statsmodels, so see below.
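The values that fit sets can still be read back from the fitted object. Here is a minimal sketch using group A from the question, where X holds the days since 01-02-2016 as a 2-D column. The variable names are mine, and coef_, intercept_ and score are the scikit-learn attributes and method I would reach for in place of a summary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Group A from the question: days since 01-02-2016 and the observed values.
X = np.array([0, 1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)  # 2-D: (n_samples, 1)
y = np.array([16, 15, 14, 17, 19, 20], dtype=float)

model = LinearRegression()  # the constructor takes options only, no data
model.fit(X, y)             # the data goes here

slope = model.coef_[0]         # one coefficient per feature
intercept = model.intercept_
r_squared = model.score(X, y)  # coefficient of determination on the training data
print(slope, intercept, r_squared)
```

These three numbers should agree with the coef column and R-squared that statsmodels reports for the first group.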

Option 1
Use statsmodels instead

from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())

                        OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     7.500
Date:                Fri, 06 Jan 2017   Prob (F-statistic):             0.0520
Time:                        10:48:17   Log-Likelihood:                -9.8391
No. Observations:                   6   AIC:                             23.68
Df Residuals:                       4   BIC:                             23.26
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3333      1.106     12.965      0.000        11.264    17.403
date_delta     1.0000      0.365      2.739      0.052        -0.014     2.014
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.393
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.461
Skew:                          -0.649   Prob(JB):                        0.794
Kurtosis:                       2.602   Cond. No.                         5.78
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     3.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):              0.333
Time:                        10:48:17   Log-Likelihood:                -3.2171
No. Observations:                   3   AIC:                             10.43
Df Residuals:                       1   BIC:                             8.631
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.5000      1.118     13.864      0.046         1.294    29.706
date_delta    -1.5000      0.866     -1.732      0.333       -12.504     9.504
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.000
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.531
Skew:                          -0.707   Prob(JB):                        0.767
Kurtosis:                       1.500   Cond. No.                         2.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -0.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):                nan
Time:                        10:48:17   Log-Likelihood:                 63.481
No. Observations:                   2   AIC:                            -123.0
Df Residuals:                       0   BIC:                            -125.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     16.0000        inf          0        nan           nan       nan
date_delta -3.553e-15        inf         -0        nan           nan       nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.400
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         2.62
==============================================================================
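
The summaries describe each fit but do not give the forecast the question asks for. Here is a minimal sketch of getting it from the same formula-based models. It assumes the dates are month-first (so 01-10-2016 is 8 days after the earliest date, 01-02-2016); the names target and predictions are mine.

```python
import pandas as pd
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")  # assumption: month-first dates
df["date_delta"] = (df["date"] - df["date"].min()) / pd.Timedelta(days=1)

# New data must use the same column name as the formula's right-hand side.
target = pd.DataFrame({"date_delta": [8.0]})  # 01-10-2016, 8 days after the first date

predictions = {}
for name, g in df.groupby("group"):
    results = ols("value ~ date_delta", data=g).fit()
    predictions[name] = float(results.predict(target).iloc[0])

print(pd.Series(predictions, name="01-10-2016"))
```

Group C has only two observations, so its line fits exactly and the prediction carries no usable uncertainty estimate, which is what the inf and nan entries in its summary reflect.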

Answer 2

Answered by Wizytor

As a newbie I cannot comment, so I will write this as a new answer. To solve this error:

Runtime Error: ValueError : Expected 2D array, got scalar array instead

you need to reshape the delta value in the return line of the model function from the accepted answer:

return np.squeeze(LinearRegression().fit(X, y).predict(np.array(delta).reshape(1, -1)))
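
Putting the accepted answer and this reshape together, a self-contained sketch might look like the following. The helper name predict_group, the copy of the input frame, the explicit month-first date format and the conversion to float are my own choices, not part of either answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict_group(g, delta):
    # Fit value ~ date_delta for one group and predict at a single delta.
    X = g[["date_delta"]].to_numpy()  # 2-D: (n_samples, 1)
    y = g["value"].to_numpy()
    fitted = LinearRegression().fit(X, y)
    # predict expects a 2-D array, so wrap the scalar as one row, one feature.
    return float(fitted.predict(np.array([[delta]]))[0])

def group_predictions(df, date, fmt="%m-%d-%Y"):
    df = df.copy()  # leave the caller's frame untouched
    df["date"] = pd.to_datetime(df["date"], format=fmt)
    origin = df["date"].min()
    day = pd.Timedelta(days=1)
    df["date_delta"] = (df["date"] - origin) / day
    delta = (pd.to_datetime(date, format=fmt) - origin) / day
    preds = {name: predict_group(g, delta) for name, g in df.groupby("group")}
    return pd.Series(preds, name=date).rename_axis("group")

df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})
preds = group_predictions(df, "01-10-2016")
print(preds)
```

Returning plain floats gives a float64 Series rather than the object dtype shown in the accepted answer's demo.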

Credit goes to piRSquared.
