Fixed effects in Python Pandas or Statsmodels

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24195432/

Time: 2020-08-19 04:10:45  Source: igfitidea

Fixed effect in Pandas or Statsmodels

Tags: python, pandas, regression, statsmodels

Asked by user3576212

Is there an existing function in Pandas or Statsmodels to estimate fixed effects (one-way or two-way)?

There used to be a function in Statsmodels but it seems discontinued. And in Pandas, there is something called plm, but I can't import it or run it using pd.plm().

Accepted answer by Karl D.

As noted in the comments, PanelOLS has been removed from Pandas as of version 0.20.0. So you really have three options:

  1. If you use Python 3 you can use linearmodels, as specified in the more recent answer: https://stackoverflow.com/a/44836199/3435183

  2. Just specify various dummies in your statsmodels specification, e.g. using pd.get_dummies. This may not be feasible if the number of fixed effects is large.

  3. Or do some groupby-based demeaning and then use statsmodels (this would work if you're estimating lots of fixed effects). Here is a barebones version of what you could do for one-way fixed effects:

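    import patsy
    import statsmodels.api as sm

    def areg(formula, data=None, absorb=None, cluster=None):
        # Build the response and design matrices from the formula
        y, X = patsy.dmatrices(formula, data, return_type='dataframe')

        # Within transform: subtract group means, add back the overall mean
        ybar = y.mean()
        y = y - y.groupby(data[absorb]).transform('mean') + ybar

        Xbar = X.mean()
        X = X - X.groupby(data[absorb]).transform('mean') + Xbar

        reg = sm.OLS(y, X)
        # Account for df loss from FE transform
        reg.df_resid -= (data[absorb].nunique() - 1)

        # Cluster the standard errors only when a cluster column is given
        if cluster is None:
            return reg.fit()
        return reg.fit(cov_type='cluster', cov_kwds={'groups': data[cluster].values})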
    
And here is what you can do if using an older version of Pandas:

An example with time fixed effects using pandas' PanelOLS (which is in the plm module). Note the import of PanelOLS:

>>> from pandas.stats.plm import PanelOLS
>>> df

                y    x
date       id
2012-01-01 1   0.1  0.2
           2   0.3  0.5
           3   0.4  0.8
           4   0.0  0.2
2012-02-01 1   0.2  0.7 
           2   0.4  0.5
           3   0.2  0.3
           4   0.1  0.1
2012-03-01 1   0.6  0.9
           2   0.7  0.5
           3   0.9  0.6
           4   0.4  0.5

Note, the dataframe must have a MultiIndex set; PanelOLS determines the time and entity effects based on the index:
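
If your data starts out flat, with one row per date/id pair, a sketch like the following should produce the (date, id) index layout shown above. The frame name flat is hypothetical; the values are the first rows of the example df.

```python
import pandas as pd

# Hypothetical flat (long-format) data: one row per date/id pair.
flat = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01"] * 2 + ["2012-02-01"] * 2),
    "id": [1, 2, 1, 2],
    "y": [0.1, 0.3, 0.2, 0.4],
    "x": [0.2, 0.5, 0.7, 0.5],
})

# Move the two key columns into a (date, id) MultiIndex and sort it.
df = flat.set_index(["date", "id"]).sort_index()
print(df)
```

Note that the index order matters and differs between libraries: the pandas example here has date as the outer level, while the linearmodels example in the other answer puts the entity on the outer level and time on the inner one.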

>>> reg  = PanelOLS(y=df['y'],x=df[['x']],time_effects=True)
>>> reg

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x>

Number of Observations:         12
Number of Degrees of Freedom:   4

R-squared:         0.2729
Adj R-squared:     0.0002

Rmse:              0.1588

F-stat (1, 8):     1.0007, p-value:     0.3464

Degrees of Freedom: model 3, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.3694     0.2132       1.73     0.1214    -0.0485     0.7872
---------------------------------End of Summary--------------------------------- 

Docstring:

PanelOLS(self, y, x, weights = None, intercept = True, nw_lags = None,
entity_effects = False, time_effects = False, x_effects = None,
cluster = None, dropped_dummies = None, verbose = False,
nw_overlap = False)

Implements panel OLS.

See ols function docs

This is another function (like fama_macbeth) where I believe the plan is to move this functionality to statsmodels.
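
The areg sketch earlier in this answer only covers one-way effects. For the two-way case asked about in the question, the same groupby idea can be extended if the panel is balanced: subtract the entity mean and the time mean, then add back the grand mean. The following is a minimal sketch under that assumption, on made-up data with hypothetical column names (id, t, x, y), cross-checked against a regression on a full set of dummies.

```python
import numpy as np
import pandas as pd

# Made-up balanced panel: 6 entities x 5 periods.
rng = np.random.default_rng(1)
n_id, n_t = 6, 5
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_id), n_t),
    "t": np.tile(np.arange(n_t), n_id),
    "x": rng.normal(size=n_id * n_t),
})
df["y"] = (1.5 * df["x"] + 0.3 * df["id"] - 0.2 * df["t"]
           + rng.normal(scale=0.1, size=len(df)))

def two_way_demean(s, data):
    """Subtract entity and time means, add back the grand mean.

    Only exact for balanced panels (every entity seen in every period).
    """
    return (s
            - s.groupby(data["id"]).transform("mean")
            - s.groupby(data["t"]).transform("mean")
            + s.mean())

y_dm = two_way_demean(df["y"], df)
x_dm = two_way_demean(df["x"], df)

# Slope from the demeaned data (single regressor, so a ratio of sums).
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

# Cross-check: regress y on x plus entity and time dummies.
D = pd.get_dummies(df[["id", "t"]], columns=["id", "t"],
                   drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), df["x"].to_numpy(), D.to_numpy()])
beta_dummies = float(np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][1])

print(beta_within, beta_dummies)
```

For unbalanced panels this one-shot formula is no longer exact, so either the dummy approach or a dedicated package is the safer route there.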

Answer by Kevin S

There is a package called linearmodels (https://pypi.org/project/linearmodels/) that has a fairly complete fixed effects and random effects implementation, including clustered standard errors. It does not use high-dimensional OLS to eliminate effects and so can be used with large data sets.

import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

# Outer is entity, inner is time
entity = list(map(chr,range(65,91)))
# Note: newer pandas versions spell the year-end frequency 'YE' instead of 'A'
time = list(pd.date_range('1-1-2014',freq='A', periods=4))
index = pd.MultiIndex.from_product([entity, time])
df = pd.DataFrame(np.random.randn(26*4, 2),index=index, columns=['y','x'])

# Entity fixed effects with standard errors clustered by entity
mod = PanelOLS(df.y, df.x, entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)

This produces the following output:

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                      y   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0109
No. Observations:                 104   R-squared (Within):               0.0029
Date:                Thu, Jun 29 2017   R-squared (Overall):             -0.0007
Time:                        23:52:28   Log-likelihood                   -125.69
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      0.2256
Entities:                          26   P-value                           0.6362
Avg Obs:                       4.0000   Distribution:                    F(1,77)
Min Obs:                       4.0000                                           
Max Obs:                       4.0000   F-statistic (robust):             0.1784
                                        P-value                           0.6739
Time periods:                       4   Distribution:                    F(1,77)
Avg Obs:                       26.000                                           
Min Obs:                       26.000                                           
Max Obs:                       26.000                                           

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
x              0.0573     0.1356     0.4224     0.6739     -0.2127      0.3273
==============================================================================

F-test for Poolability: 1.0903
P-value: 0.3739
Distribution: F(25,77)

Included effects: Entity

It also has a formula interface which is similar to statsmodels:

mod = PanelOLS.from_formula('y ~ x + EntityEffects', df)