Fama-MacBeth Regression in Python (Pandas or Statsmodels)
Source: http://stackoverflow.com/questions/24074481/
Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the source URL, and attribute it to the original authors (not me): StackOverflow
Asked by user3576212
Econometric Background
Fama-MacBeth regression is a procedure for running regressions on panel data (where there are N different individuals and each individual is observed over T periods, e.g. days, months, or years). So in total there are N x T observations. Note that it's OK if the panel is not balanced.
The Fama-MacBeth procedure first runs a cross-sectional regression for each period, i.e. it pools the N individuals together in a given period t, and does this for t = 1, ..., T. So in total T regressions are run. This gives a time series of coefficients for each independent variable, which can then be used for hypothesis tests. Usually the time-series average is taken as the final coefficient of each independent variable, and a t-statistic is used to test its significance.
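The two steps described above can be written out as follows. The notation (y for the dependent variable, x_k for the k-th regressor, gamma for the coefficients) is chosen here for illustration and does not come from the question; the standard error shown is the plain sample version, without any autocorrelation adjustment.

```latex
\begin{aligned}
y_{i,t} &= \gamma_{0,t} + \sum_{k=1}^{K} \gamma_{k,t}\, x_{k,i,t} + \varepsilon_{i,t},
  \qquad t = 1, \dots, T \\
\hat{\gamma}_k &= \frac{1}{T} \sum_{t=1}^{T} \hat{\gamma}_{k,t} \\
\operatorname{se}(\hat{\gamma}_k) &= \frac{s_k}{\sqrt{T}},
  \qquad s_k^2 = \frac{1}{T-1} \sum_{t=1}^{T} \left(\hat{\gamma}_{k,t} - \hat{\gamma}_k\right)^2 \\
t_k &= \frac{\hat{\gamma}_k}{\operatorname{se}(\hat{\gamma}_k)}
\end{aligned}
```

The first line is the cross-sectional regression run once per period; the remaining lines are the time-series mean, its standard error, and the resulting t-statistic.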
My Problem
My problem is implementing this in pandas. In the pandas source code I noticed there is a procedure called fama_macbeth, but I can't find any documentation for it.
The operation can also be done easily with groupby. Currently I am doing this:
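import statsmodels.formula.api as smf

def fmreg(data, formula):
    # Fit one cross-sectional OLS and keep only the second parameter (var1)
    return smf.ols(formula, data=data).fit().params[1]

res = df.groupby('date').apply(fmreg, 'ret~var1')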
This works: res is a Series indexed by date, and its values are params[1], the coefficient of var1. But now I want more independent variables, so I need to extract the coefficients of all of them, and I can't figure out how. I tried this:
def fmreg(data, formula):
    # Return the full parameter vector instead of a single coefficient
    return smf.ols(formula, data=data).fit().params

res = df.groupby('date').apply(fmreg, 'ret~var1+var2+var3')
This doesn't work. The desired result is that res is a dataframe indexed by date, with one column of coefficients for each of intercept, var1, var2 and var3.
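One way to get that layout without depending on how groupby.apply combines its results is to loop over the groups explicitly and collect one labelled coefficient Series per date. The sketch below is an illustration, not the asker's code: the panel is simulated, the helper name cross_section_coefs is hypothetical, and np.linalg.lstsq stands in for smf.ols so the block runs without statsmodels.

```python
import numpy as np
import pandas as pd

# Simulated panel: 6 dates x 20 entities (made-up data, for illustration only).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-31", periods=6, freq="D")
df = pd.DataFrame({
    "date": dates.repeat(20),
    "var1": rng.normal(size=120),
    "var2": rng.normal(size=120),
    "var3": rng.normal(size=120),
})
df["ret"] = 0.5 + 1.0 * df["var1"] - 2.0 * df["var2"] + rng.normal(scale=0.1, size=120)


def cross_section_coefs(group, yvar, xvars):
    """OLS of yvar on an intercept plus xvars for one date (hypothetical helper)."""
    X = np.column_stack([np.ones(len(group)), group[xvars].to_numpy()])
    coef, *_ = np.linalg.lstsq(X, group[yvar].to_numpy(), rcond=None)
    # Labelling the result is what lets pandas line the columns up across dates.
    return pd.Series(coef, index=["Intercept"] + xvars)


xvars = ["var1", "var2", "var3"]
# One Series per date becomes one column; transpose so dates form the index.
res = pd.DataFrame({
    date: cross_section_coefs(g, "ret", xvars)
    for date, g in df.groupby("date")
}).T
res.index.name = "date"
print(res)
```

Here res has one row per date and one column per coefficient, which is the shape the question asks for.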
I also checked statsmodels; it doesn't have a built-in procedure for this either.
Also, is there any package that can produce publication-quality regression tables, like outreg2 in Stata and texreg in R?
Thanks for your help!
Answered by Karl D.
An update to reflect the library situation for Fama-MacBeth as of Fall 2018: the fama_macbeth function has been gone from pandas for a while now. So what are your options?
If you're using Python 3, then you can use the Fama-MacBeth method in LinearModels: https://github.com/bashtage/linearmodels/blob/master/linearmodels/panel/model.py
If you're using Python 2 or just don't want to use LinearModels, then probably your best option is to roll your own.
For example, suppose you have the Fama-French industry portfolios in a panel like the following (and you've also computed some variables, like past beta or past returns, to use as your x-variables):
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import statsmodels.formula.api as smf
In [4]: df = pd.read_csv('industry.csv', parse_dates=['caldt'])
In [5]: df.query("caldt == '1995-07-01'")
Out[5]:
industry caldt ret beta r12to2 r36to13
18432 Aero 1995-07-01 6.26 0.9696 0.2755 0.3466
18433 Agric 1995-07-01 3.37 1.0412 0.1260 0.0581
18434 Autos 1995-07-01 2.42 1.0274 0.0293 0.2902
18435 Banks 1995-07-01 4.82 1.4985 0.1659 0.2951
Fama-MacBeth primarily involves computing the same cross-sectional regression model month by month, so you can implement it using a groupby. You can create a function that takes a dataframe (it will come from the groupby) and a patsy formula; it then fits the model and returns the parameter estimates. Here is a barebones version of how you could implement it. (Note this is what the original questioner tried to do a few years ago. I'm not sure why it didn't work, although it's possible that back then the params attribute of the statsmodels result object wasn't returning a pandas Series, so the return value needed to be converted to a Series explicitly. It does work fine in the current version of pandas, 0.23.4.)
def ols_coef(x, formula):
    # Fit one cross-sectional regression and return all coefficients as a Series
    return smf.ols(formula, data=x).fit().params

In [9]: gamma = (df.groupby('caldt')
                   .apply(ols_coef, 'ret ~ 1 + beta + r12to2 + r36to13'))
In [10]: gamma.head()
Out[10]:
Intercept beta r12to2 r36to13
caldt
1963-07-01 -1.497012 -0.765721 4.379128 -1.918083
1963-08-01 11.144169 -6.506291 5.961584 -2.598048
1963-09-01 -2.330966 -0.741550 10.508617 -4.377293
1963-10-01 0.441941 1.127567 5.478114 -2.057173
1963-11-01 3.380485 -4.792643 3.660940 -1.210426
Then just compute the mean, the standard error of the mean, and a t-test (or whatever statistics you want). Something like the following:
def fm_summary(p):
    # p holds one row of coefficients per period
    s = p.describe().T
    s['std_error'] = s['std'] / np.sqrt(s['count'])
    s['tstat'] = s['mean'] / s['std_error']
    return s[['mean', 'std_error', 'tstat']]
In [12]: fm_summary(gamma)
Out[12]:
mean std_error tstat
Intercept 0.754904 0.177291 4.258000
beta -0.012176 0.202629 -0.060092
r12to2 1.794548 0.356069 5.039896
r36to13 0.237873 0.186680 1.274230
Improving Speed
Using statsmodels for the regressions has significant overhead (particularly given that you only need the estimated coefficients). If you want better efficiency, you could switch from statsmodels to numpy.linalg.lstsq. Write a new function that does the OLS estimation, something like the following (notice I'm not doing anything like checking the rank of these matrices):
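def ols_np(data, yvar, xvar):
    # Least-squares fit; data[xvar] must already include a constant column
    # if an intercept is wanted
    gamma, _, _, _ = np.linalg.lstsq(data[xvar], data[yvar], rcond=None)
    return pd.Series(gamma)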
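The function above leaves two details to the caller: data[xvar] has to contain a constant column if an intercept is wanted, and the returned Series is indexed 0, 1, 2, ... rather than by variable name. A minimal usage sketch under those assumptions follows; the simulated panel, the column name const, and the variant name ols_np_labeled are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd


def ols_np_labeled(data, yvar, xvar):
    # Same least-squares call as ols_np, but the result is labelled by column name.
    gamma, _, _, _ = np.linalg.lstsq(
        data[xvar].to_numpy(), data[yvar].to_numpy(), rcond=None
    )
    return pd.Series(gamma, index=xvar)


# Simulated monthly panel: 24 months x 30 portfolios (made-up numbers).
rng = np.random.default_rng(1)
n_months, n_ind = 24, 30
panel = pd.DataFrame({
    "caldt": pd.date_range("2000-01-01", periods=n_months, freq="MS").repeat(n_ind),
    "beta": rng.normal(1.0, 0.3, n_months * n_ind),
    "r12to2": rng.normal(0.1, 0.2, n_months * n_ind),
})
panel["ret"] = 0.7 + 1.5 * panel["r12to2"] + rng.normal(scale=0.2, size=len(panel))
panel["const"] = 1.0  # explicit intercept column, since lstsq does not add one

xvar = ["const", "beta", "r12to2"]
# Selecting the needed columns keeps the grouping key out of the frames passed on.
gamma = (panel.groupby("caldt")[xvar + ["ret"]]
              .apply(ols_np_labeled, "ret", xvar))

# Time-series mean, standard error of the mean, and t-statistic per coefficient.
summary = pd.DataFrame({"mean": gamma.mean(), "std_error": gamma.sem()})
summary["tstat"] = summary["mean"] / summary["std_error"]
print(summary)
```

Labelling the Series costs almost nothing and makes the summary table readable; whether the speed-up over statsmodels matters depends on the number of periods.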
And if you're still using an older version of pandas, the following will work:
Here is an example of using the fama_macbeth function in pandas:
>>> df
y x
date id
2012-01-01 1 0.1 0.4
2 0.3 0.6
3 0.4 0.2
4 0.0 1.2
2012-02-01 1 0.2 0.7
2 0.4 0.5
3 0.2 0.1
4 0.1 0.0
2012-03-01 1 0.4 0.8
2 0.6 0.1
3 0.7 0.6
4 0.4 -0.1
Notice the structure. The fama_macbeth function expects the y-var and x-vars to have a multi-index with the date as the first level and the stock/firm/entity id as the second level:
>>> fm = pd.fama_macbeth(y=df['y'],x=df[['x']])
>>> fm
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ x + intercept
# betas : 3
----------------------Summary of Estimated Coefficients------------------------
Variable Beta Std Err t-stat CI 2.5% CI 97.5%
(x) -0.0227 0.1276 -0.18 -0.2728 0.2273
(intercept) 0.3531 0.0842 4.19 0.1881 0.5181
--------------------------------End of Summary---------------------------------
Note that just printing fm calls fm.summary:
>>> fm.summary
----------------------Summary of Fama-MacBeth Analysis-------------------------
Formula: Y ~ x + intercept
# betas : 3
----------------------Summary of Estimated Coefficients------------------------
Variable Beta Std Err t-stat CI 2.5% CI 97.5%
(x) -0.0227 0.1276 -0.18 -0.2728 0.2273
(intercept) 0.3531 0.0842 4.19 0.1881 0.5181
--------------------------------End of Summary---------------------------------
Also, note that the fama_macbeth function automatically adds an intercept (unlike the statsmodels routines). The x-var also has to be a dataframe, so if you pass just one column you need to pass it as df[['x']].
If you don't want an intercept you have to do:
>>> fm = pd.fama_macbeth(y=df['y'],x=df[['x']],intercept=False)
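Since pd.fama_macbeth is no longer available in current pandas, the toy panel above can be run through the groupby approach instead. The following is a sketch, not the removed function's implementation: the helper name fm_coefs is hypothetical, and dividing by T rather than T - 1 in the standard deviation (ddof=0) is an assumption, chosen because it appears to reproduce the standard errors printed in the summary above.

```python
import numpy as np
import pandas as pd

# Rebuild the 3-date x 4-id toy panel with a (date, id) MultiIndex.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2012-01-01", "2012-02-01", "2012-03-01"]), [1, 2, 3, 4]],
    names=["date", "id"],
)
df = pd.DataFrame(
    {"y": [0.1, 0.3, 0.4, 0.0, 0.2, 0.4, 0.2, 0.1, 0.4, 0.6, 0.7, 0.4],
     "x": [0.4, 0.6, 0.2, 1.2, 0.7, 0.5, 0.1, 0.0, 0.8, 0.1, 0.6, -0.1]},
    index=index,
)


def fm_coefs(group):
    """Cross-sectional OLS of y on x plus an intercept (hypothetical helper)."""
    X = np.column_stack([group["x"].to_numpy(), np.ones(len(group))])
    coef, *_ = np.linalg.lstsq(X, group["y"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["x", "intercept"])


# One regression per date, giving one row of coefficients per date.
betas = df.groupby(level="date").apply(fm_coefs)

# Assumption: population standard deviation (ddof=0), see the note above.
fm = pd.DataFrame({
    "Beta": betas.mean(),
    "Std Err": betas.std(ddof=0) / np.sqrt(len(betas)),
})
fm["t-stat"] = fm["Beta"] / fm["Std Err"]
print(fm.round(4))
```

Rounded to four decimals, the coefficients should come out at roughly -0.0227 for x and 0.3531 for the intercept, in line with the summary above. With the default ddof=1 the standard errors would be somewhat larger.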
Answered by D.J. P.
EDIT: New Library
An updated library exists which can be installed via the following command:
pip install finance-byu
Documentation here: https://fin-library.readthedocs.io/en/latest/
The new library includes Fama-MacBeth regression implementations and a Regtable class that can be helpful for reporting results.
This page in the documentation outlines the Fama Macbeth functions: https://fin-library.readthedocs.io/en/latest/fama_macbeth.html
There are three implementations: one that is very similar to Karl D.'s implementation above with numpy's linear algebra functions, one that uses joblib for parallelization to increase performance when there are a large number of time periods in the data, and one that uses numba for optimization and shaves off an order of magnitude on small data sets.
Here is an example with a small simulated data set as in the documentation:
>>> from finance_byu.fama_macbeth import fama_macbeth, fama_macbeth_parallel, fm_summary, fama_macbeth_numba
>>> import pandas as pd
>>> import time
>>> import numpy as np
>>>
>>> n_jobs = 5
>>> n_firms = 1.0e2
>>> n_periods = 1.0e2
>>>
>>> def firm(fid):
...     f = np.random.random((int(n_periods),4))
...     f = pd.DataFrame(f)
...     f['period'] = f.index
...     f['firmid'] = fid
...     return f
>>> df = [firm(i) for i in range(int(n_firms))]
>>> df = pd.concat(df).rename(columns={0:'ret',1:'exmkt',2:'smb',3:'hml'})
>>> df.head()
ret exmkt smb hml period firmid
0 0.766593 0.002390 0.496230 0.992345 0 0
1 0.346250 0.509880 0.083644 0.732374 1 0
2 0.787731 0.204211 0.705075 0.313182 2 0
3 0.904969 0.338722 0.437298 0.669285 3 0
4 0.121908 0.827623 0.319610 0.455530 4 0
>>> result = fama_macbeth(df,'period','ret',['exmkt','smb','hml'],intercept=True)
>>> result.head()
intercept exmkt smb hml
period
0 0.655784 -0.160938 -0.109336 0.028015
1 0.455177 0.033941 0.085344 0.013814
2 0.410705 -0.084130 0.218568 0.016897
3 0.410537 0.010719 0.208912 0.001029
4 0.439061 0.046104 -0.084381 0.199775
>>> fm_summary(result)
mean std_error tstat
intercept 0.506834 0.008793 57.643021
exmkt 0.004750 0.009828 0.483269
smb -0.012702 0.010842 -1.171530
hml 0.004276 0.010530 0.406119
>>> %timeit fama_macbeth(df,'period','ret',['exmkt','smb','hml'],intercept=True)
123 ms ± 117 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fama_macbeth_parallel(df,'period','ret',['exmkt','smb','hml'],intercept=True,n_jobs=n_jobs,memmap=False)
146 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fama_macbeth_numba(df,'period','ret',['exmkt','smb','hml'],intercept=True)
5.04 ms ± 5.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note: Turning off the memmap makes for fair comparison without generating new data at each run. With the memmap, the parallel implementation would simply pull cached results.
Here are a couple of simple uses of the table class, also with simulated data:
>>> from finance_byu.regtables import Regtable
>>> import pandas as pd
>>> import statsmodels.formula.api as smf
>>> import numpy as np
>>>
>>>
>>> nobs = 1000
>>> df = pd.DataFrame(np.random.random((nobs,3))).rename(columns={0:'age',1:'bmi',2:'hincome'})
>>> df['age'] = df['age']*100
>>> df['bmi'] = df['bmi']*30
>>> df['hincome'] = df['hincome']*100000
>>> df['hincome'] = pd.qcut(df['hincome'],16,labels=False)
>>> df['rich'] = df['hincome'] > 13
>>> df['gender'] = np.random.choice(['M','F'],nobs)
>>> df['race'] = np.random.choice(['W','B','H','O'],nobs)
>>>
>>> regformulas = ['bmi ~ age',
...                'bmi ~ np.log(age)',
...                'bmi ~ C(gender) + np.log(age)',
...                'bmi ~ C(gender) + C(race) + np.log(age)',
...                'bmi ~ C(gender) + rich + C(gender)*rich + C(race) + np.log(age)',
...                'bmi ~ -1 + np.log(age)',
...                'bmi ~ -1 + C(race) + np.log(age)']
>>> reg = [smf.ols(f,df).fit() for f in regformulas]
>>> tbl = Regtable(reg)
>>> tbl.render()
>>> df2 = pd.DataFrame(np.random.random((nobs,10)))
>>> df2.columns = ['t0_vw','t4_vw','et_vw','t0_ew','t4_ew','et_ew','mktrf','smb','hml','umd']
>>> regformulas2 = ['t0_vw ~ mktrf',
...                 't0_vw ~ mktrf + smb + hml',
...                 't0_vw ~ mktrf + smb + hml + umd',
...                 't4_vw ~ mktrf',
...                 't4_vw ~ mktrf + smb + hml',
...                 't4_vw ~ mktrf + smb + hml + umd',
...                 'et_vw ~ mktrf',
...                 'et_vw ~ mktrf + smb + hml',
...                 'et_vw ~ mktrf + smb + hml + umd',
...                 't0_ew ~ mktrf',
...                 't0_ew ~ mktrf + smb + hml',
...                 't0_ew ~ mktrf + smb + hml + umd',
...                 't4_ew ~ mktrf',
...                 't4_ew ~ mktrf + smb + hml',
...                 't4_ew ~ mktrf + smb + hml + umd',
...                 'et_ew ~ mktrf',
...                 'et_ew ~ mktrf + smb + hml',
...                 'et_ew ~ mktrf + smb + hml + umd'
...                 ]
>>> regnames = ['Small VW','','',
...             'Large VW','','',
...             'Spread VW','','',
...             'Small EW','','',
...             'Large EW','','',
...             'Spread EW','',''
...             ]
>>> reg2 = [smf.ols(f,df2).fit() for f in regformulas2]
>>>
>>> tbl2 = Regtable(reg2,orientation='horizontal',regnames=regnames,sig='coeff',intercept_name='alpha',nobs=False,rsq=False,stat='se')
>>> tbl2.render()
Produces the following:


The documentation for the Regtable class is here: https://byu-finance-library-finance-byu.readthedocs.io/en/latest/regtables.html
These tables can be exported to LaTeX for easy incorporation into writing:
tbl.to_latex()
Answered by Ricardo Bindi
A quick and dirty solution that fixes the problem and lets you keep using what you were already using.
It worked for me.
def fmreg(data, formula):
    # Slice with [:] so the full coefficient Series is returned
    return smf.ols(formula, data=data).fit().params[:]

res = df.groupby('date').apply(fmreg, 'ret~var1+var2+var3')

