Converting statsmodels summary object to Pandas Dataframe
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/51734180/
Asked by Sagun Kayastha
I am doing multiple linear regression with statsmodels.formula.api (ver 0.9.0) on Windows 10. After fitting the model and requesting the summary with the following lines, I get the summary as a summary object.
X_opt = X[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= y, exog= X_opt).fit()
regressor_OLS.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.0
Date:                Wed, 08 Aug 2018   Prob (F-statistic):           4.53e-30
Time:                        00:46:48   Log-Likelihood:                -525.39
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.012e+04   6572.353      7.626      0.000    3.69e+04    6.34e+04
x1             0.8057      0.045     17.846      0.000       0.715       0.897
x2            -0.0268      0.051     -0.526      0.602      -0.130       0.076
x3             0.0272      0.016      1.655      0.105      -0.006       0.060
==============================================================================
Omnibus:                       14.838   Durbin-Watson:                   1.282
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.442
Skew:                          -0.949   Prob(JB):                     2.21e-05
Kurtosis:                       5.586   Cond. No.                     1.40e+06
==============================================================================
I want to do backward elimination on the P values at a significance level of 0.05. For this I need to remove the predictor with the highest P value and run the code again.
I would like to know whether there is a way to extract the P values from the summary object, so that I can run a loop with a conditional statement and find the significant variables without repeating the steps manually.
Thank you.
Answered by ZaxR
The answer from @Michael B works well, but requires "recreating" the table. The table itself is directly available from the summary().tables attribute. Each table in this attribute (which is a list of tables) is a SimpleTable, which has methods for outputting different formats. We can then read any of those formats back as a pd.DataFrame:
import pandas as pd
import statsmodels.api as sm

model = sm.OLS(y, x)
results = model.fit()
results_summary = results.summary()

# Note that tables is a list. The table at index 1 is the "core" table.
# Additionally, read_html puts dfs in a list, so we want index 0.
# On recent pandas versions, wrap the string in io.StringIO(...) before passing it to read_html.
results_as_html = results_summary.tables[1].as_html()
pd.read_html(results_as_html, header=0, index_col=0)[0]
Answered by Michael B
Store your model fit as a variable results, like so:
import statsmodels.api as sm
model = sm.OLS(y,x)
results = model.fit()
Then create a function like the one below:
import pandas as pd

def results_summary_to_dataframe(results):
    '''Take the result of a statsmodels results table and transform it into a dataframe.'''
    pvals = results.pvalues
    coeff = results.params
    # Wrapping in a DataFrame keeps [0] and [1] as the lower and upper bounds
    # even when the model was fit on NumPy arrays and conf_int() returns an ndarray.
    conf_int = pd.DataFrame(results.conf_int())
    conf_lower = conf_int[0]
    conf_higher = conf_int[1]

    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher
                               })

    # Reordering...
    results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
    return results_df
You can further explore all the attributes of the results object by printing dir(results), then add the ones you need to the function and the df accordingly.
Answered by Daniel Zhou
An easy solution takes just one line of code:
LRresult = (result.summary2().tables[1])
This will give you a DataFrame object:
type(LRresult)
pandas.core.frame.DataFrame
To get the significant variables and run the test again:
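import statsmodels.formula.api as smf

# 'P>|z|' is the P value column of a logit fit (an OLS fit uses 'P>|t|').
# [1:] skips the first entry, assumed here to be the intercept.
newlist = list(LRresult[LRresult['P>|z|'] <= 0.05].index)[1:]
myform1 = 'binary_Target' + ' ~ ' + ' + '.join(newlist)
M1_test2 = smf.logit(formula=myform1, data=myM1_1)
result2 = M1_test2.fit(maxiter=200)
LRresult2 = (result2.summary2().tables[1])
LRresult2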
Answered by Abhishek Singh
You can write it as below. It is an easy fix and works in almost every case.
lr.summary2()
Answered by Griff
If you want the surrounding information, try the following:
import pandas as pd

dfs = {}
fs = fa_model.summary()
for item in fs.tables[0].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
for item in fs.tables[2].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
dfs = pd.Series(dfs)
Answered by Joop
The code below puts all the metrics into a dictionary accessible by key. The intermediate result is a DataFrame that you can use directly. I did not turn the coefficients into a dictionary, but you could apply a similar method with a two-level dict, dict[var][metric].
To make the keys easy to type, I converted some of the metric names into more easily typed versions. E.g. "Prob(Omnibus):" becomes prob_omnibus, so that you can access the value with res_dict['prob_omnibus'].
from io import StringIO

import numpy as np
import pandas as pd
import statsmodels.api as sm

res = sm.OLS(y, X).fit()
model_results_df = []
coefficient_df = None
for i, tab in enumerate(res.summary().tables):
    # StringIO avoids the literal-string deprecation in recent pandas versions.
    html = StringIO(tab.as_html())
    if i == 1:
        # The coefficient table has a header row and variable names in column 0.
        coefficient_df = pd.read_html(html, header=0, index_col=0)[0]
    else:
        # The other tables hold two (metric, value) pairs per row.
        df = pd.read_html(html)[0]
        model_results_df += [df.iloc[:, 0:2], df.iloc[:, 2:4]]

model_results_df = pd.DataFrame(np.concatenate(model_results_df), columns=['metric', 'value'])
model_results_df.dropna(inplace=True, axis=0)
model_results_df.metric = model_results_df.metric.apply(lambda x: x.lower().replace(' (', '_')
                                                        .replace('.', '').replace('(', '_')
                                                        .replace(')', '').replace('-', '_')
                                                        .replace(':', '').replace(' ', '_'))
res_dict = dict(zip(model_results_df.metric.values, model_results_df.value.values))
res_dict['f_statistic']
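For the two-level dict[var][metric] form mentioned above, one possible approach is DataFrame.to_dict with orient="index". The sketch below uses a small hand-built stand-in for coefficient_df (an assumption), filled with two rows of rounded values from the summary in the question.

```python
import pandas as pd

# Hypothetical stand-in for coefficient_df, using rounded values
# taken from the x1 and x2 rows of the summary in the question.
coefficient_df = pd.DataFrame(
    {"coef": [0.8057, -0.0268], "P>|t|": [0.000, 0.602]},
    index=["x1", "x2"],
)

# orient="index" gives {variable: {metric: value}}.
coef_dict = coefficient_df.to_dict(orient="index")
print(coef_dict["x2"]["P>|t|"])  # prints 0.602
```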