Converting a statsmodels summary object to a Pandas DataFrame

Question

Asked by Sagun Kayastha

I am doing multiple linear regression with statsmodels.formula.api (ver 0.9.0) on Windows 10. After fitting the model and requesting the summary with the following lines, I get the summary as a Summary object.

import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()


                          OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.0
Date:                Wed, 08 Aug 2018   Prob (F-statistic):           4.53e-30
Time:                        00:46:48   Log-Likelihood:                -525.39
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.012e+04   6572.353      7.626      0.000    3.69e+04    6.34e+04
x1             0.8057      0.045     17.846      0.000       0.715       0.897
x2            -0.0268      0.051     -0.526      0.602      -0.130       0.076
x3             0.0272      0.016      1.655      0.105      -0.006       0.060
==============================================================================
Omnibus:                       14.838   Durbin-Watson:                   1.282
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.442
Skew:                          -0.949   Prob(JB):                     2.21e-05
Kurtosis:                       5.586   Cond. No.                     1.40e+06
==============================================================================

I want to do backward elimination on the p-values at a significance level of 0.05. For this I need to remove the predictor with the highest p-value and run the code again.

I wanted to know if there is a way to extract the p-values from the summary object, so that I can run a loop with a conditional statement and find the significant variables without repeating the steps manually.

Thank you.

Answer 1

Answered by ZaxR

The answer from @Michael B works well, but requires "recreating" the table. The table itself is actually directly available from the summary().tables attribute. Each table in this attribute (which is a list of tables) is a SimpleTable, which has methods for outputting different formats. We can then read any of those formats back as a pd.DataFrame:

from io import StringIO

import pandas as pd
import statsmodels.api as sm

model = sm.OLS(y,x)
results = model.fit()
results_summary = results.summary()

# Note that tables is a list. The table at index 1 is the "core" table. Additionally, read_html puts dfs in a list, so we want index 0
results_as_html = results_summary.tables[1].as_html()
# Wrapping the HTML in StringIO avoids the deprecation warning recent pandas versions raise for literal HTML strings
pd.read_html(StringIO(results_as_html), header=0, index_col=0)[0]

Answer 2

Answered by Michael B

Store your model fit as a variable results, like so:

import statsmodels.api as sm
model = sm.OLS(y,x)
results = model.fit()

Then create a function like the one below:

import numpy as np
import pandas as pd

def results_summary_to_dataframe(results):
    '''Take a statsmodels results object and return its key statistics as a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
    # conf_int() is a DataFrame for pandas inputs but an ndarray for NumPy inputs,
    # so convert to an array and slice by column to handle both cases
    conf_int = np.asarray(results.conf_int())
    conf_lower = conf_int[:, 0]
    conf_higher = conf_int[:, 1]

    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                                })

    # Reorder the columns
    results_df = results_df[["coeff","pvals","conf_lower","conf_higher"]]
    return results_df

You can further explore all the attributes of the results object by using dir() to print them, then add them to the function and DataFrame accordingly.

Answer 3

Answered by Daniel Zhou

An easy solution is just one line of code:

LRresult = (result.summary2().tables[1])

This will give you a dataframe object:

type(LRresult)

pandas.core.frame.DataFrame

To get the significant variables and run the test again (this example is a logit model, so the p-value column is named 'P>|z|'; for OLS it is 'P>|t|'):

import statsmodels.formula.api as smf

# Keep predictors with p <= 0.05; [1:] skips the first row, assumed to be the Intercept
newlist = list(LRresult[LRresult['P>|z|']<=0.05].index)[1:]
myform1 = 'binary_Target' + ' ~ ' + ' + '.join(newlist)

M1_test2 = smf.logit(formula=myform1,data=myM1_1)

result2 = M1_test2.fit(maxiter=200)
LRresult2 = (result2.summary2().tables[1])
LRresult2
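The snippet above depends on the author's own data (myM1_1, binary_Target). As a self-contained illustration of the same idea, here is a sketch on synthetic OLS data. The data, the variable names, and the trick of looking up the p-value column by its "P>" prefix (so the same code covers both 'P>|t|' and 'P>|z|') are assumptions made for this example. The intercept is dropped by name rather than by position.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): only x1 has a real effect on y
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=200)

result = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
coef_table = result.summary2().tables[1]   # a regular pandas DataFrame

# Find the p-value column whatever the model type ('P>|t|' or 'P>|z|')
pcol = next(c for c in coef_table.columns if c.startswith("P>"))

# Keep significant terms and drop the intercept by name, not by position
significant = list(
    coef_table.index[coef_table[pcol] <= 0.05].drop("Intercept", errors="ignore")
)

# Refit on the significant terms only (fall back to an intercept-only model)
formula = "y ~ " + (" + ".join(significant) if significant else "1")
result2 = smf.ols(formula, data=df).fit()
print(formula)
print(result2.summary2().tables[1])
```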

Answer 4

Answered by Abhishek Singh

You can write it as below. It is an easy fix and works almost every time.

lr.summary2()

Answer 5

Answered by Griff

If you want the surrounding information, try the following:

import pandas as pd
dfs = {}
fs = fa_model.summary()
for item in fs.tables[0].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
for item in fs.tables[2].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
dfs = pd.Series(dfs)

Answer 6

Answered by Joop

The code below puts all the metrics into a dictionary accessible by key. The intermediate result is actually a DataFrame you can use. I did not make the coefficients into a dictionary, but you can apply a similar method, two levels deep: dict[var][metric].

In order to make the keys easy to type, I converted some of the metric names into more easily typed versions. E.g. "Prob(Omnibus):" becomes prob_omnibus such that you can access the value by res_dict['prob_omnibus'].

from io import StringIO

import numpy as np
import pandas as pd
import statsmodels.api as sm

res = sm.OLS(y, X).fit()
model_results_df = []
coefficient_df = None
for i, tab in enumerate(res.summary().tables):
    # Wrap in StringIO: recent pandas versions deprecate passing literal HTML strings
    html = StringIO(tab.as_html())
    if i == 1:
        # Table 1 is the coefficient table
        coefficient_df = pd.read_html(html, header=0, index_col=0)[0]
    else:
        # Tables 0 and 2 hold two (metric, value) pairs per row
        df = pd.read_html(html)[0]
        model_results_df += [df.iloc[:,0:2], df.iloc[:,2:4]]

model_results_df = pd.DataFrame(np.concatenate(model_results_df), columns=['metric', 'value'])
model_results_df.dropna(inplace=True, axis=0)
# Normalize the metric names, e.g. "Prob(Omnibus):" -> "prob_omnibus"
model_results_df.metric = model_results_df.metric.apply(lambda x : x.lower().replace(' (', '_')
                                                        .replace('.', '').replace('(', '_')
                                                        .replace(')', '').replace('-', '_')
                                                        .replace(':', '').replace(' ', '_'))

res_dict = dict(zip(model_results_df.metric.values, model_results_df.value.values))
res_dict['f_statistic']
