Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/16705598/
Python 2.7 - statsmodels - formatting and writing summary output
Asked by DMML
I'm doing logistic regression on Mac OS X Lion, using pandas 0.11.0 for data handling and statsmodels 0.4.3 for the actual regression.
I'm going to be running ~2,900 different logistic regression models and need the results written to a CSV file and formatted in a particular way.
Currently, the only approach I'm aware of is print result.summary(), which prints the results (as follows) to the shell:
                           Logit Regression Results
==============================================================================
Dep. Variable:      death_death   No. Observations:                       9752
Model:                    Logit   Df Residuals:                           9747
Method:                     MLE   Df Model:                                  4
Date:          Wed, 22 May 2013   Pseudo R-squ.:                      -0.02672
Time:                  22:15:05   Log-Likelihood:                      -5806.9
converged:                 True   LL-Null:                             -5655.8
                                  LLR p-value:                           1.000
===============================================================================
                    coef    std err          z      P>|z|    [95.0% Conf. Int.]
-------------------------------------------------------------------------------
age_age5064      -0.1999      0.055     -3.619      0.000     -0.308     -0.092
age_age6574      -0.2553      0.053     -4.847      0.000     -0.359     -0.152
sex_female       -0.2515      0.044     -5.765      0.000     -0.337     -0.166
stage_early      -0.1838      0.041     -4.528      0.000     -0.263     -0.104
access           -0.0102      0.001    -16.381      0.000     -0.011     -0.009
===============================================================================
I will also need the odds ratios, which are computed by print np.exp(result.params) and printed in the shell like this:
age_age5064 0.818842
age_age6574 0.774648
sex_female 0.777667
stage_early 0.832098
access 0.989859
dtype: float64
What I need is for each of these to be written to a CSV file as one very long row, like the following (I am not sure at this point whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):
`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`
I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.
I am familiar with the csv module in Python, and am becoming more familiar with pandas. I'm not sure whether this information could be formatted and stored in a pandas DataFrame and then written to a file with to_csv once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing each row as its model completes (using the csv module) would also be fine.
UPDATE:
So, I looked further at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think inheriting from this class to create another class, in which some of the methods/operators are changed, might be the way to get the formatting I require. I have very little experience doing this, and will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!
Here is the site where the classes are laid out: statsmodels results class
Accepted answer by Josef
There is no premade table of parameters and their result statistics currently available.
Essentially, you need to stack all the results yourself. Whether you do that in a list, a numpy array or a pandas DataFrame depends on what is more convenient for you.
For example, if I want one numpy array per model that holds llf and the results shown in the summary parameter table, then I could use
import numpy as np

res_all = []
for res in results:
    # conf_int() has one row per parameter; transpose to unpack the two bound columns
    low, upp = np.asarray(res.conf_int()).T
    res_all.append(np.concatenate(([res.llf], res.params, res.tvalues,
                                   res.pvalues, low, upp)))
But it might be better to align with pandas, depending on what structure you have across models.
You could write a helper function that takes all the results from the results instance and concatenates them in a row.
(I'm not sure what the most convenient way of writing to CSV by rows is.)
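A helper of the kind described above could look like the following minimal sketch. It assumes each model was fit on pandas data, so that params, bse, tvalues and pvalues are indexed by parameter name; the function names flatten_result and write_rows are hypothetical, and the column naming simply follows the layout requested in the question.

```python
import csv
import math
from collections import OrderedDict


def flatten_result(res):
    """Flatten one fitted result into an ordered mapping of column name -> value."""
    row = OrderedDict()
    row["Log-Likelihood"] = float(res.llf)
    names = list(res.params.keys())  # parameter labels, e.g. "age_age5064"
    for name in names:
        row[name + "_coef"] = float(res.params[name])
        row[name + "_std_err"] = float(res.bse[name])
        row[name + "_z"] = float(res.tvalues[name])
        row[name + "_p>|z|"] = float(res.pvalues[name])
    for name in names:
        # odds ratio = exp(coefficient)
        row[name + "_odds_ratio"] = math.exp(row[name + "_coef"])
    return row


def write_rows(rows, fh):
    """Write a header plus one line per model to an open text file handle."""
    # Collect the union of column names in first-seen order, since
    # different models may have different parameters.
    columns = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)
    writer = csv.DictWriter(fh, fieldnames=columns, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

In a real run, the rows would presumably be built as [flatten_result(r) for r in results] and passed to write_rows together with a file opened for writing. Columns missing from a given model are left empty rather than dropped.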
edit:
Here is an example that stores the regression results in a DataFrame:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21
The loop is on line 159.
summary() and similar code outside of statsmodels, for example http://johnbeieler.org/py_apsrtable/ for combining several results, are oriented towards printing, not towards storing variables.
Answer by Atendra
- results.params : coefficients
- results.pvalues : p-values
BTW, you can use dir(results) to list all the attributes of an object.
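As a rough illustration of the dir() tip, the sketch below filters the listing down to public names and separates plain data attributes from methods. The helper name public_attributes is hypothetical, and nothing in it is specific to statsmodels.

```python
def public_attributes(obj):
    """Return (data_names, method_names) for an object's public attributes."""
    data, methods = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder names
        try:
            value = getattr(obj, name)
        except Exception:
            # some attributes are computed on access and may fail; skip those
            continue
        if callable(value):
            methods.append(name)
        else:
            data.append(name)
    return data, methods
```

Calling it on a fitted results object should give two sorted lists of names, which makes it easier to spot candidates such as params or pvalues among the many entries dir() returns.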
Answer by Afflatus
I found this formulation to be a little more straightforward. You can add or remove columns by following the pattern of the existing ones (pvals, coeff, conf_lower, conf_higher).
import pandas as pd  # can be left out if already imported

def results_summary_to_dataframe(results):
    '''Take a statsmodels results object and return its parameter table as a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
    # conf_int() returns a DataFrame when the model was fit on pandas data;
    # columns 0 and 1 hold the lower and upper bounds
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]

    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher
                               })

    # Reorder the columns
    results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
    return results_df
Answer by swu4
write_path = '/my/path/here/output.csv'
with open(write_path, 'w') as f:
    f.write(result.summary().as_csv())

