pandas Python 2.7 - statsmodels - formatting and writing summary output

Question

Asked by DMML

I'm doing logistic regression using pandas 0.11.0 (data handling) and statsmodels 0.4.3 to do the actual regression, on Mac OS X Lion.

I'm going to be running ~2,900 different logistic regression models and need the results output to csv file and formatted in a particular way.

Currently, I'm only aware of doing print result.summary(), which prints the results (as follows) to the shell:

 Logit Regression Results                           
  ==============================================================================
 Dep. Variable:            death_death   No. Observations:                 9752
 Model:                          Logit   Df Residuals:                     9747
 Method:                           MLE   Df Model:                            4
 Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
 Time:                        22:15:05   Log-Likelihood:                -5806.9
 converged:                       True   LL-Null:                       -5655.8
                                         LLR p-value:                     1.000
 ===============================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
 -------------------------------------------------------------------------------
 age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
 age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
 sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
 stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
 access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
 ===============================================================================

I will also need the odds ratio, which is computed by print np.exp(result.params) and is printed in the shell as such:

age_age5064    0.818842
age_age6574    0.774648
sex_female     0.777667
stage_early    0.832098
access         0.989859
dtype: float64

What I need is for each of these to be written to a csv file in the form of a very long row like the following (I am not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):

`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`

I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.

I am familiar with the csv module in Python, and am becoming more familiar with pandas. I'm not sure whether this info could be formatted and stored in a pandas dataframe and then written to a file using to_csv once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing them as each model is completed is also fine (using the csv module).

UPDATE:

So, I was looking more at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think inheriting from this class to create another class, with some of the methods/operators changed, might be the way to go in order to get the formatting I require. I have very little experience doing this, and will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!

Here is the site where the classes are laid out: statsmodels results class

Answer 1

Accepted answer by Josef

There is no premade table of parameters and their result statistics currently available.

Essentially, you need to stack all the results yourself; whether you use a list, a numpy array, or a pandas DataFrame depends on what is more convenient for you.

For example, if I want one numpy array per model that holds llf followed by the columns of the summary parameter table, then I could use

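import numpy

res_all = []
for res in results:  # results: an iterable of fitted model results
    # conf_int() gives one (lower, upper) pair per parameter; split into two columns
    low, upp = numpy.asarray(res.conf_int()).T
    res_all.append(numpy.concatenate(([res.llf], res.params, res.tvalues,
                                      res.pvalues, low, upp)))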

But it might be better to align with pandas, depending on what structure you have across models.

You could write a helper function that takes all the results from the results instance and concatenates them in a row.
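
One possible shape for such a helper is sketched below. It is only an illustration, written for Python 3 and a current pandas rather than the Python 2.7 setup in the question. The function name `result_to_row` and the column labels are made up for this example, and it assumes the model was fit on pandas data so that `params`, `bse`, `tvalues`, `pvalues` and `conf_int()` come back labeled by variable name.

```python
import numpy as np
import pandas as pd


def result_to_row(res):
    """Flatten one fitted result into a single labeled row (a pandas Series).

    Assumes `res` exposes llf, params, bse, tvalues, pvalues and conf_int(),
    all labeled by variable name (the case when the model was fit on pandas data).
    """
    conf = res.conf_int()  # DataFrame with columns 0 (lower) and 1 (upper)
    stats = pd.DataFrame({
        "coef": res.params,
        "std_err": res.bse,
        "z": res.tvalues,
        "p>|z|": res.pvalues,
        "conf_lower": conf[0],
        "conf_upper": conf[1],
        "odds_ratio": np.exp(res.params),
    })
    row = {"Log-Likelihood": res.llf}
    # Build labels such as "age_age5064_coef", one per variable and statistic.
    for term, values in stats.iterrows():
        for stat_name, value in values.items():
            row["%s_%s" % (term, stat_name)] = value
    return pd.Series(row)


# Hypothetical usage, with `results` an iterable of fitted models:
# table = pd.DataFrame([result_to_row(res) for res in results])
# table.to_csv("all_models.csv", index=False)
```

Here the odds ratio sits next to each variable's other statistics instead of at the end of the row as in the question; reordering the columns of the final DataFrame would change that.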

(I'm not sure what's the most convenient for writing to csv by rows)
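
For writing row by row as each model finishes, the standard csv module may be enough. The sketch below is one way to do it, not a recommendation from the answer; the function name `append_result_row` and its parameters are invented for illustration, and it targets Python 3 (in Python 2.7 the file would be opened in binary append mode without the `newline` argument).

```python
import csv


def append_result_row(path, names, values, write_header=False):
    """Append one row of values to `path`, optionally preceded by a header row."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(names)
        writer.writerow(values)
```

Calling it with `write_header=True` for the first model only gives a file with one header line followed by one line per model. Appending after every fit keeps partial output on disk if a later model fails.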

edit:

Here is an example storing the regression results in a dataframe:

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21

The loop is on line 159.

summary() and similar code outside of statsmodels for combining several results, for example http://johnbeieler.org/py_apsrtable/, are oriented towards printing rather than storing variables.

Answer 2

Answer by Atendra

results.params : for coefficient
results.pvalues : for p-values

BTW, you can use dir(results) to find out all the attributes of an object.
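
Since dir() also lists private and special names, a small filter can make the output easier to scan. The snippet below is a sketch in Python 3; `public_attributes` and `DemoResult` are hypothetical names used only for illustration, with `DemoResult` standing in for a fitted results object.

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class DemoResult(object):
    """Hypothetical stand-in for a fitted results object."""
    llf = -5806.9
    params = [-0.1999, -0.2553]

    def conf_int(self):
        return [(-0.308, -0.092), (-0.359, -0.152)]


print(public_attributes(DemoResult()))  # ['conf_int', 'llf', 'params']
```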

Answer 3

Answer by Afflatus

I found this formulation to be a little more straightforward. You can add/subtract columns by following the syntax from the examples (pvals,coeff,conf_lower,conf_higher).

import pandas as pd     #This can be left out if already present...

def results_summary_to_dataframe(results):
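    '''Take a fitted statsmodels results object and return its summary statistics as a dataframe.'''
    pvals = results.pvalues
    coeff = results.params
    # conf_int() is indexed by column here, which assumes it returns a DataFrame
    # (the case when the model was fit on pandas data)
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]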

    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                                })

    #Reordering...
    results_df = results_df[["coeff","pvals","conf_lower","conf_higher"]]
    return results_df
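
With one such dataframe per model, the tables still need to be combined before writing a single file. One way to do that, sketched here under the assumption that the per-model tables are kept in a dict keyed by a model name, is pd.concat, which turns the dict keys into an outer index level. The function name `stack_model_tables` and the level names "model" and "term" are made up for this example.

```python
import pandas as pd


def stack_model_tables(tables):
    """Combine {model_name: per-model DataFrame} into one long table.

    The dict keys become the outer index level ("model"); each table's own
    index (the variable names) becomes the inner level ("term").
    """
    return pd.concat(tables, names=["model", "term"])


# Hypothetical usage:
# stacked = stack_model_tables({"model_0001": df1, "model_0002": df2})
# stacked.to_csv("all_models_long.csv")
```

This gives one line per model and variable instead of one very long row per model, which can be easier to filter later; to_csv writes both index levels as leading columns.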

Answer 4

Answer by swu4

write_path = '/my/path/here/output.csv'
with open(write_path, 'w') as f:
    f.write(result.summary().as_csv())

Answer 5

Answer by Jinhua Wang

There is actually a built-in method, described in the documentation:

f = open('csvfile.csv','w')
f.write(result.summary().as_csv())
f.close()

I believe this is a much easier (and cleaner) way to output the summaries to csv files.
