Using predict() on statsmodels.formula data with different column names using Python and Pandas

Notice: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/29020070/

Date: 2020-09-13 23:02:43  Source: igfitidea

python, numpy, pandas, statsmodels

Asked by kuzzooroo

I've got some regression results from running statsmodels.formula.api.ols. Here's a toy example:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

example_df = pd.DataFrame(np.random.randn(10, 3))
example_df.columns = ["a", "b", "c"]
fit = smf.ols('a ~ b', example_df).fit()

I'd like to apply the model to column c, but a naive attempt to do so doesn't work:

fit.predict(example_df["c"])

Here's the exception I get:

PatsyError: Error evaluating factor: NameError: name 'b' is not defined
    a ~ b
        ^

I can do something gross and create a new, temporary DataFrame in which I rename the column of interest:

example_df2 = pd.DataFrame(example_df["c"])
example_df2.columns = ["b"]
fit.predict(example_df2)

Is there a cleaner way to do this (short of switching to statsmodels.api instead of statsmodels.formula.api)?

Accepted answer by Josef

You can use a dictionary:

>>> fit.predict({"b": example_df["c"]})
array([ 0.84770672, -0.35968269,  1.19592387, -0.77487812, -0.98805215,
        0.90584753, -0.15258093,  1.53721494, -0.26973941,  1.23996892])
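
To make the dictionary approach concrete, here is a minimal self-contained sketch. The fixed seed and the names pred_dict, pred_renamed and manual are assumptions added for illustration; they are not part of the original answer. It also shows a one-line rename as an alternative to building the temporary DataFrame by hand, and checks both against the fitted intercept and slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fixed seed so the run is repeatable (an assumption; the question uses unseeded data).
rng = np.random.RandomState(0)
example_df = pd.DataFrame(rng.randn(10, 3), columns=["a", "b", "c"])
fit = smf.ols("a ~ b", example_df).fit()

# Supply column c under the name the formula expects ("b").
pred_dict = fit.predict({"b": example_df["c"]})

# Alternative: rename on the fly instead of creating a temporary frame by hand.
pred_renamed = fit.predict(example_df[["c"]].rename(columns={"c": "b"}))

# Both should match intercept + slope * c computed directly from the parameters.
manual = fit.params["Intercept"] + fit.params["b"] * example_df["c"]
print(np.allclose(np.asarray(pred_dict), np.asarray(manual)))
print(np.allclose(np.asarray(pred_renamed), np.asarray(manual)))
```

Depending on the statsmodels version, predict() may return a NumPy array or a pandas Series here, which is why the comparison converts everything with np.asarray first.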

Alternatively, create a NumPy array for the prediction, although that is much more complicated if there are categorical explanatory variables:

>>> import statsmodels.api as sm
>>> fit.predict(sm.add_constant(example_df["c"].values), transform=False)
array([ 0.84770672, -0.35968269,  1.19592387, -0.77487812, -0.98805215,
        0.90584753, -0.15258093,  1.53721494, -0.26973941,  1.23996892])

Answer by Primer

If you replace your fit definition with this line:

fit = smf.ols('example_df.a ~ example_df.b', example_df).fit()

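It should work, in the sense that the call below no longer raises. Note, though, that the formula now names example_df.b explicitly, so the prediction is likely computed from that column in the enclosing scope rather than from the column passed in; check the result against the dictionary approach above before relying on it.
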
fit.predict(example_df["c"])

array([-0.52664491, -0.53174346, -0.52172484, -0.52819856, -0.5253607 ,
       -0.52391618, -0.52800043, -0.53350634, -0.52362988, -0.52520823])