Difference in Differences in Python + Pandas

Question

Asked by pceccon

I'm trying to perform a Difference in Differences analysis (with panel data and fixed effects) using Python and Pandas. I have no background in economics; I'm just trying to filter the data and run the method I was told to use. As far as I understand, the basic diff-in-diffs model looks like this:

I.e., I am dealing with a multivariable model.
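
For reference, the textbook two-group, two-period diff-in-diffs regression is usually written as below. This is the standard specification, not necessarily the exact equation the asker had in mind; the names post, treat and the subscripts are illustrative. It lines up with the R call quoted later in the question, where post93 plays the role of post, anykids the role of treat, and p93kids.interaction their product.

```latex
y_{it} = \beta_0 + \beta_1\,\mathrm{post}_t + \beta_2\,\mathrm{treat}_i
       + \beta_3\,(\mathrm{post}_t \times \mathrm{treat}_i) + \varepsilon_{it}
```

The coefficient \beta_3 on the interaction is the diff-in-diffs estimate.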

Here is a simple example in R:

https://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/

As can be seen, the regression takes as input one dependent variable and three sets of observations.

My input data looks like this:

    Name                            Permits_13  Score_13  Permits_14  Score_14  Permits_15  Score_15
0   P.S. 015 ROBERTO CLEMENTE             12.0       284          22       279          32       283
1   P.S. 019 ASHER LEVY                   18.0       296          51       301          55       308
2   P.S. 020 ANNA SILVER                   9.0       294           9       290          10       293
3   P.S. 034 FRANKLIN D. ROOSEVELT         3.0       294           4       292           1       296
4   P.S. 064 ROBERT SIMON                  3.0       287          15       288          17       291
5   P.S. 110 FLORENCE NIGHTINGALE          0.0       313           3       306           4       308
6   P.S. 134 HENRIETTA SZOLD               4.0       290          12       292          17       288
7   P.S. 137 JOHN L. BERNSTEIN             4.0       276          12       273          17       274
8   P.S. 140 NATHAN STRAUS                13.0       282          37       284          59       284
9   P.S. 142 AMALIA CASTRO                 7.0       290          15       285          25       284
10  P.S. 184M SHUANG WEN                   5.0       327          12       327           9       327

Through some research I found that this is the way to use fixed effects and panel data with Pandas:

Fixed effect in Pandas or Statsmodels

I performed some transformations to get multi-indexed data:

import numpy
import pandas

# Year-end timestamps for 2013, 2014 and 2015
rng = pandas.to_datetime(['2013-12-31', '2014-12-31', '2015-12-31'])
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
# Stack the yearly (permits, score) pairs in the same year order as the index
d1 = df[['Permits_13', 'Score_13']].to_numpy()
d2 = df[['Permits_14', 'Score_14']].to_numpy()
d3 = df[['Permits_15', 'Score_15']].to_numpy()
data = numpy.concatenate((d1, d2, d3), axis=0)
# Name the columns x (permits) and y (score) so that s['x'] and s['y'] exist
s = pandas.DataFrame(data, index=index, columns=['x', 'y'])
s = s.astype('float')
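
The same wide-to-long reshape can probably be written more compactly with pandas.wide_to_long. The following is a minimal sketch, assuming the column names follow the Permits_YY / Score_YY pattern shown above; the frame name wide, the result name panel, and the x / y renaming are illustrative choices, and only three schools from the table are included to keep it short.

```python
import pandas as pd

# Three schools from the question's table, in the same wide layout.
wide = pd.DataFrame({
    "Name": ["P.S. 015 ROBERTO CLEMENTE", "P.S. 019 ASHER LEVY", "P.S. 020 ANNA SILVER"],
    "Permits_13": [12.0, 18.0, 9.0], "Score_13": [284, 296, 294],
    "Permits_14": [22, 51, 9],       "Score_14": [279, 301, 290],
    "Permits_15": [32, 55, 10],      "Score_15": [283, 308, 293],
})

# Split "Permits_13" into stub "Permits" and suffix 13, and likewise for "Score".
panel = pd.wide_to_long(wide, stubnames=["Permits", "Score"],
                        i="Name", j="year", sep="_")

# One row per (school, year); rename to the x / y convention used in the question.
panel = panel.rename(columns={"Permits": "x", "Score": "y"}).astype(float)
panel.index.names = ["id", "year"]
panel = panel.sort_index()
print(panel)
```

Here the year level holds the two-digit suffixes (13, 14, 15) rather than timestamps, which is enough for adding year dummies later.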

However, I didn't understand how to pass all these variables to the model, as can be done in R:

reg1 = lm(work ~ post93 + anykids + p93kids.interaction, data = etc)

Here, 13, 14 and 15 represent data for 2013, 2014 and 2015, which I believe should be used to create a panel. I called the model like this:

reg  = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)

And this is the result:

I was told (by an economist) that this doesn't seem to be running with fixed effects.

--EDIT--

What I want to verify is the effect of the number of permits on the score over time. The number of permits is the treatment; it is an intensive treatment.

A sample of the code can be found here: https://www.dropbox.com/sh/ped312ur604357r/AACQGloHDAy8I2C6HITFzjqza?dl=0.

Answer 1

Accepted answer by etna

It seems that what you need is not a difference in differences (DD) regression. DD regressions are relevant when you can distinguish a control group and a treatment group. A standard simplified example would be the evaluation of a medicine. You split a population of sick people into two groups. Half of them are given nothing: they are the control group. The other half are given a medicine: they are the treatment group. Essentially, the DD regression captures the fact that the real effect of the medicine is not directly measurable by counting how many of the people who were given the medicine got healthy. Intuitively, you want to know whether these people did better than the ones who were not given any medicine. This result could be refined by adding yet another category: a placebo one, i.e. people who are given something which looks like a medicine but actually isn't... but again this would be a well-defined group. Last but not least, for a DD regression to be really appropriate, you need to make sure the groups are not heterogeneous in a way that could bias results. A bad situation for your medicine test would be if the treatment group included only people who are young and super fit (hence more likely to heal in general), while the control group is a bunch of old alcoholics...

In your case, if I'm not mistaken, everybody gets "treated" to some extent... so you are closer to a standard regression framework where the impact of X on Y (e.g. IQ on wage) is to be measured. I understand that you want to measure the impact of the number of permits on the score (or is it the other way around? -_-), and you have classical endogeneity to deal with, i.e. if Peter is more skilled than Paul, he'll typically obtain more permits AND a higher score. So what you actually want to exploit is the fact that, with the same level of skill over time, Peter (respectively Paul) will be "given" different numbers of permits over the years... and that is where you'll really measure the influence of permits on score...
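
The Peter-and-Paul argument can be illustrated with a small simulation. This is only a sketch on synthetic data, not the asker's data: the names skill, true_effect and the helper slope are made up for the illustration, and the data-generating process (skill raising both permits and score) is an assumption chosen to mimic the endogeneity described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ids, n_years, true_effect = 200, 3, 0.5

# Unobserved skill is constant over time for each entity (one column, broadcast over years).
skill = rng.normal(size=(n_ids, 1))
# More skilled entities get more permits (x) AND, independently of x, a higher score (y).
x = skill + rng.normal(size=(n_ids, n_years))
y = 2.0 * skill + true_effect * x + rng.normal(scale=0.5, size=(n_ids, n_years))


def slope(a, b):
    """OLS slope of b on a with an intercept (both are centered first)."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (a * a).sum())


# Pooled OLS ignores skill, so the slope also picks up skill's effect on the score.
pooled = slope(x, y)
# Subtracting each entity's own time average removes anything constant per entity.
within = slope(x - x.mean(axis=1, keepdims=True),
               y - y.mean(axis=1, keepdims=True))

print(f"pooled: {pooled:.2f}  within: {within:.2f}  true: {true_effect}")
```

The pooled slope should land well above the true effect, while the within (entity fixed effects) slope should be close to it, which is the point of using each entity's variation over time.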

I might not be guessing well, but I want to stress that there are many ways to obtain biased, hence meaningless, results if you don't put enough effort into understanding and explaining what is going on in the data. Regarding technical details, your estimation only has year fixed effects (likely not estimated but accounted for through demeaning, hence not returned in the output), so what you want to do is add entity_effects=True. If you want to go further... I'm afraid panel data regressions are not well covered in any Python package so far (including statsmodels, which is the reference for econometrics), so if you're not willing to invest... I would rather suggest using R or Stata. Meanwhile, if a fixed-effects regression is all you need, you can also get it with statsmodels (which also lets you cluster standard errors if needed):

import statsmodels.formula.api as smf

# Turn the (date, id) MultiIndex into ordinary columns for the formula API
df = s.reset_index(drop=False)
# C(date) and C(id) add year and entity dummies, i.e. two-way fixed effects
reg = smf.ols('y ~ x + C(date) + C(id)',
              data=df).fit()
print(reg.summary())
# clustering standard errors at individual level
reg_cl = smf.ols(formula='y ~ x + C(date) + C(id)',
                 data=df).fit(cov_type='cluster',
                              cov_kwds={'groups': df['id']})
print(reg_cl.summary())
# output only coeff and standard error of x
print('{:.3f} ({:.3f})'.format(reg.params['x'], reg.bse['x']))
print('{:.3f} ({:.3f})'.format(reg_cl.params['x'], reg_cl.bse['x']))
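
The remark about fixed effects being "taken into account through demeaning" can be checked by hand. The sketch below uses only NumPy and four schools from the question's table; the helper two_way_demean and the variable names are illustrative, not part of any library. It compares the slope from two-way demeaned data with the slope from an explicit dummy-variable regression (the same idea as C(date) + C(id)); for a balanced panel like this one the two should coincide.

```python
import numpy as np

# Rows: four schools from the question; columns: 2013, 2014, 2015.
x = np.array([[12., 22., 32.],      # permits
              [18., 51., 55.],
              [9., 9., 10.],
              [3., 4., 1.]])
y = np.array([[284., 279., 283.],   # scores
              [296., 301., 308.],
              [294., 290., 293.],
              [294., 292., 296.]])


def two_way_demean(a):
    """Remove entity means and year means, then add back the grand mean."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


# Within estimator: simple slope on the demeaned data, no dummies needed.
xd, yd = two_way_demean(x), two_way_demean(y)
beta_within = (xd * yd).sum() / (xd ** 2).sum()

# Dummy-variable regression: intercept, x, entity dummies and year dummies
# (first category dropped in each set to avoid collinearity).
n, t = x.shape
ids = np.repeat(np.arange(n), t)     # entity label of each flattened row
years = np.tile(np.arange(t), n)     # year label of each flattened row
design = np.column_stack([
    np.ones(n * t),
    x.ravel(),
    (ids[:, None] == np.arange(1, n)).astype(float),
    (years[:, None] == np.arange(1, t)).astype(float),
])
beta_lsdv = np.linalg.lstsq(design, y.ravel(), rcond=None)[0][1]

print(beta_within, beta_lsdv)
```

This is why a demeaning-based estimator does not report the fixed effects themselves: they are removed from the data before the slope is estimated, whereas the dummy-variable version estimates one coefficient per school and per year.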

Regarding econometrics, you'll likely get more/better answers on Cross Validated than here.
