Python, Pandas and the Chi-Squared Test of Independence

Question

Asked by Mia

I am quite new to Python as well as Statistics. I'm trying to apply the Chi Squared Test to determine whether previous success affects the level of change of a person (percentage wise, this does seem to be the case, but I wanted to see whether my results were statistically significant).

My question is: Did I do this correctly? My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want of course...but 0 seems a little bit too perfect for a p-value, so I'm wondering whether I did it incorrectly coding wise).

Here's what I did:

import numpy as np
import pandas as pd
import scipy.stats as stats

d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
 'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
 'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}

total_summarized = pd.DataFrame(d)

observed = total_summarized.iloc[0:2, 0:2]  # .ix was removed in pandas 1.0; .iloc selects by position

Output: Observed

expected = np.outer(total_summarized["row_totals"][0:2],
                    total_summarized.loc["col_totals"][0:2])/1000  # .loc replaces the removed .ix

expected = pd.DataFrame(expected)

expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
                  df = 8)   # *

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                         df=8)
print("P value")
print(p_value)

stats.chi2_contingency(observed= observed)

Output Statistics

Answer 1

Answered by Warren Weckesser

A few corrections:

Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
For a 2x2 contingency table such as this, the number of degrees of freedom is 1, not 8.
Your calculation of chi_squared_stat does not include a continuity correction. (It isn't necessarily wrong to omit it; that's a judgment call for the statistician.)
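
As a rough illustration of the first two corrections, here is a minimal sketch of the manual calculation with the expected counts divided by the grand total and one degree of freedom. It builds the table as a plain NumPy array rather than the DataFrame from the question, and it uses stats.chi2.sf instead of 1 - stats.chi2.cdf, which is a substitution on my part because the survival function tends to behave better for very small p-values. No continuity correction is applied here.

```python
import numpy as np
from scipy import stats

# Observed 2x2 table from the question
# (rows: changed strategy yes/no; columns: previously successful/unsuccessful).
observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()  # 1284, not 1000

# Expected counts under independence: (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-squared statistic, without continuity correction.
chi_squared_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1), which is 1 here.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
p_value = stats.chi2.sf(chi_squared_stat, dof)

print(chi_squared_stat, dof, p_value)
```

The statistic and p-value from this sketch should agree with what chi2_contingency reports when correction=False is passed.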

All the calculations that you perform (expected matrix, statistics, degrees of freedom, p-value) are computed by chi2_contingency:

In [65]: observed
Out[65]: 
                        Previously Successful  Previously Unsuccessful
Yes - changed strategy                  129.3                   260.17
No                                      182.7                   711.83

In [66]: from scipy.stats import chi2_contingency

In [67]: chi2, p, dof, expected = chi2_contingency(observed)

In [68]: chi2
Out[68]: 23.383138325890453

In [69]: p
Out[69]: 1.3273696199438626e-06

In [70]: dof
Out[70]: 1

In [71]: expected
Out[71]: 
array([[  94.63757009,  294.83242991],
       [ 217.36242991,  677.16757009]])

By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer not to use the correction, you can disable it with the argument correction=False:

In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)

In [74]: chi2
Out[74]: 24.072616672232893

In [75]: p
Out[75]: 9.2770200776879643e-07
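
To make the difference between the two results more concrete, the following sketch reproduces the corrected statistic by hand using Yates' continuity correction, which reduces each absolute difference between observed and expected counts by 0.5 before squaring. It assumes every absolute difference is at least 0.5, which holds for this table; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring.
# Assumes every |observed - expected| is at least 0.5 (true for this table).
abs_diff = np.abs(observed - expected)
yates_stat = ((abs_diff - 0.5) ** 2 / expected).sum()
yates_p = chi2.sf(yates_stat, 1)  # one degree of freedom for a 2x2 table

# Compare with chi2_contingency, which applies the correction by default for 2x2 tables.
scipy_stat, scipy_p, scipy_dof, _ = chi2_contingency(observed)
print(yates_stat, scipy_stat)
print(yates_p, scipy_p)
```

The correction lowers the statistic slightly and therefore raises the p-value slightly; with counts this large the conclusion does not change.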

Answer 2

Answered by Ashutosh

Degrees of freedom = (rows - 1) x (columns - 1). For a 2x2 table this is (2 - 1) x (2 - 1) = 1.
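
The formula can be checked against the shape of any table. The helper below, contingency_dof, is a hypothetical name introduced only for this illustration, and the 3x4 table of ones is made-up data used to show a second shape.

```python
import numpy as np
from scipy.stats import chi2_contingency


def contingency_dof(table):
    """Return (rows - 1) * (columns - 1) for a two-dimensional table."""
    n_rows, n_cols = np.asarray(table).shape
    return (n_rows - 1) * (n_cols - 1)


two_by_two = [[129.3, 260.17], [182.7, 711.83]]  # table from the question
three_by_four = np.ones((3, 4))                  # made-up table, shape only

print(contingency_dof(two_by_two))     # 1
print(contingency_dof(three_by_four))  # 6

# chi2_contingency reports the same value as its third result.
_, _, scipy_dof, _ = chi2_contingency(two_by_two)
print(scipy_dof)
```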
