Original source: http://stackoverflow.com/questions/43963606/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python, Pandas & Chi-Squared Test of Independence
Asked by Mia
I am quite new to Python as well as Statistics. I'm trying to apply the Chi Squared Test to determine whether previous success affects the level of change of a person (percentage wise, this does seem to be the case, but I wanted to see whether my results were statistically significant).
My question is: Did I do this correctly? My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want of course...but 0 seems a little bit too perfect for a p-value, so I'm wondering whether I did it incorrectly coding wise).
Here's what I did:
import numpy as np
import pandas as pd
import scipy.stats as stats
d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
     'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
     'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}
total_summarized = pd.DataFrame(d)
observed = total_summarized.ix[0:2,0:2]
Output: Observed
expected = np.outer(total_summarized["row_totals"][0:2],
                    total_summarized.ix["col_totals"][0:2])/1000
expected = pd.DataFrame(expected)
expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
                      df = 8)   # *
print("Critical value")
print(crit)
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p-value
                             df=8)
print("P value")
print(p_value)
stats.chi2_contingency(observed= observed)
Output: Statistics
Answered by Warren Weckesser
A few corrections:
- Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
- For a 2x2 contingency table such as this, the degrees of freedom is 1, not 8.
- Your calculation of chi_squared_stat does not include a continuity correction. (But it isn't necessarily wrong to not use it; that's a judgment call for the statistician.)
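As a rough sketch of what the first two corrections look like in code, the manual calculation can be redone with the grand total of the observed table as the divisor and one degree of freedom. The variable names mirror the question; using a plain NumPy array instead of the DataFrame, and chi2.sf instead of 1 - chi2.cdf, are choices made here for illustration rather than anything the answer prescribes.

```python
import numpy as np
from scipy import stats

# Observed counts from the question.
# Rows: changed strategy (Yes / No); columns: previously successful / unsuccessful.
observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()  # 1284, not 1000

# Expected count for each cell if the two variables were independent.
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-squared statistic, without a continuity correction.
chi_squared_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1), so 1 here.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# sf is the survival function (1 - cdf); it avoids the rounding to 0.0
# that 1 - cdf can produce when the p-value is very small.
p_value = stats.chi2.sf(chi_squared_stat, df=dof)
crit = stats.chi2.ppf(0.95, df=dof)

print("Chi-squared statistic:", chi_squared_stat)
print("Critical value:", crit)
print("P value:", p_value)
```

With these changes the statistic and p-value should agree with what chi2_contingency reports when its continuity correction is turned off.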
All the calculations that you perform (expected matrix, statistics, degrees of freedom, p-value) are computed by chi2_contingency:
In [65]: observed
Out[65]:
                        Previously Successful  Previously Unsuccessful
Yes - changed strategy                  129.3                   260.17
No                                      182.7                   711.83
In [66]: from scipy.stats import chi2_contingency
In [67]: chi2, p, dof, expected = chi2_contingency(observed)
In [68]: chi2
Out[68]: 23.383138325890453
In [69]: p
Out[69]: 1.3273696199438626e-06
In [70]: dof
Out[70]: 1
In [71]: expected
Out[71]:
array([[ 94.63757009, 294.83242991],
[ 217.36242991, 677.16757009]])
By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer not to use the correction, you can disable it with the argument correction=False:
In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)
In [74]: chi2
Out[74]: 24.072616672232893
In [75]: p
Out[75]: 9.2770200776879643e-07
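To make the difference between the two results concrete, here is a minimal sketch that reproduces both statistics by hand. It assumes the correction is Yates' correction as SciPy documents it, where each observed count is moved 0.5 toward its expected count before squaring. The names deviation, corrected and uncorrected are illustrative only.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Continuity correction: shrink each |observed - expected| by 0.5
# (but never by more than the deviation itself) before squaring.
deviation = np.abs(observed - expected)
corrected = ((deviation - np.minimum(0.5, deviation)) ** 2 / expected).sum()

# Plain Pearson statistic, no correction.
uncorrected = ((observed - expected) ** 2 / expected).sum()

# Cross-check against SciPy with and without the correction.
chi2_default, _, _, _ = chi2_contingency(observed)
chi2_plain, _, _, _ = chi2_contingency(observed, correction=False)

print(corrected, chi2_default)
print(uncorrected, chi2_plain)
```

The corrected statistic is always a little smaller than the uncorrected one, which is why the default p-value is slightly larger.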
Answered by Ashutosh
Degrees of freedom = (rows - 1) x (columns - 1). For a 2x2 table this is (2 - 1) x (2 - 1) = 1.
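The formula is short enough to wrap in a small helper. The function name contingency_dof and the 3x4 table below are made up for illustration; the table is only there to cross-check the helper against the dof value that chi2_contingency returns.

```python
import numpy as np
from scipy.stats import chi2_contingency


def contingency_dof(n_rows, n_cols):
    """Degrees of freedom for an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)


print(contingency_dof(2, 2))  # the 2x2 case from the question

# Hypothetical 3x4 table of counts, used only as a cross-check.
table = np.array([[10, 20, 30, 40],
                  [15, 25, 35, 45],
                  [12, 18, 33, 41]])

# chi2_contingency returns (statistic, p-value, dof, expected); index 2 is dof.
dof_from_scipy = chi2_contingency(table)[2]
print(contingency_dof(*table.shape), dof_from_scipy)
```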