Original question: http://stackoverflow.com/questions/45015038/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
How to t-test by group in a pandas dataframe?
Asked by Rachel
I have a fairly large pandas dataframe with many columns. The dataframe contains two groups. It is basically set up as follows:
import pandas as pd
csv = [{"air" : 0.47,"co2" : 0.43 , "Group" : 1}, {"air" : 0.77,"co2" : 0.13 , "Group" : 1}, {"air" : 0.17,"co2" : 0.93 , "Group" : 2} ]
df = pd.DataFrame(csv)
I want to perform a paired t-test on air and co2, comparing the two groups Group = 1 and Group = 2.
I have many more columns than just air and co2, so I would like a procedure that works for all columns in the dataframe. I believe I could use scipy.stats.ttest_rel together with pd.groupby or apply. How would that work? Thanks in advance /R
Answered by error
I would use the pandas DataFrame.where method.
group1_air = df.where(df.Group== 1).dropna()['air']
group2_air = df.where(df.Group== 2).dropna()['air']
This code puts into group1_air all values of the air column where Group is 1, and into group2_air all values of air where Group is 2. The dropna() call is required because the .where method returns NaN for every row in which the condition is not met, so all rows where Group is 2 come back as NaN when you use df.where(df.Group== 1).
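As an alternative to where plus dropna, boolean indexing with df.loc selects the same rows without creating NaN placeholders first. The following is a minimal sketch using a small made-up dataframe (the values are illustrative and not from the question). One difference worth noting: dropna() on the whole frame also drops rows that have a missing value in any other column, which the .loc selection does not do.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "air": [0.47, 0.77, 0.17, 0.35],
    "co2": [0.43, 0.13, 0.93, 0.61],
    "Group": [1, 1, 2, 2],
})

# A boolean mask selects the matching rows directly; no NaN rows are created
group1_air = df.loc[df["Group"] == 1, "air"]
group2_air = df.loc[df["Group"] == 2, "air"]

# The where/dropna version from the answer, for comparison
via_where = df.where(df["Group"] == 1).dropna()["air"]
```

On data without missing values, both approaches should give the same selection.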
Whether you need scipy.stats.ttest_rel or scipy.stats.ttest_ind depends on your groups. If your samples come from independent groups, use ttest_ind; if they come from related (paired) groups, use ttest_rel.
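To illustrate the difference between the two tests, here is a small sketch on made-up measurements (the arrays before and after are hypothetical names and values, not data from the question). The paired test matches element i of one array with element i of the other, so both arrays must have the same length; the independent test has no such requirement.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, for illustration only
before = np.array([0.47, 0.77, 0.52, 0.61])
after = np.array([0.40, 0.70, 0.50, 0.55])

# Independent samples: the two arrays are treated as unrelated groups
ind_result = stats.ttest_ind(before, after)

# Paired samples: the test works on the element-wise differences,
# so both arrays must have the same length and a meaningful pairing
rel_result = stats.ttest_rel(before, after)

# A paired t-test is equivalent to a one-sample test on the differences
diff_result = stats.ttest_1samp(before - after, 0.0)

print(ind_result.pvalue, rel_result.pvalue)
```

Because the paired test removes the variation shared within each pair, the two tests can give quite different p-values on the same numbers, so the choice should follow from how the data were collected.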
So if your samples are independent of one another, the final piece of code you need is:
import scipy.stats
scipy.stats.ttest_ind(group1_air, group2_air)
Otherwise, use:
scipy.stats.ttest_rel(group1_air,group2_air)
To test co2 as well, simply replace air with co2 in the example above.
Edit:
This is a rough sketch of the code you could run to execute t-tests over every column in your dataframe except the Group column. You may need to adjust column_list to fit your needs (for example, you may not want to loop over every column).
import scipy.stats
# get a list of all columns in the dataframe without the Group column
column_list = [x for x in df.columns if x != 'Group']
# create an empty dictionary
t_test_results = {}
# loop over column_list and execute the code explained above
for column in column_list:
    group1 = df.where(df.Group== 1).dropna()[column]
    group2 = df.where(df.Group== 2).dropna()[column]
    # add the output to the dictionary
    t_test_results[column] = scipy.stats.ttest_ind(group1,group2)
# build a dataframe with one row per tested column
results_df = pd.DataFrame.from_dict(t_test_results,orient='index')
results_df.columns = ['statistic','pvalue']
At the end of this code you have a dataframe containing the t-test output for every column you looped over.
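The explicit loop can also be avoided, since scipy.stats.ttest_ind accepts 2-D arrays and an axis argument. The following is a sketch under the assumption that all non-Group columns are numeric and the groups are independent; the sample data and the name value_columns are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "air": [0.47, 0.77, 0.52, 0.17, 0.35, 0.22],
    "co2": [0.43, 0.13, 0.30, 0.93, 0.61, 0.80],
    "Group": [1, 1, 1, 2, 2, 2],
})

# Every column except the grouping column
value_columns = [c for c in df.columns if c != "Group"]
group1 = df.loc[df["Group"] == 1, value_columns]
group2 = df.loc[df["Group"] == 2, value_columns]

# With axis=0 the test is computed separately down each column
result = stats.ttest_ind(group1.to_numpy(), group2.to_numpy(), axis=0)

# One row per tested column
results_df = pd.DataFrame(
    {"statistic": result.statistic, "pvalue": result.pvalue},
    index=value_columns,
)
print(results_df)
```

This should produce the same statistics as the per-column loop. For paired data, stats.ttest_rel takes the same axis argument, but both groups then need the same number of rows in matching order.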