从 Pandas DataFrame 计算 pvalue

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/50137024/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 05:31:18  来源:igfitidea点击:

Calculate pvalue from pandas DataFrame

pythonpandasdataframescipystatistics

提问by Gabriela Catalina

I have a DataFrame stats with a Multindex and 8 samples (only two shown here) and 8 genes for each sample.

我有一个包含 Multindex 和 8 个样本(此处仅显示两个)和每个样本的 8 个基因的 DataFrame 统计数据。

 In[13]:stats
    Out[13]: 
                       ARG/16S                                            \
                         count          mean           std           min   
    sample      gene                                                       
    Arnhem      IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03   
                Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00   
                Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03   
                TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04   
                TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05   
                blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06   
                ermB       4.0  4.111081e-05  7.894879e-06  3.288865e-05   
                ermF       4.0  2.335210e-05  4.519758e-06  1.832037e-05   
    Basel       Aph3a      4.0  7.815592e-06  1.757242e-06  5.539389e-06   
                IC        11.0  5.095161e-03  5.639278e-03  1.302205e-03   
                Int1      12.0  1.333068e+01  1.872207e+01  4.988048e-02   
                Sul1      11.0  1.618617e-02  1.988817e-02  2.970397e-03   

I'm trying to calculate the p-value (Students t-test) for each of these samples, comparing each of the genes between them.

我正在尝试计算每个样本的 p 值(学生 t 检验),比较它们之间的每个基因。

I've used scipy.stats.ttest_ind_from_stats but I managed to get the p-values for the different samples for one gene and only those of the samples neighboring each other.

我使用过 scipy.stats.ttest_ind_from_stats 但我设法获得了一个基因的不同样本的 p 值,并且只有那些彼此相邻的样本的 p 值。

Experiments = list(values1_16S['sample'].unique())
for exp in Experiments:
    if Experiments.index(exp)<len(Experiments)-1:
        second = Experiments[Experiments.index(exp)+1]
    else:
        second = Experiments[0]
    tstat, pvalue = scipy.stats.ttest_ind_from_stats(stats.loc[(exp,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','count'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','count'])
    d.append({'loc1':exp, 'loc2':second, 'pvalue':pvalue})


stats_Sul1 = pd.DataFrame(d)
stats_Sul1

How can I get the pvalues between ALL samples? And is there a way to do this for all genes at once without running the code one by one for each gene?

如何获得所有样本之间的 pvalues?有没有一种方法可以一次对所有基因执行此操作,而无需为每个基因逐一运行代码?

采纳答案by Ben.T

Let's suppose you have the same X genes for the Y samples. I try my method with X=3 and Y=2 but I guess you can generalize. I started with:

假设您对 Y 样本具有相同的 X 基因。我用 X=3 和 Y=2 尝试我的方法,但我想你可以概括。我开始于:

df1 = 
             count       mean        std       min
sample gene                                       
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970

Note that the genes need to be in the same order. First reset_index()with df_reindex = df1.reset_index(), I'm not sure what I'm doing is possible with multiindex:

请注意,基因需要按相同顺序排列。首先reset_index()df_reindex = df1.reset_index(),我不知道我在做什么是可能的多指标:

df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970

I create a rolled DF and join it to df_reindex:

我创建了一个滚动的 DF 并将其加入df_reindex

nb_genes = 3
df_rolled = pd.DataFrame(pd.np.roll(df_reindex,nb_genes,0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join

Now on a same row, I have all data you needto calculate pvalueand create the column with apply:

现在在同一行,我有你需要计算pvalue和创建列的所有数据apply

df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)

Finally, I create a DF with the data you want and rename columns:

最后,我用你想要的数据创建一个 DF 并重命名列:

df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})

You ends up with data:

你最终得到数据:

df_output = 
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298

That you can reindex as you need.

您可以根据需要重新索引。

If you want to do it each sample against each other, I think a loop forcould do it.

如果你想对每个样本进行对比,我认为循环for可以做到。

EDIT:Using pivot_table, I think there is a easier way.

编辑:使用pivot_table,我认为有一种更简单的方法。

With your input statsas multiindex table for only ARG/16S(not sure how to handle this level), so I start with (which might be your stats['ARG/16S']):

仅将您的输入stats作为多索引表ARG/16S(不确定如何处理此级别),所以我从(可能是您的stats['ARG/16S'])开始:

df=
               count       mean           std       min
sample gene                                            
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970

With the function pivot_table, you can rearrange your data such as:

使用函数pivot_table,您可以重新排列数据,例如:

df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene', 
                               columns = 'sample', fill_value = 0)

In this df_pivot(I don't print it here for readability but at the end with the new column), you can create a column for each couple (sample1, sample2) using itertoolsand apply:

在此df_pivot(我不会在这里打印它以提高可读性,但在新列的末尾),您可以使用itertoolsand为每对夫妇(sample1、sample2)创建一列apply

import itertools
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_pivot[sample1+ '_' + sample2 ] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

I think this method is independent of the number of samples, genes and if genes are not all the same, you ends up with df_pivotlike:

我认为这种方法与样本数量、基因数量无关,如果基因不完全相同,您最终会得到df_pivot如下结果:

        count            mean                      std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel             
gene                                                                          
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  7.894879e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000

Let me know if it works

让我知道它是否有效

EDIT2:to reply to the comment, I think you can do this:

EDIT2:回复评论,我认为你可以这样做:

No change for df_pivotand then you create a multiindex DF df_multito write your results in:

没有改变df_pivot,然后你创建一个多索引 DFdf_multi来写入你的结果:

df_multi = pd.DataFrame(index = df.index.levels[1], 
                        columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)

Then you use the loop forto implement the data in this df_multi:

然后你使用循环for来实现这个数据df_multi

for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_multi.loc[:,(sample1,sample2)] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

Finally, you can use transposeand unstackon level 1 to get the way you ask (or close if I misunderstood)

最后,您可以在级别 1 上使用transposeunstack来获得您询问的方式(如果我误解了,请关闭)

df_output = df_multi.transpose().unstack(level=[1]).fillna(1)

You will see that you don't have the last sample in indexes and first sample in columns (because they don't exist how I built everything) if you want them, you need to replace itertools.combinationsby itertools.combinations_with_replacementin both the creation of df_multiand in the loop for( I didn't try it but it should work)

你会看到,你不必在列索引中最后一个样本和第一个样品(因为它们不存在我如何建立一切),如果你想要他们,你需要更换itertools.combinationsitertools.combinations_with_replacement两个创作的df_multi,并在环for(我没试过,但应该可以)