Original question: http://stackoverflow.com/questions/50137024/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Calculate pvalue from pandas DataFrame
Asked by Gabriela Catalina
I have a DataFrame stats with a MultiIndex, containing 8 samples (only two shown here) and 8 genes for each sample.
In [13]: stats
Out[13]:
               ARG/16S                                            \
                 count          mean           std           min
sample gene
Arnhem IC         11.0  2.319050e-03  7.396130e-04  1.503150e-03
       Int1       11.0  7.243040e+00  6.848327e+00  1.364879e+00
       Sul1       11.0  3.968956e-03  9.186019e-04  2.499074e-03
       TetB        2.0  1.154748e-01  1.627663e-01  3.816936e-04
       TetM        4.0  1.083125e-04  5.185259e-05  5.189226e-05
       blaOXA      4.0  4.210963e-06  3.783235e-07  3.843571e-06
       ermB        4.0  4.111081e-05  7.894879e-06  3.288865e-05
       ermF        4.0  2.335210e-05  4.519758e-06  1.832037e-05
Basel  Aph3a       4.0  7.815592e-06  1.757242e-06  5.539389e-06
       IC         11.0  5.095161e-03  5.639278e-03  1.302205e-03
       Int1       12.0  1.333068e+01  1.872207e+01  4.988048e-02
       Sul1       11.0  1.618617e-02  1.988817e-02  2.970397e-03
I'm trying to calculate the p-value (Student's t-test) between these samples, comparing each gene across them.
I've used scipy.stats.ttest_ind_from_stats, but so far I have only managed to get the p-values between samples for a single gene, and only for samples that are next to each other in the list.
import pandas as pd
import scipy.stats

d = []  # collects one result dict per pair of samples
Experiments = list(values1_16S['sample'].unique())
for exp in Experiments:
    # Pair each sample with the next one in the list, wrapping around at the end
    if Experiments.index(exp) < len(Experiments) - 1:
        second = Experiments[Experiments.index(exp) + 1]
    else:
        second = Experiments[0]
    tstat, pvalue = scipy.stats.ttest_ind_from_stats(
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'mean'],
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'std'],
        stats.loc[(exp, 'Sul1')]['ARG/16S', 'count'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'mean'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'std'],
        stats.loc[(second, 'Sul1')]['ARG/16S', 'count'])
    d.append({'loc1': exp, 'loc2': second, 'pvalue': pvalue})
stats_Sul1 = pd.DataFrame(d)
stats_Sul1
How can I get the pvalues between ALL samples? And is there a way to do this for all genes at once without running the code one by one for each gene?
Accepted answer by Ben.T
Let's suppose you have the same X genes for each of the Y samples. I tried my method with X=3 and Y=2, but I guess you can generalize. I started with:
df1 =
             count       mean        std       min
sample gene
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970
Note that the genes need to be in the same order for every sample.
First, flatten the index with df_reindex = df1.reset_index() (I'm not sure that what I do next is possible with a MultiIndex):
df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970
I create a rolled DataFrame (the rows shifted by the number of genes) and join it to df_reindex:
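import numpy as np
nb_genes = 3
# pd.np is deprecated and no longer available in recent pandas versions, so call numpy directly
df_rolled = pd.DataFrame(np.roll(df_reindex, nb_genes, 0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join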
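To see why the roll lines the samples up, here is a minimal sketch on a toy frame that keeps only the mean column from the table above. Shifting the rows by nb_genes moves each sample's block onto the next sample's block, so every row ends up next to the same gene from a different sample. The variable name rolled is mine, not from the answer.

```python
import numpy as np
import pandas as pd

# Toy long-format frame: 2 samples x 3 genes, genes in the same order per sample.
df_reindex = pd.DataFrame({
    "sample": ["Arnhem"] * 3 + ["Basel"] * 3,
    "gene": ["IC", "Int1", "Sul1"] * 2,
    "mean": [0.002319, 7.243040, 0.003969, 0.005095, 13.330680, 0.016186],
})

nb_genes = 3
# Shift every row down by nb_genes positions, wrapping around at the end.
rolled = np.roll(df_reindex.to_numpy(), nb_genes, axis=0)
df_rolled = pd.DataFrame(rolled, columns=df_reindex.columns)

# Columns coming from df_rolled get the "_" suffix.
df_joined = df_reindex.join(df_rolled, rsuffix="_")
print(df_joined[["sample", "gene", "sample_", "gene_"]])
```

With more than two samples, a single roll only pairs each sample with the one before it in row order, which is the same "neighbours only" limitation described in the question.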
Now each row holds all the data you need to calculate the p-value, so you can create the column with apply:
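# here stats is the scipy.stats module (from scipy import stats), not the DataFrame from the question
df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)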
Finally, I create a DF with the data you want and rename columns:
df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})
You end up with:
df_output =
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298
You can then reindex this as you need.
If you want to compare every sample against every other sample, I think a for loop could do it.
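One way such a loop might look is sketched below. It is not the answerer's code: it assumes a long-format frame like df_reindex, and the names df_long, by_sample and df_pairs are mine. The Arnhem and Basel rows come from the table above, while "SampleC" and its numbers are made up purely so that there are more than two samples to pair.

```python
import itertools

import pandas as pd
from scipy import stats

df_long = pd.DataFrame({
    "sample": ["Arnhem"] * 3 + ["Basel"] * 3 + ["SampleC"] * 3,
    "gene": ["IC", "Int1", "Sul1"] * 3,
    "count": [11, 11, 11, 11, 12, 11, 10, 10, 10],
    "mean": [0.002319, 7.243040, 0.003969,
             0.005095, 13.330680, 0.016186,
             0.004000, 9.500000, 0.008000],   # last three values are hypothetical
    "std": [0.000740, 6.848327, 0.000919,
            0.005639, 18.722070, 0.019888,
            0.002000, 8.000000, 0.004000],    # last three values are hypothetical
})

# One sub-frame per sample, indexed by gene so rows can be aligned by name.
by_sample = {name: grp.set_index("gene") for name, grp in df_long.groupby("sample")}

rows = []
for s1, s2 in itertools.combinations(by_sample, 2):
    a, b = by_sample[s1], by_sample[s2]
    genes = a.index.intersection(b.index)  # only genes present in both samples
    # ttest_ind_from_stats accepts arrays, so all genes are tested in one call.
    _, pvalues = stats.ttest_ind_from_stats(
        a.loc[genes, "mean"].to_numpy(dtype=float),
        a.loc[genes, "std"].to_numpy(dtype=float),
        a.loc[genes, "count"].to_numpy(dtype=float),
        b.loc[genes, "mean"].to_numpy(dtype=float),
        b.loc[genes, "std"].to_numpy(dtype=float),
        b.loc[genes, "count"].to_numpy(dtype=float),
    )
    rows.append(pd.DataFrame(
        {"loc1": s1, "loc2": s2, "gene": genes.tolist(), "pvalue": pvalues}))

df_pairs = pd.concat(rows, ignore_index=True)
print(df_pairs)
```

Aligning by gene name rather than by row position means the genes no longer have to be in the same order in every sample.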
EDIT: Using pivot_table, I think there is an easier way.
I take your input stats as a MultiIndex table for ARG/16S only (I'm not sure how to handle this column level), so I start with the following, which might be your stats['ARG/16S']:
df =
               count       mean           std       min
sample gene
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970
With the function pivot_table, you can rearrange your data so that each gene is one row and each sample is one column:
df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene',
                          columns = 'sample', fill_value = 0)
In this df_pivot (I don't print it here for readability, but it is shown at the end with the new column), you can create a column for each pair (sample1, sample2) using itertools and apply:
import itertools
from scipy import stats  # the scipy module, not the DataFrame from the question
for sample1, sample2 in itertools.combinations(df.index.levels[0], 2):
    # itertools.combinations creates all combinations between your samples
    df_pivot[sample1 + '_' + sample2] = df_pivot.apply(
        lambda x: stats.ttest_ind_from_stats(
            x['mean'][sample1], x['std'][sample1], x['count'][sample1],
            x['mean'][sample2], x['std'][sample2], x['count'][sample2])[1],
        axis=1).fillna(1)
I think this method is independent of the number of samples and genes, and it also works if the genes are not the same in every sample: a gene missing from a sample is filled with 0 by fill_value = 0, and fillna(1) then sets its p-value to 1. You end up with df_pivot like:
        count            mean                      std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel
gene
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  7.894879e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000
Let me know if it works
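Since scipy.stats.ttest_ind_from_stats accepts array-like inputs, the row-by-row apply can probably be replaced by one call per pair of samples. The sketch below is a variation on the method above, not the answerer's code: it uses unstack instead of pivot_table, so genes missing from a sample stay NaN (and get a NaN p-value) rather than being filled with 0 and then 1. The names wide and df_pvalues are mine, and only a subset of the rows shown above is used.

```python
import itertools

import numpy as np
import pandas as pd
from scipy import stats

index = pd.MultiIndex.from_tuples(
    [("Arnhem", "IC"), ("Arnhem", "Int1"), ("Arnhem", "Sul1"), ("Arnhem", "TetB"),
     ("Basel", "Aph3a"), ("Basel", "IC"), ("Basel", "Int1"), ("Basel", "Sul1")],
    names=["sample", "gene"])
df = pd.DataFrame({
    "count": [11, 11, 11, 2, 4, 11, 12, 11],
    "mean": [0.002319, 7.243040, 0.003969, 0.115475,
             0.000008, 0.005095, 13.330680, 0.016186],
    "std": [7.396130e-04, 6.848327e+00, 9.186019e-04, 1.627663e-01,
            1.757242e-06, 5.639278e-03, 1.872207e+01, 1.988817e-02],
}, index=index)

# One row per gene, one column per (statistic, sample); missing pairs become NaN.
wide = df.unstack("sample")

pvalues = {}
for s1, s2 in itertools.combinations(wide["mean"].columns, 2):
    # A single vectorized call tests every gene for this pair of samples.
    _, p = stats.ttest_ind_from_stats(
        wide["mean"][s1].to_numpy(), wide["std"][s1].to_numpy(), wide["count"][s1].to_numpy(),
        wide["mean"][s2].to_numpy(), wide["std"][s2].to_numpy(), wide["count"][s2].to_numpy())
    pvalues[s1 + "_" + s2] = p

df_pvalues = pd.DataFrame(pvalues, index=wide.index)
print(df_pvalues)
```

Leaving the missing genes as NaN keeps "no test was possible" distinct from "no evidence of a difference"; call fillna(1) on the result if the output of the original answer is preferred.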
EDIT2: To reply to the comment, I think you can do this:
Nothing changes for df_pivot. You then create a MultiIndex DataFrame df_multi to write your results in:
df_multi = pd.DataFrame(
    index = df.index.levels[1],
    columns = pd.MultiIndex.from_tuples(
        [p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)
Then you use the for loop to fill df_multi with the p-values:
for sample1, sample2 in itertools.combinations(df.index.levels[0], 2):
    # itertools.combinations creates all combinations between your samples
    df_multi.loc[:, (sample1, sample2)] = df_pivot.apply(
        lambda x: stats.ttest_ind_from_stats(
            x['mean'][sample1], x['std'][sample1], x['count'][sample1],
            x['mean'][sample2], x['std'][sample2], x['count'][sample2])[1],
        axis=1).fillna(1)
Finally, you can use transpose and unstack on level 1 to get the layout you asked for (or something close to it, if I misunderstood):
df_output = df_multi.transpose().unstack(level=[1]).fillna(1)
You will see that the last sample is missing from the index and the first sample is missing from the columns (because of how I built everything, those entries don't exist). If you want them, replace itertools.combinations with itertools.combinations_with_replacement both in the creation of df_multi and in the for loop (I didn't try it, but it should work).
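The difference between the two itertools functions can be checked on a plain list of sample names. In this small sketch "SampleC" is a hypothetical third sample, added only to make the pattern visible.

```python
import itertools

samples = ["Arnhem", "Basel", "SampleC"]  # "SampleC" is hypothetical

# Each unordered pair once, never a sample with itself.
pairs = list(itertools.combinations(samples, 2))

# Same pairs, plus each sample paired with itself.
pairs_with_self = list(itertools.combinations_with_replacement(samples, 2))

print(pairs)
print(pairs_with_self)
```

With combinations, the first sample never appears in second position and the last sample never appears in first position, which is why they are missing from the columns and the index respectively. The self-pairs added by combinations_with_replacement compare a sample with itself, so the t statistic is 0 and the p-value is 1 whenever the standard deviation is non-zero.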