Aggregate all dataframe row pair combinations using pandas

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original address and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/29777702/

Time: 2020-09-13 23:14:22  Source: igfitidea

Tags: python, pandas, aggregate, combinations, itertools

Asked by alexhli

I use Python pandas for grouping and aggregation across data frames, but I would now like to perform a specific pairwise aggregation of rows (n choose 2, i.e. statistical combinations). Here is some example data, where I would like to look at all pairs of genes in mygenes:

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The final product should look like this (applying np.sum by default is fine):

                 case1    case2    control1    control2
'ABC1', 'ABC2'    1         2         0            1
'ABC1', 'ABC3'    1         2         1            1
'ABC1', 'ABC4'    0         1         1            2
'ABC2', 'ABC3'    2         2         1            0
'ABC2', 'ABC4'    1         1         1            1
'ABC3', 'ABC4'    1         1         2            1 

The set of gene pairs can easily be obtained with itertools (itertools.combinations(mygenes, 2)), but I can't figure out how to aggregate specific rows based on their values. Can anyone advise? Thank you.

Answered by DSM

I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest thing that makes sense. In this case, I might call set_index("Gene") and then use loc to pick out the rows:

>>> import pandas as pd
>>> from itertools import combinations
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes,2))
>>> out = pd.DataFrame([df.loc[c,:].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
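
The same idea can be wrapped in a small reusable function. This is only a sketch: the helper name aggregate_pairs and its parameters (key, func) are hypothetical and not part of pandas, and the pair labels are joined into plain strings rather than kept as tuples, which is a choice made here for a simple flat index.

```python
from itertools import combinations

import pandas as pd


def aggregate_pairs(frame, labels, key="Gene", func="sum"):
    """Aggregate every pair of rows whose `key` value is in `labels`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    indexed = frame.set_index(key)
    pairs = list(combinations(labels, 2))
    # One aggregated row (a Series over the remaining columns) per pair.
    rows = [indexed.loc[list(pair)].agg(func) for pair in pairs]
    out = pd.DataFrame(rows)
    # Use a flat string label such as "ABC1-ABC2" for each pair.
    out.index = ["-".join(pair) for pair in pairs]
    return out


# Example data copied from the question.
df = pd.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                   'case1': [0, 1, 1, 0, 0],
                   'case2': [1, 1, 1, 0, 1],
                   'control1': [0, 0, 1, 1, 1],
                   'control2': [1, 0, 0, 1, 0]})

result = aggregate_pairs(df, ['ABC1', 'ABC2', 'ABC3', 'ABC4'])
print(result)
```

Passing a different func (for example "max" or "mean") swaps the aggregation without changing the pairing logic.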

Answered by JohnE

Before going too far, keep in mind that your data gets big pretty fast. With 5 rows, the output will have C(5,2) = 4+3+2+1 = 10 rows, and in general n rows produce n(n-1)/2 pairs.
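
To get a feel for how quickly the number of pairs grows, here is a small sketch using math.comb (available in Python 3.8 and later). The function name n_pairs and the sample sizes are chosen purely for illustration.

```python
import math
from itertools import combinations


def n_pairs(n):
    """Number of unordered row pairs for n rows, i.e. C(n, 2)."""
    return math.comb(n, 2)


# The closed form agrees with actually enumerating the pairs.
enumerated = len(list(combinations(range(5), 2)))

counts = {n: n_pairs(n) for n in (5, 100, 10_000)}
for n, count in counts.items():
    print(f"{n} rows -> {count} pairs")
```

For 10,000 rows this is already close to 50 million output rows, so how the result is stored matters as much as how it is computed.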

That said, I'd think about doing this in numpy for speed (you may want to add a numpy tag to your question btw). Anyway, this isn't as vectorized as it might be, but ought to be a start at least:

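import math
import numpy as np
import pandas as pd

# Keep only the genes of interest, in the order given by mygenes
df2 = df.set_index('Gene').loc[mygenes].reset_index()

sz = len(df2)
# C(sz, 2); integer division so the result can be used as an array shape
sz2 = math.factorial(sz) // (math.factorial(sz - 2) * 2)

gene = df2['Gene'].tolist()
abc = df2.iloc[:, 1:].values  # .iloc replaces the removed .ix indexer

# One output row per pair, same integer dtype as the input
arr = np.zeros([sz2, abc.shape[1]], dtype=abc.dtype)
gene2 = []
k = 0

for i in range(sz):
    for j in range(sz):
        if i < j:  # visit each unordered pair exactly once
            gene2.append(gene[i] + gene[j])
            arr[k] = abc[i] + abc[j]
            k += 1

pd.concat([pd.DataFrame(gene2), pd.DataFrame(arr)], axis=1)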
Out[1780]: 
          0  0  1  2  3
0  ABC1ABC2  1  2  0  1
1  ABC1ABC3  1  2  1  1
2  ABC1ABC4  0  1  1  2
3  ABC2ABC3  2  2  1  0
4  ABC2ABC4  1  1  1  1
5  ABC3ABC4  1  1  2  1

Depending on size/speed issues, you may need to separate the string and numerical code and vectorize the numerical piece. This code is not likely to scale all that well if your data is big, and if it is, that may determine what sort of answer you need (you may also need to think about how you store the results).
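
One way to separate the string and numerical pieces, sketched here as an assumption rather than as part of the original answer, is to build the pair positions with np.triu_indices and let NumPy add the rows in a single step. The column name "pair" and the "-" separator are illustrative choices.

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                   'case1': [0, 1, 1, 0, 0],
                   'case2': [1, 1, 1, 0, 1],
                   'control1': [0, 0, 1, 1, 1],
                   'control2': [1, 0, 0, 1, 0]})
mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']

sub = df.set_index('Gene').loc[mygenes]
values = sub.to_numpy()
names = sub.index.to_numpy()

# Row positions (i, j) with i < j, in the same order as
# itertools.combinations(range(len(sub)), 2).
i, j = np.triu_indices(len(sub), k=1)

# Numerical piece: all pairwise row sums computed at once.
sums = values[i] + values[j]

# String piece: labels are built separately from the numbers.
pair_labels = [f"{a}-{b}" for a, b in zip(names[i], names[j])]

out = pd.DataFrame(sums, columns=sub.columns)
out.insert(0, "pair", pair_labels)
print(out)
```

This still materializes all n(n-1)/2 rows in memory, so for very large inputs the pairs may need to be processed in chunks.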
