pandas groupby 并计算平均值但保留所有列

Question

提问by BioProgram

I would like to calculate the mean of the numeric columns using groupby but maintaining all columns. An example of a dataframe from 7 columns is here:

我想使用 groupby 但保留所有列来计算数字列的平均值。来自 7 列的数据框示例如下：

tracking_id gene_id gene_short_name tss_id  locus   FPKM-1  FPKM-2
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   Tss1231 1:4490927-4496413   0.611985    232
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   Ts412   1:4490927-4496413   12  21
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   Ts56    1:4490927-4496413   2   213
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   TS512   1:4490927-4496413   0.611985    5
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   TS12241 1:4490927-4496413   0.611985    51
ENSMUSG00000096126  ENSMUSG00000096126  Gm22307 TS124   1:4529016-4529123   35  1
ENSMUSG00000096126  ENSMUSG00000096126  Gm22307 TS-1824 1:4529016-4529123   1   2
ENSMUSG00000096126  ENSMUSG00000096126  Gm22307 TS1249082   1:4529016-4529123   2   5
ENSMUSG00000088000  ENSMUSG00000088000  Gm25493 TS1290328   1:4723276-4723379   0   1
ENSMUSG00000098104  ENSMUSG00000098104  Gm6085  TS01239-1   1:4687933-4689403   0.0743559   6
ENSMUSG00000033845  ENSMUSG00000033845  Mrpl15  TSS31014,TSS82987,TSS82990,TSS86849 1:4773205-4785739   79.1154 7
ENSMUSG00000093015  ENSMUSG00000093015  Gm22463 TSS79849    1:5644644-5644745   0   1
ENSMUSG00000025905  ENSMUSG00000025905  Oprk1   TSS15316,TSS3878,TSS6226,TSS65522   1:5588492-5606131   0   6
ENSMUSG00000033774  ENSMUSG00000033774  Npbwr1  TSS69693    1:5913706-5917398   0   8
ENSMUSG00000033793  ENSMUSG00000033793  Atp6v1h TSS4651 1:5083172-5162549   24.2386 9
ENSMUSG00000087247  ENSMUSG00000087247  Fam150a TSS42747    1:6359330-6394731   0.502804    1

I would like to group by the first 3 columns, and keep columns 4 and 5 in my output (best would be the first row of each repeated columns 1 to 3) and then calculate the mean of the numeric columns at the end. I have written this:

我想按前 3 列分组，并将第 4 列和第 5 列保留在我的输出中（最好是每个重复列 1 到 3 的第一行），然后计算最后数字列的平均值。我写过这个：

import pandas as pd
df = pd.read_table('grouping.txt')
grouped = df.groupby(list(df.columns[0:3]), sort=False).mean()

The output is:

输出是：

tracking_id gene_id gene_short_name FPKM-1  FPKM-2
ENSMUSG00000025902  ENSMUSG00000025902  Sox17   3.167191    104.4
ENSMUSG00000096126  ENSMUSG00000096126  Gm22307 12.66666667 2.666666667
ENSMUSG00000088000  ENSMUSG00000088000  Gm25493 0   1
ENSMUSG00000098104  ENSMUSG00000098104  Gm6085  0.0743559   6
ENSMUSG00000033845  ENSMUSG00000033845  Mrpl15  79.1154 7
ENSMUSG00000093015  ENSMUSG00000093015  Gm22463 0   1
ENSMUSG00000025905  ENSMUSG00000025905  Oprk1   0   6
ENSMUSG00000033774  ENSMUSG00000033774  Npbwr1  0   8
ENSMUSG00000033793  ENSMUSG00000033793  Atp6v1h 24.2386 9
ENSMUSG00000087247  ENSMUSG00000087247  Fam150a 0.502804    1

The above is an output but is missing columns 4 (TSS) and 5 (locus) of the input file. How can I retain those 2 columns (their values are different and hence cant be part of the groupby columns. Keeping ANY of the values of the columns is OK with me, as long as one of the grouped by is there).

以上是一个输出，但缺少输入文件的第 4 列（TSS）和第 5 列（轨迹）。我如何保留这 2 列（它们的值不同，因此不能成为 groupby 列的一部分。保留列的任何值对我来说都可以，只要分组依据之一在那里）。

Answer 1

回答by SPKoder

You can merge the results of your groupby() aggregation back into a deduped version of your original DataFrame. Perhaps something like this:

您可以将 groupby() 聚合的结果合并回原始 DataFrame 的重复数据删除版本。也许是这样的：

# identify the columns we want to aggregate by; this could
# equivalently be defined as list(df.columns[0:3])
group_cols = ['tracking_id', 'gene_id', 'gene_short_name']
# identify the columns which we want to average; this could
# equivalently be defined as list(df.columns[4:])
metric_cols = ['FPKM-1', 'FPKM-2']

# create a new DataFrame with a MultiIndex consisting of the group_cols
# and a column for the mean of each column in metric_cols
aggs = df.groupby(group_cols)[metric_cols].mean()
# remove the metric_cols from df because we are going to replace them
# with the means in aggs
df.drop(metric_cols, axis=1, inplace=True)
# dedupe to leave only one row with each combination of group_cols
# in df
df.drop_duplicates(subset=group_cols, keep='last', inplace=True)
# add the mean columns from aggs into df
df = df.merge(right=aggs, right_index=True, left_on=group_cols, how='right')

Answer 2

回答by SPKoder

You can use aggregation, with a dict of functionsto apply to each column. I'm using lambdas and the string version of Pandas (dataframe) functions, so that Pandas will pick up mean()automatically.

您可以使用聚合，将函数字典应用于每一列。我正在使用lambdas 和 Pandas（数据帧）函数的字符串版本，以便 Pandas 会mean()自动拾取。

grouped = df.groupby(list(df.columns[0:3]), sort=False).agg(
    {'FPKM-1': 'mean', 'FPKM-2': 'mean',
     'tss_id': lambda x: x.iloc[0], 'locus': lambda x: x.iloc[0]})
print(grouped)

gives:

给出：

tracking_id        gene_id            gene_short_name                               tss_id     FPKM-1      FPKM-2              locus
ENSMUSG00000025902 ENSMUSG00000025902 Sox17                                        Tss1231   3.167191  104.400000  1:4490927-4496413
ENSMUSG00000096126 ENSMUSG00000096126 Gm22307                                        TS124  12.666667    2.666667  1:4529016-4529123
ENSMUSG00000088000 ENSMUSG00000088000 Gm25493                                    TS1290328   0.000000    1.000000  1:4723276-4723379
ENSMUSG00000098104 ENSMUSG00000098104 Gm6085                                     TS01239-1   0.074356    6.000000  1:4687933-4689403
ENSMUSG00000033845 ENSMUSG00000033845 Mrpl15           TSS31014,TSS82987,TSS82990,TSS86849  79.115400    7.000000  1:4773205-4785739
ENSMUSG00000093015 ENSMUSG00000093015 Gm22463                                     TSS79849   0.000000    1.000000  1:5644644-5644745
ENSMUSG00000025905 ENSMUSG00000025905 Oprk1              TSS15316,TSS3878,TSS6226,TSS65522   0.000000    6.000000  1:5588492-5606131
ENSMUSG00000033774 ENSMUSG00000033774 Npbwr1                                      TSS69693   0.000000    8.000000  1:5913706-5917398
ENSMUSG00000033793 ENSMUSG00000033793 Atp6v1h                                      TSS4651  24.238600    9.000000  1:5083172-5162549
ENSMUSG00000087247 ENSMUSG00000087247 Fam150a                                     TSS42747   0.502804    1.000000  1:6359330-6394731

pandas groupby 并计算平均值但保留所有列

提问by BioProgram

回答by SPKoder

回答by SPKoder

相关推荐

最近更新

标签

pandas groupby 并计算平均值但保留所有列

提问by BioProgram

回答by SPKoder

回答by SPKoder

相关推荐

pandas 运行中型合并功能 ipython notebook jupyter 时出现内存错误

pandas 如何根据值列表选择熊猫中的行

pandas 熊猫将数据框绘制为多个条形图

pandas 在 jupyter ipython notebook 上导入熊猫失败

相关推荐

最近更新

标签