pandas 熊猫:.groupby().size() 和百分比

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/23627782/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:02:42  来源:igfitidea点击:

Pandas: .groupby().size() and percentages

pythonpandasbioinformatics

提问by erikfas

I have a DataFrame that originates from a df.groupby().size()operation, and looks like this:

我有一个源自df.groupby().size()操作的 DataFrame,如下所示:

Localization                           RNA level      
cytoplasm                              1 Non-expressed     7
                                       2 Very low         13
                                       3 Low               8
                                       4 Medium            6
                                       5 Moderate          8
                                       6 High              2
                                       7 Very high         6
cytoplasm & nucleus                    1 Non-expressed     5
                                       2 Very low          8
                                       3 Low               2
                                       4 Medium           10
                                       5 Moderate         16
                                       6 High              6
                                       7 Very high         5
cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                       2 Very low          3
                                       3 Low               3
                                       4 Medium            7
                                       5 Moderate          8
                                       6 High              4
                                       7 Very high         1

What I want to do is to calculate the separate occurrences (i.e. the last column coming from .size()) as a percentage of the total number of occurrences in the applicable Localization.

我想要做的是计算单独的出现次数(即来自 的最后一列.size())作为适用的Localization.

For example: there are a total of 50 occurrences in the cytoplasmlocalisation (7 + 13 + 8 + 6 + 8 + 2 + 6), yielding 14 and 26 % for the Non-expressedand Very lowRNA-levels, respectively.

例如:cytoplasm定位中共有 50 次出现(7 + 13 + 8 + 6 + 8 + 2 + 6) ,分别为Non-expressedVery lowRNA 水平产生 14% 和 26% 。

Is there a nice way of doing this? I've been going about it with what I think is a very roundabout way, i.e. making a new DataFrame for every Localizationand working on from there, but there's a lot of lines and the problem of having to merge all the resulting DataFrames in the end. I'm hoping there's a smarter way of doing it, at least!

有没有很好的方法来做到这一点?我一直在用一种我认为非常迂回的方式来处理它,即为每个创建一个新的 DataFrameLocalization并从那里开始工作,但是有很多行以及最终必须合并所有生成的 DataFrame 的问题. 我希望至少有一种更聪明的方法!

回答by Guillaume Jacquenot

Here is the complete example based on pandas groupby, sumfunctions. The basic idea is to group data based on 'Localization'and to apply a function on group.

这是基于Pandas完整的示例groupbysum函数。基本思想是基于组对数据进行分组'Localization'并在组上应用函数。

import pandas as pd
from StringIO import StringIO
#For Python 3: from io import StringIO

data = \
"""Localization,RNA level,Size
cytoplasm                            ,1 Non-expressed, 7
cytoplasm                            ,2 Very low     ,13
cytoplasm                            ,3 Low          , 8
cytoplasm                            ,4 Medium       , 6
cytoplasm                            ,5 Moderate     , 8
cytoplasm                            ,6 High         , 2
cytoplasm                            ,7 Very high    , 6
cytoplasm & nucleus                  ,1 Non-expressed, 5
cytoplasm & nucleus                  ,2 Very low     , 8
cytoplasm & nucleus                  ,3 Low          , 2
cytoplasm & nucleus                  ,4 Medium       ,10
cytoplasm & nucleus                  ,5 Moderate     ,16
cytoplasm & nucleus                  ,6 High         , 6
cytoplasm & nucleus                  ,7 Very high    , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low     , 3
cytoplasm & nucleus & plasma membrane,3 Low          , 3
cytoplasm & nucleus & plasma membrane,4 Medium       , 7
cytoplasm & nucleus & plasma membrane,5 Moderate     , 8
cytoplasm & nucleus & plasma membrane,6 High         , 4
cytoplasm & nucleus & plasma membrane,7 Very high    , 1"""

# Create the dataframe
df = pd.read_csv(StringIO(data))
df['Localization'].str.strip()
df['RNA level'].str.strip()
df['Size'].astype(int)
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: x/sum(x))