
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/23377108/


Pandas percentage of total with groupby

Tags: python, pandas

Asked by erikcw

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (the total of all percentages in each state is 100%).

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,  # list() needed on Python 3
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:


                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

Accepted answer by exp1orer

Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

Returns:


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
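
As a side note (my addition, not part of the original answer): on recent pandas versions the same percentages can be computed without apply, using transform to broadcast each state's total back onto the MultiIndexed frame. A minimal sketch assuming the state_office frame from above:

# transform('sum') returns a frame of the same shape as state_office,
# with every office row carrying its state's total, so plain division works.
state_pcts = 100 * state_office / state_office.groupby(level=0).transform('sum')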

Answered by Paul H

You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
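
To make the alignment concrete (my addition): with the seed above, the intermediate state frame has a plain state index, which div matches against the state level of state_office's MultiIndex. Its values can be cross-checked against the sales_y column in upliftedLemur's answer below:

state

          sales
state          
AZ      1310725
CA      2098663
CO      1096653
WA      1543854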

Answered by iggy

You can sum the whole DataFrame and divide by the state total:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

Returns:

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: 'str' and 'str'

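As an aside (my addition, not from the original answer): selecting the sales column explicitly sidesteps this, since only the numeric column is ever divided. A minimal sketch assuming the same df:

# Divide only the sales column by the broadcast per-state totals;
# non-numeric columns such as office_id are never touched.
df['sales_ratio'] = df['sales'] / df.groupby('state')['sales'].transform('sum')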

Answered by Radical Edward

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation, so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups:

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 10)] * 10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups // 2)] * 2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]
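
For reference (my addition, not in the original answer): the grouped view above is just the summed groupby, i.e. the PreGroupby_df built below before its reset_index:

df.groupby(["Group 1", "Group 2", "Final Group"]).agg({'Numbers I want as percents': 'sum'})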

Array method of finding the percentage:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about 0.15 seconds.
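
As an aside (my addition, not part of the original answer): a transform-based sketch achieves the same merge-free broadcast of the group sums, paralleling the transform approach in Andy Hayden's answer below:

# Sum per final group, then divide by the per-(Group 1, Group 2) totals
# that transform('sum') broadcasts back onto each row.
pre = df.groupby(['Group 1', 'Group 2', 'Final Group'], as_index=False)['Numbers I want as percents'].sum()
group_totals = pre.groupby(['Group 1', 'Group 2'])['Numbers I want as percents'].transform('sum')
pre['Percent of Final Group'] = 100 * pre['Numbers I want as percents'] / group_totals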

Top answer method (using the lambda function):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about 21 seconds to produce the same result.

The result:


      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494

Answered by Andy Hayden

For conciseness I'd use the SeriesGroupBy:


In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level=0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64
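
(My addition: multiply by 100 if you want percentages rather than fractions.)

100 * c / c.groupby(level=0).sum()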

For multiple groups you have to use transform (using Radical's df):


In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
...
Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just under twice the speed of Radical's answer, for me ~0.08s).

Answered by upliftedLemur

A simple way I have used is a merge after the two groupbys, then doing a simple division.

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on='state', how='left')
state_office['sales_ratio'] = 100 * (state_office['sales_x'] / state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

Answered by louisD

I think this would do the trick in 1 line:


# The second groupby(level=0) keeps the normalisation within each state
df.groupby(['state', 'office_id']).sum().groupby(level=0).transform(lambda x: 100 * x / x.sum())

Answered by Little Bobby Tables

I think this needs benchmarking. Using the OP's original DataFrame,

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': list(range(1, 7)) * 2,
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})


1st Andy Hayden


As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.


c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()

3.42 ms ± 16.7 μs per loop
(mean ± std. dev. of 7 runs, 100 loops each)



2nd Paul H


state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100

4.66 ms ± 24.4 μs per loop
(mean ± std. dev. of 7 runs, 100 loops each)



3rd exp1orer


This is the slowest answer, as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

Here is the modification:

(
    df.groupby(['state', 'office_id'])
    .agg({'sales': 'sum'})
    .groupby(level=0)
    .apply(lambda x: 100 * x / float(x.sum()))
)

10.6 ms ± 81.5 μs per loop
(mean ± std. dev. of 7 runs, 100 loops each)



So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up and, on a larger dataset with high-cardinality groupbys, it is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,


import string

import numpy as np
import pandas as pd
np.random.seed(0)

groups = [
    ''.join(i) for i in zip(
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
                       )
]

df = pd.DataFrame({'state': groups * 400,
               'office_id': list(range(1, 601)) * 20000,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)] * 1000000
})

Using Andy's,


2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer's,

19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

So now we see a 10x speed-up on large, high-cardinality datasets.



Be sure to upvote these three answers if you upvote this one!!

Answered by Alessandro

import numpy as np
import pandas as pd
# Note: no seed is set in this answer, so the numbers below differ
# from the seeded examples above.
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

Returns:


                     sales
state office_id           
AZ    2          54.587910
      4          33.009225
      6          12.402865
CA    1          32.046582
      3          44.937684
      5          23.015735
CO    1          21.099989
      3          31.848658
      5          47.051353
WA    2          43.882790
      4          10.265275
      6          45.851935

Answered by ajknzhol

The most elegant way to find percentages across columns or index is to use pd.crosstab.


Sample Data


import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe looks like this:

print(df)

       state  office_id   sales
    0     CA          1  764505
    1     WA          2  313980
    2     CO          3  558645
    3     AZ          4  883433
    4     CA          5  301244
    5     WA          6  752009
    6     CO          1  457208
    7     AZ          2  259657
    8     CA          3  584471
    9     WA          4  122358
    10    CO          5  721845
    11    AZ          6  136928

Just specify the index, columns, and the values to aggregate. The normalize keyword will calculate the percentage across the index or columns, depending on the context.

result = pd.crosstab(index=df['state'],
                     columns=df['office_id'],
                     values=df['sales'],
                     aggfunc='sum',
                     # normalize='index' makes each row sum to 1, i.e. fractions
                     normalize='index').applymap('{:.2f}%'.format)

print(result)
office_id      1      2      3      4      5      6
state                                              
AZ         0.00%  0.20%  0.00%  0.69%  0.00%  0.11%
CA         0.46%  0.00%  0.35%  0.00%  0.18%  0.00%
CO         0.26%  0.00%  0.32%  0.00%  0.42%  0.00%
WA         0.00%  0.26%  0.00%  0.10%  0.00%  0.63%
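
One caveat (my note, not from the original answer): normalize='index' returns fractions, so the '{:.2f}%' format above merely appends a percent sign to a fraction (0.20% really means 20%). A sketch of a variant using Python's '%' format spec, which multiplies by 100 before formatting:

result = pd.crosstab(index=df['state'],
                     columns=df['office_id'],
                     values=df['sales'],
                     aggfunc='sum',
                     normalize='index').applymap('{:.2%}'.format)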