
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/31064752/

Date: 2020-09-13 23:32:08  Source: igfitidea

Pandas Pivot_Table : Percentage of row calculation for non-numeric values

Tags: python, pandas

Asked by keerthi kumar

This is my DATA in dataframe "df":


Document    Name    Time
SPS2315511  A   1 HOUR
SPS2315512  B   1 - 2 HOUR
SPS2315513  C   2 - 3 HOUR
SPS2315514  C   1 HOUR
SPS2315515  B   1 HOUR
SPS2315516  A   2 - 3 HOUR
SPS2315517  A   1 - 2 HOUR
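
If you want to follow along, the sample data above can be rebuilt as a DataFrame like so (a minimal sketch; the values are copied from the table above):

```python
import pandas as pd

# Rebuild the sample data shown above
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})
print(df)
```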

I am using the code below, which gives me a count summary in the pivot table:


import pandas as pd

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"],
                       aggfunc=len,
                       margins=True, dropna=True)

but what I want is the % of row calculation, as in an Excel pivot table when you right-click the pivot and select "Show Values As -> % of Row Total". Since my Document column is non-numeric, I was not able to get it.


EXPECTED RESULT :


Count of Document, as % of row total:

Name         1 HOUR   1 - 2 HOUR   2 - 3 HOUR   Grand Total
A            33.33%   33.33%       33.33%       100.00%
B            50.00%   50.00%       0.00%        100.00%
C            50.00%   0.00%        50.00%       100.00%
Grand Total  42.86%   28.57%       28.57%       100.00%

Can anyone please help me figure out a way to get this result?


I am trying to manipulate the pivot data so that it gives me the row total (not the data from the dataframe), and what I want is "% of row total". Also, most importantly, all my data are non-numeric values.


Answered by JohnE

The possible duplicate noted by @maxymoo is pretty close to a solution, but I'll go ahead and write it up as an answer since there are a couple of differences that are not completely straightforward.


table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"], 
                       aggfunc=len, margins=True, 
                       dropna=True, fill_value=0)

       Document                      
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name                                 
A             1      1          1   3
B             1      1          0   2
C             0      1          1   2
All           2      3          2   7

The main tweak there is to add fill_value=0, because what you really want there is a count value of zero, not a NaN.


Then you can basically use the solution @maxymoo linked to, but you need to use iloc or similar because the table columns are a little complicated now (being a multi-indexed result of the pivot table).


table2 = table.div( table.iloc[:,-1], axis=0 )

       Document                         
Time 1 - 2 HOUR    1 HOUR 2 - 3 HOUR All
Name                                    
A      0.333333  0.333333   0.333333   1
B      0.500000  0.500000   0.000000   1
C      0.000000  0.500000   0.500000   1
All    0.285714  0.428571   0.285714   1

You've still got some minor formatting work to do there (flip first and second columns and convert to %), but those are the numbers you are looking for.

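That remaining formatting can be done in pandas too. The following is a sketch of one way to do it (the DataFrame construction is copied from the question's data; the column reorder assumes the alphabetical order the pivot produces):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Document": ["SPS2315511", "SPS2315512", "SPS2315513", "SPS2315514",
                 "SPS2315515", "SPS2315516", "SPS2315517"],
    "Name": ["A", "B", "C", "C", "B", "A", "A"],
    "Time": ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "1 HOUR",
             "1 HOUR", "2 - 3 HOUR", "1 - 2 HOUR"],
})

table = pd.pivot_table(df, values=["Document"], index=["Name"],
                       columns=["Time"], aggfunc=len,
                       margins=True, dropna=True, fill_value=0)

# Divide each row by its "All" (last) column to get row percentages
table2 = table.div(table.iloc[:, -1], axis=0)

# Put "1 HOUR" first (the pivot sorts column labels alphabetically),
# then render each cell as a percentage string
order = ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR", "All"]
pct = table2[[("Document", c) for c in order]]
pct = pct.apply(lambda col: col.map("{:.2%}".format))
print(pct)
```

This reproduces the percentages in the expected result, with "All" standing in for "Grand Total".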

Btw, it's not necessary here, but you might want to think about converting 'Time' to an ordered categorical variable, which would be one way to solve the column ordering problem (I think), but may or may not be worth the bother depending on what else you are doing with the data.

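As a sketch of that suggestion (the category labels are taken from the question's data), making "Time" an ordered categorical means sorting, and hence the pivot's column order, follows the declared order instead of plain string sorting:

```python
import pandas as pd

time_order = ["1 HOUR", "1 - 2 HOUR", "2 - 3 HOUR"]
s = pd.Series(["2 - 3 HOUR", "1 HOUR", "1 - 2 HOUR", "1 HOUR"])

# With ordered=True, sort order follows time_order, not string order
s = s.astype(pd.CategoricalDtype(categories=time_order, ordered=True))
print(s.sort_values().tolist())
# → ['1 HOUR', '1 HOUR', '1 - 2 HOUR', '2 - 3 HOUR']
```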