Pandas: Difference between pivot and pivot_table. Why is only pivot_table working?

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/30960338/

Time: 2020-08-19 09:14:05  Source: igfitidea

Tags: python, pandas, pivot

Asked by jwillis0720

I have the following dataframe.

df.head(30)

     struct_id  resNum score_type_name  score_value
0   4294967297       1           omega     0.064840
1   4294967297       1          fa_dun     2.185618
2   4294967297       1      fa_dun_dev     0.000027
3   4294967297       1     fa_dun_semi     2.185591
4   4294967297       1             ref    -1.191180
5   4294967297       2            rama    -0.795161
6   4294967297       2           omega     0.222345
7   4294967297       2          fa_dun     1.378923
8   4294967297       2      fa_dun_dev     0.028560
9   4294967297       2      fa_dun_rot     1.350362
10  4294967297       2         p_aa_pp    -0.442467
11  4294967297       2             ref     0.249477
12  4294967297       3            rama     0.267443
13  4294967297       3           omega     0.005106
14  4294967297       3          fa_dun     0.020352
15  4294967297       3      fa_dun_dev     0.025507
16  4294967297       3      fa_dun_rot    -0.005156
17  4294967297       3         p_aa_pp    -0.096847
18  4294967297       3             ref     0.979644
19  4294967297       4            rama    -1.403292
20  4294967297       4           omega     0.212160
21  4294967297       4          fa_dun     4.218029
22  4294967297       4      fa_dun_dev     0.003712
23  4294967297       4     fa_dun_semi     4.214317
24  4294967297       4         p_aa_pp    -0.462765
25  4294967297       4             ref    -1.960940
26  4294967297       5            rama    -0.600053
27  4294967297       5           omega     0.061867
28  4294967297       5          fa_dun     3.663050
29  4294967297       5      fa_dun_dev     0.004953

According to the pivot documentation, I should be able to reshape this on the score_type_name using the pivot function.

df.pivot(columns='score_type_name',values='score_value',index=['struct_id','resNum'])

But, I get the following.

[image from the original post; not reproduced here]

However, pivot_table function seems to work:

pivoted = df.pivot_table(columns='score_type_name',
                         values='score_value',
                         index=['struct_id','resNum'])

[image from the original post; not reproduced here]

But it does not lend itself, at least for me, to further analysis. I want it to have just struct_id, resNum, and the score_type_name values as columns, instead of stacking score_type_name on top of the other columns. Additionally, I want struct_id to appear on every row, rather than being merged into one joined row as it is in the table.

So can anyone tell me how I can get a nice DataFrame like the one I want using pivot? Additionally, from the documentation, I can't tell why pivot_table works and pivot doesn't. If I look at the first example of pivot, it looks like exactly what I need.

P.S. I did post a question in reference to this problem, but I did such a poor job of demonstrating the output, I deleted it and tried again using ipython notebook. I apologize in advance if you are seeing this twice.

Here is the notebook for your full reference

EDIT - My desired results would look like this (made in excel):

StructId    resNum  pdb_residue_number  chain_id  name3  fa_dun  fa_dun_dev  fa_dun_rot  fa_dun_semi   omega  p_aa_pp     rama      ref
4294967297       1                  99  A         ASN    2.1856      0.0000                   2.1856  0.0648                    -1.1912
4294967297       2                 100  A         MET    1.3789      0.0286      1.3504               0.2223  -0.4425  -0.7952   0.2495
4294967297       3                 101  A         VAL    0.0204      0.0255     -0.0052               0.0051  -0.0968   0.2674   0.9796
4294967297       4                 102  A         GLU    4.2180      0.0037                   4.2143  0.2122  -0.4628  -1.4033  -1.9609
4294967297       5                 103  A         GLN    3.6630      0.0050                   3.6581  0.0619  -0.2759  -0.6001  -1.5172
4294967297       6                 104  A         MET    1.5175      0.2206      1.2968               0.0504  -0.3758  -0.7419   0.2495
4294967297       7                 105  A         HIS    3.6987      0.0184                   3.6804  0.0547   0.4019  -0.1489   0.3883
4294967297       8                 106  A         THR    0.1048      0.0134      0.0914               0.0003  -0.7963  -0.4033   0.2013
4294967297       9                 107  A         ASP    2.3626      0.0005                   2.3620  0.0521   0.1955  -0.3499  -1.6300
4294967297      10                 108  A         ILE    1.8447      0.0270      1.8176               0.0971   0.1676  -0.4071   1.0806
4294967297      11                 109  A         ILE    0.1276      0.0092      0.1183               0.0208  -0.4026  -0.0075   1.0806
4294967297      12                 110  A         SER    0.2921      0.0342      0.2578               0.0342  -0.2426  -1.3930   0.1654
4294967297      13                 111  A         LEU    0.6483      0.0019      0.6464               0.0845  -0.3565  -0.2356   0.7611
4294967297      14                 112  A         TRP    2.5965      0.1507                   2.4457  0.5143  -0.1370  -0.5373   1.2341
4294967297      15                 113  A         ASP    2.6448      0.1593                           0.0510           -0.5011

Accepted answer by JohnE

I'm not sure I understand, but I'll give it a try. I usually use stack/unstack instead of pivot. Is this closer to what you want?

df.set_index(['struct_id','resNum','score_type_name']).unstack()

                  score_value                                              
score_type_name        fa_dun fa_dun_dev fa_dun_rot fa_dun_semi     omega   
struct_id  resNum                                                           
4294967297 1         2.185618   0.000027        NaN    2.185591  0.064840   
           2         1.378923   0.028560   1.350362         NaN  0.222345   
           3         0.020352   0.025507  -0.005156         NaN  0.005106   
           4         4.218029   0.003712        NaN    4.214317  0.212160   
           5         3.663050   0.004953        NaN         NaN  0.061867   


score_type_name     p_aa_pp      rama       ref  
struct_id  resNum                                
4294967297 1            NaN       NaN -1.191180  
           2      -0.442467 -0.795161  0.249477  
           3      -0.096847  0.267443  0.979644  
           4      -0.462765 -1.403292 -1.960940  
           5            NaN -0.600053       NaN  

I'm not sure why your pivot isn't working (it kind of seems to me like it should, but I could be wrong), but it does seem to work (or at least not give an error) if I leave off 'struct_id'. Of course, that's not really a useful solution for the full dataset, where you have more than one distinct value for 'struct_id'.

df.pivot(columns='score_type_name',values='score_value',index='resNum')

score_type_name    fa_dun  fa_dun_dev  fa_dun_rot  fa_dun_semi     omega  
resNum                                                                     
1                2.185618    0.000027         NaN     2.185591  0.064840   
2                1.378923    0.028560    1.350362          NaN  0.222345   
3                0.020352    0.025507   -0.005156          NaN  0.005106   
4                4.218029    0.003712         NaN     4.214317  0.212160   
5                3.663050    0.004953         NaN          NaN  0.061867   

score_type_name   p_aa_pp      rama       ref  
resNum                                         
1                     NaN       NaN -1.191180  
2               -0.442467 -0.795161  0.249477  
3               -0.096847  0.267443  0.979644  
4               -0.462765 -1.403292 -1.960940  
5                     NaN -0.600053       NaN  

Edit to add: reset_index() will convert from a multi-index (hierarchical) to a flatter style. There is still some hierarchy in the column names. Sometimes the easiest way to get rid of it is just to do df.columns=['var1','var2',...], although there are more sophisticated ways if you do some searching.

df.set_index(['struct_id','resNum','score_type_name']).unstack().reset_index()

                  struct_id resNum score_value                            
score_type_name                         fa_dun fa_dun_dev fa_dun_rot   
0                4294967297      1    2.185618   0.000027        NaN   
1                4294967297      2    1.378923   0.028560   1.350362   
2                4294967297      3    0.020352   0.025507  -0.005156   
3                4294967297      4    4.218029   0.003712        NaN   
4                4294967297      5    3.663050   0.004953        NaN   
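
One of the "more sophisticated ways" mentioned above might look like the following. This is only a sketch on a small stand-in frame (a few values copied from the question), and the names wide and flat are made up for illustration: it drops the outer score_value level of the column MultiIndex with droplevel instead of retyping every column name.

```python
import pandas as pd

# Small stand-in for the question's data (values copied from its first rows).
df = pd.DataFrame({
    "struct_id": [4294967297] * 4,
    "resNum": [1, 1, 2, 2],
    "score_type_name": ["omega", "fa_dun", "omega", "rama"],
    "score_value": [0.064840, 2.185618, 0.222345, -0.795161],
})

wide = df.set_index(["struct_id", "resNum", "score_type_name"]).unstack()

# Drop the outer 'score_value' level so only the score names remain,
# clear the leftover axis label, then move the index back into columns.
wide.columns = wide.columns.droplevel(0)
wide.columns.name = None
flat = wide.reset_index()

print(flat.columns.tolist())
```

Score types that are missing for a residue simply show up as NaN in the flattened frame.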

Answer by tegancp

To get the dataframe you obtained from the pivot_table call into the format you want:

pivoted.columns.name=None  ## remove the score_type_name
result = pivoted.reset_index()  ## puts index columns back into dataframe body
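
For anyone who wants to try those two lines end to end, here is a self-contained sketch. The small frame is a stand-in I made up from the first rows of the question, not the full dataset.

```python
import pandas as pd

# Stand-in data: a few rows shaped like the question's dataframe.
df = pd.DataFrame({
    "struct_id": [4294967297] * 4,
    "resNum": [1, 1, 2, 2],
    "score_type_name": ["omega", "fa_dun", "omega", "rama"],
    "score_value": [0.064840, 2.185618, 0.222345, -0.795161],
})

pivoted = df.pivot_table(columns="score_type_name",
                         values="score_value",
                         index=["struct_id", "resNum"])

pivoted.columns.name = None      # remove the 'score_type_name' axis label
result = pivoted.reset_index()   # struct_id and resNum become ordinary columns

print(result)
```

After this, struct_id is repeated on every row and the score names sit in a single header row.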

Answer by Day.ong Li

I debugged it a little bit.

  • DataFrame.pivot() and DataFrame.pivot_table() are different.
  • pivot() doesn't accept a list for index.
  • pivot_table() does.

Internally, both of them use reset_index()/stack()/unstack() to do the job.

pivot() is just a shortcut for simple usage, I think.
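
To illustrate the "shortcut" idea, here is a minimal sketch with a single index column, on made-up stand-in data. It compares pivot() with the set_index()/unstack() spelling. I am not claiming this is how pandas implements pivot() internally, only that the two give the same frame in this case. Note also that, as far as I know, pandas 1.1.0 and later do accept a list for index in pivot(), so the list limitation above applies to older releases.

```python
import pandas as pd

# Stand-in data with one struct_id, so resNum alone identifies a row.
df = pd.DataFrame({
    "resNum": [1, 1, 2, 2],
    "score_type_name": ["omega", "fa_dun", "omega", "rama"],
    "score_value": [0.064840, 2.185618, 0.222345, -0.795161],
})

via_pivot = df.pivot(index="resNum",
                     columns="score_type_name",
                     values="score_value")

# The same reshape spelled out by hand: index on both keys, then unstack
# the innermost level (score_type_name) into columns.
via_unstack = df.set_index(["resNum", "score_type_name"])["score_value"].unstack()

print(via_pivot.equals(via_unstack))
```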

Answer by Kevin Glynn

Another caveat:

pivot_table will only allow numerical types as "values=" (at least with its default mean aggregation), whereas pivot will take string types as "values=".
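
A small sketch of this caveat, using a made-up frame with string values (the column names field and text are hypothetical). pivot() reshapes the strings directly. For pivot_table() I pass aggfunc="first", a non-numeric aggregation, which is my own workaround rather than something from the answer above. With the default mean the call either drops the column or raises, depending on the pandas version, so that case is not shown.

```python
import pandas as pd

# Made-up frame whose values are strings rather than numbers.
df = pd.DataFrame({
    "resNum": [1, 1, 2, 2],
    "field": ["chain_id", "name3", "chain_id", "name3"],
    "text": ["A", "ASN", "A", "MET"],
})

# pivot() does no aggregation, so string values pass straight through.
by_pivot = df.pivot(index="resNum", columns="field", values="text")

# pivot_table() aggregates; 'first' is an aggregation that works on strings.
by_table = df.pivot_table(index="resNum", columns="field",
                          values="text", aggfunc="first")

print(by_pivot)
print(by_table)
```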

Answer by Tanachat

For anyone who is still interested in the difference between pivot and pivot_table, there are mainly two differences:

  • pivot_table is a generalization of pivot that can handle duplicate values for one pivoted index/column pair. Specifically, you can give pivot_table a list of aggregation functions using the keyword argument aggfunc. The default aggfunc of pivot_table is numpy.mean.
  • pivot_table also supports using multiple columns for the index and columns of the pivoted table. A hierarchical index will be automatically generated for you.
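
Both points can be seen in a short sketch. The data below is made up so that one index/column pair is repeated, and the aggregations are passed as the strings "mean" and "max", which is my choice to keep the example quiet on recent pandas versions.

```python
import pandas as pd

# Made-up data: (struct_id=1, resNum=1, 'omega') appears twice.
df = pd.DataFrame({
    "struct_id": [1, 1, 1, 2],
    "resNum": [1, 1, 1, 1],
    "score_type_name": ["omega", "omega", "ref", "omega"],
    "score_value": [1.0, 3.0, 5.0, 7.0],
})

table = df.pivot_table(index=["struct_id", "resNum"],   # two index columns
                       columns="score_type_name",
                       values="score_value",
                       aggfunc=["mean", "max"])          # list of aggregations

# Rows carry a hierarchical (struct_id, resNum) index; columns carry one
# outer level per aggregation function.
print(table)
```

The duplicated pair is collapsed once per function, so its mean is 2.0 and its max is 3.0.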

REF: pivot and pivot_table

Answer by Asif Khan

The following snippet may help you further flatten the look of your dataframe:

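# Select the score_value Series first so unstack() yields single-level columns.
flat = df.set_index(['struct_id','resNum','score_type_name'])['score_value'].unstack().reset_index()
flat.loc[:,['struct_id','resNum','fa_dun','fa_dun_dev','fa_dun_rot']]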

Answer by Dheeraj

Before calling pivot, we need to ensure that our data does not have rows with duplicate values for the specified columns.

Calling pivot on data with duplicates gives:

Index contains duplicate entries, cannot reshape

If we can't ensure this, we may have to use the pivot_table method instead.
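
Here is a small sketch of that failure and the fallback, on made-up data where one index/column pair is repeated. The variable names error_message and fallback are only for illustration.

```python
import pandas as pd

# Made-up data: (resNum=1, 'omega') appears twice.
df = pd.DataFrame({
    "resNum": [1, 1, 2],
    "score_type_name": ["omega", "omega", "omega"],
    "score_value": [0.1, 0.3, 0.2],
})

try:
    df.pivot(index="resNum", columns="score_type_name", values="score_value")
    error_message = None
except ValueError as exc:
    # pivot() cannot decide which of the two values to keep.
    error_message = str(exc)

print(error_message)

# pivot_table() aggregates the duplicates instead (mean by default).
fallback = df.pivot_table(index="resNum",
                          columns="score_type_name",
                          values="score_value")
print(fallback)
```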

Please find the link below for a more detailed explanation

https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/

Answer by Elaine

pivot() is used for pivoting without aggregation. Therefore, it can't deal with duplicate values for one index/column pair.

Since your index=['struct_id','resNum'] has multiple duplicates here, pivot doesn't work.

However, pivot_table will work because it handles duplicate values by aggregating them.
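
If you are unsure whether your data has such duplicates, one way to check is DataFrame.duplicated on the full index/column combination. This is a sketch on made-up data, under my reading that the duplicates that matter are rows repeating the whole (struct_id, resNum, score_type_name) key, not just the index columns. The names keys, repeated and wide are hypothetical.

```python
import pandas as pd

# Made-up data: (1, 1, 'omega') is present twice.
df = pd.DataFrame({
    "struct_id": [1, 1, 1, 1],
    "resNum": [1, 1, 1, 2],
    "score_type_name": ["omega", "ref", "omega", "omega"],
    "score_value": [0.2, 0.5, 0.4, 0.9],
})

keys = ["struct_id", "resNum", "score_type_name"]

# keep=False marks every row of a repeated key, not only the later copies.
repeated = df[df.duplicated(subset=keys, keep=False)]
print(repeated)

if repeated.empty:
    # No repeats: a plain reshape without aggregation is safe.
    wide = df.set_index(keys)["score_value"].unstack()
else:
    # Repeats present: aggregate them explicitly.
    wide = df.pivot_table(index=["struct_id", "resNum"],
                          columns="score_type_name",
                          values="score_value",
                          aggfunc="mean")
print(wide)
```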