Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/9588331/

Date: 2020-09-13 15:39:35  Source: igfitidea

Simple cross-tabulation in pandas

Tags: python, pandas, dataframe, pivot-table

Asked by Jon Clements

I stumbled across pandas and it looks ideal for the simple calculations that I'd like to do. I have a SAS background and was thinking it could replace proc freq; it looks like it will scale to what I may want to do in the future. However, I can't seem to get my head around a simple task (I'm not sure whether I'm supposed to look at pivot/crosstab/indexing, or whether I should have a Panel or DataFrames, etc.). Could someone give me some pointers on how to do the following:

I have two CSV files of simple transactional data, one for 2010 and one for 2011. The columns are category and amount.

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00

These are loaded into separate DataFrame objects.

What I'd like to do is get the category, the sum for the category, and the frequency of the category, e.g.:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1

I can't work out whether I should be using pivot/crosstab/groupby/an index, etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would like to do it on a month-by-month basis, but I think if someone would be so kind as to point me to the right technique/direction, I'll be able to go from there.

Accepted answer by Jeff Hammerbacher

Assuming that you have a file called 2010.csv with these contents:

category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00

Then, using the ability to apply multiple aggregation functions following a groupby, you can write:

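import pandas

# Load the 2010 transactions (columns: category, value)
data_2010 = pandas.read_csv("/path/to/2010.csv")
# Per category, apply two aggregations: row count (len) and total (sum)
data_2010.groupby("category").agg([len, sum])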

You should get a result that looks something like this:

          value     
            len  sum
category            
AB            2  300
AC            1  150
AD            1  500
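
As a rough sketch of the same groupby idea on more recent pandas versions (an assumption; the answer above predates them), named aggregation with the string names "sum" and "size" gives flat column names close to the output the question asks for. The inline CSV text, the summarize helper, and the total/count column names are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Inline copies of the question's sample data, with a header row added.
CSV_2010 = "category,value\nAB,100.00\nAB,200.00\nAC,150.00\nAD,500.00\n"
CSV_2011 = "category,value\nAB,500.00\nAC,250.00\nAX,900.00\n"


def summarize(csv_text):
    """Return one row per category with the total and the number of rows."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Named aggregation (pandas >= 0.25): output_name=(column, function).
    return df.groupby("category").agg(
        total=("value", "sum"),
        count=("value", "size"),
    )


summary_2010 = summarize(CSV_2010)
summary_2011 = summarize(CSV_2011)
print(summary_2010)
print(summary_2011)
```

Keeping one helper for both years means the 2010 and 2011 DataFrames are summarized the same way; "size" counts rows including any with missing values, which matches what len does.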

Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.

Answer by Wes McKinney

v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category            
AB           2   300
AC           1   150
AD           1   500
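
A self-contained variant of the call above, for anyone who wants to try it without a CSV file. This is only a sketch: the DataFrame is rebuilt inline from the sample data, and the string names "count" and "sum" are swapped in for the len and sum callables, which is an assumption about what suits newer pandas versions rather than what the answer used.

```python
import pandas as pd

df = pd.DataFrame(
    {"category": ["AB", "AB", "AC", "AD"], "value": [100, 200, 150, 500]}
)

# String names select pandas' own implementations of each aggregation.
# Note: "count" counts non-null values, which equals len() only when
# the column has no missing data.
table = df.pivot_table(index="category", values="value", aggfunc=["count", "sum"])
print(table)
```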


<= v0.12

For those interested, it is possible to do this using pivot_table:

In [8]: df
Out[8]: 
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]: 
            len    sum
          value  value
category              
AB            2    300
AC            1    150
AD            1    500

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

In [12]: df
Out[12]: 
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]: 
            len            sum        
          value  value2  value  value2
category                              
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5
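
Working with hierarchically indexed columns can be unfamiliar, so here is a small sketch of how one might select from them or flatten them. It uses the index parameter and string aggregation names from newer pandas versions rather than the rows parameter shown above; the variable names and the underscore-joined column names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["AB", "AB", "AC", "AD"],
        "value": [100, 200, 150, 500],
        "value2": [5, 5, 5, 5],
    }
)

table = df.pivot_table(index="category", aggfunc=["count", "sum"])

# The outer column level selects every column for one aggregation.
sums = table["sum"]

# An (outer, inner) tuple selects a single column.
n_value = table[("count", "value")]

# Joining the two levels gives flat names such as "sum_value".
flat = table.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```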

The main reason to prefer np.sum over __builtin__.sum is that you get NA-handling from np.sum. pandas could probably intercept the Python built-in; will make a note about that now.
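
To make the NA-handling difference concrete, here is a minimal sketch on a plain Series with one missing value. The sample values are made up for illustration, and the behaviour shown is for calling the functions directly on a Series; what happens inside groupby or pivot_table may differ between pandas versions.

```python
import numpy as np
import pandas as pd

values = pd.Series([100.0, np.nan, 200.0])

# The Python built-in adds elements one by one, so a single NaN
# makes the whole result NaN.
builtin_total = sum(values)

# Series.sum skips missing values by default (skipna=True).
pandas_total = values.sum()

# np.sum on a Series delegates to Series.sum, so it skips NaN too.
numpy_total = np.sum(values)

print(builtin_total, pandas_total, numpy_total)
```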