Simple cross-tabulation in pandas
Source: http://stackoverflow.com/questions/9588331/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Jon Clements
I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq; it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure whether I'm supposed to look at pivot/crosstab/indexing, or whether I should have a Panel or DataFrames, etc.). Could someone give me some pointers on how to do the following:
I have two CSV files of simple transactional data (one for 2010, one for 2011). The columns are category and amount.
2010:
AB,100.00
AB,200.00
AC,150.00
AD,500.00
2011:
AB,500.00
AC,250.00
AX,900.00
These are loaded into separate DataFrame objects.
What I'd like to do is get the category, the sum of the category, and the frequency of the category, e.g.:
2010:
AB,300.00,2
AC,150.00,1
AD,500.00,1
2011:
AB,500.00,1
AC,250.00,1
AX,900.00,1
I can't work out whether I should be using pivot/crosstab/groupby/an index etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would like to do it on a month-by-month basis, but I think if someone would be so kind as to point me to the right technique/direction I'll be able to go from there.
Accepted answer by Jeff Hammerbacher
Assuming that you have a file called 2010.csv with contents
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like
          value
            len  sum
category
AB            2  300
AC            1  150
AD            1  500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
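For readers on a recent pandas release, here is a minimal sketch of the same groupby idea using named aggregation with string function names. It assumes pandas 0.25 or later; the output column names `freq` and `total` are made up for illustration, and the CSV is inlined so the snippet runs on its own.

```python
import io

import pandas as pd

# Inline copy of the 2010 sample data, standing in for /path/to/2010.csv.
csv_2010 = io.StringIO(
    "category,value\n"
    "AB,100.00\n"
    "AB,200.00\n"
    "AC,150.00\n"
    "AD,500.00\n"
)
data_2010 = pd.read_csv(csv_2010)

# Named aggregation: each keyword becomes one output column.
# "size" counts the rows in each group; "sum" is pandas' own sum.
# The names freq and total are illustrative, not part of the original answer.
summary = data_2010.groupby("category").agg(
    freq=("value", "size"),
    total=("value", "sum"),
)
print(summary)
```

String names sidestep the question of how pandas treats the Python built-ins `len` and `sum`, and named aggregation yields flat column names instead of a two-level header.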
Answer by Wes McKinney
v0.21 answer
Use pivot_table with the index parameter:
df.pivot_table(index='category', aggfunc=[len, sum])
            len    sum
          value  value
category
AB            2    300
AC            1    150
AD            1    500
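As a self-contained sketch of the call above, the following builds the sample frame inline and passes string function names instead of the built-ins. This is a variation, not the answer's exact code: `"count"` skips missing values, whereas `len` counts every row, so the two differ only when a group contains NaN.

```python
import pandas as pd

# Sample data from the question, built inline so the snippet runs on its own.
df = pd.DataFrame(
    {"category": ["AB", "AB", "AC", "AD"], "value": [100, 200, 150, 500]}
)

# One row per category; one block of columns per aggregation function.
table = df.pivot_table(index="category", values="value", aggfunc=["count", "sum"])
print(table)
```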
<= v0.12
For those interested, it is possible to do this using pivot_table:
In [8]: df
Out[8]:
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
            len    sum
          value  value
category
AB            2    300
AC            1    150
AD            1    500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In [12]: df
Out[12]:
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
            len            sum
          value  value2  value  value2
category
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5
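A short sketch of how one might work with such hierarchically indexed columns. It assumes a recent pandas (so `index=` rather than `rows=`) and string function names; the flattened column names at the end are just one possible convention, not something pandas produces by itself.

```python
import pandas as pd

# Two data columns, as in the session above.
df = pd.DataFrame(
    {
        "category": ["AB", "AB", "AC", "AD"],
        "value": [100, 200, 150, 500],
        "value2": [5, 5, 5, 5],
    }
)

result = df.pivot_table(index="category", aggfunc=["count", "sum"])

# Selecting on the outer level returns every data column for that function.
sums = result["sum"]

# A (function, column) tuple picks out a single Series.
value2_sum = result[("sum", "value2")]

# Optionally collapse the two header levels into plain strings.
flat = result.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]
print(flat)
```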
The main reason to use np.sum rather than __builtin__.sum is that np.sum gives you NA-handling. pandas could probably intercept the Python built-in; I'll make a note about that now.
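To illustrate the NA-handling point, here is a small sketch on a hypothetical Series with one missing value (the numbers are invented). It assumes a recent pandas and NumPy, where `np.sum` on a Series defers to the Series' own NA-skipping sum.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
s = pd.Series([100.0, np.nan, 200.0])

# The Python built-in adds the elements one by one, so NaN propagates.
builtin_total = sum(s)

# np.sum hands off to Series.sum, which skips missing values by default.
numpy_total = np.sum(s)

print(math.isnan(builtin_total))  # True
print(numpy_total)                # 300.0
```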

