Original question: http://stackoverflow.com/questions/15589354/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Frequency tables in pandas (like plyr in R)
Asked by user1043144
My problem is how to calculate frequencies over multiple variables in pandas. I want to get from this dataframe:
import pandas as pd

d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
to the following result:
            Participated  OfWhichpassed
ExamenYear
2007                   3              2
2008                   4              3
2009                   3              2
(1) One possibility I tried is to compute two dataframes and bind them:
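# Older pandas versions spelled these arguments rows= and cols=; current versions use index= and columns=
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
tx = pd.concat([t1, t2], axis=1)
# Both frames have a 'yes' column, so this selects two columns with the same name
Res1 = tx['yes']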
(2) The second possibility is to use an aggregation function:
import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})
Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awkward, to say the least. How is this done properly in pandas?
P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.
For reference: a few months ago I asked a similar question about R, and plyr could help there.
---- UPDATE ------
User DSM is right: there was a mistake in the desired result table.
(1) The code for option one is:
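# Older pandas versions spelled these arguments rows= and cols=; current versions use index= and columns=
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t3 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
# In current pandas t1 is a one-column DataFrame, so select the column to get a Series
Res1 = pd.DataFrame({'All': t1['StudentID'],
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})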
It will produce the result:
            All  OfWhichParticipated  OfWhichPassed
ExamenYear
2007          3                    2              2
2008          4                    3              3
2009          3                    3              2
(2) For option 2, thanks to user herrfz, I figured out how to use value_counts. The code is:
Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                     'Participated': lambda x: x.value_counts()['yes'],
                                     'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']
which produces the same result as Res1.
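One caveat with the value_counts lookup: value_counts only lists values that actually occur, so indexing with ['yes'] raises a KeyError for a group that contains no 'yes' at all. A minimal sketch of a more defensive variant, using a small made-up frame (df and count_yes are hypothetical names, not part of the question):

```python
import pandas as pd

# Hypothetical data: the 2009 group contains no 'yes' at all
df = pd.DataFrame({
    'ExamenYear': ['2008', '2008', '2009', '2009'],
    'Passed': ['yes', 'no', 'no', 'no'],
})

def count_yes(s):
    # Series.get returns the default instead of raising when 'yes' is absent
    return s.value_counts().get('yes', 0)

res = df.groupby('ExamenYear')['Passed'].agg(count_yes)
print(res)
```

With the data from the question every year has at least one 'yes', so the original lambdas work there as written.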
My question remains, though:
Using option 2, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting variable?
---- A NEW UPDATE ----
I have finally decided to use apply, which as I understand it is more flexible.
Accepted answer by herrfz
This:
d1.groupby('ExamenYear').agg({'Participated': len,
'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
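On the two open points in the question (using the same column twice and choosing the output names): more recent pandas versions (0.25 and later) support "named aggregation" in agg, which covers both. The following is only a sketch under that assumption and is not part of the original answer; n_yes and n_no are hypothetical helper names.

```python
import pandas as pd

d1 = pd.DataFrame({
    'StudentID': ['x1', 'x10', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'],
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

def n_yes(s):
    # Number of 'yes' entries in one group
    return int((s == 'yes').sum())

def n_no(s):
    # Number of 'no' entries in one group
    return int((s == 'no').sum())

# Each keyword is an output column name; the tuple is (input column, aggregation)
res = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    OfWhichParticipated=('Participated', n_yes),
    OfWhichPassed=('Passed', n_yes),
    NotPassed=('Passed', n_no),  # the 'Passed' column is used a second time
)
print(res)
```

Distinctly named helper functions are used instead of several lambdas so that the two aggregations on 'Passed' cannot be confused with each other.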
Answer by user1043144
I finally decided to use apply.
I am posting what I came up with, hoping that it can be useful for others.
From what I understand from Wes McKinney's book "Python for Data Analysis":
- apply is more flexible than agg and transform because you can define your own function.
- The only requirement is that the function returns a pandas object or a scalar value.
- The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
- You need to "hard-code" the structure you want at the end.
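To make the "call on each piece, then glue together" description concrete, here is a rough hand-written emulation of the split/apply/combine steps. It is a sketch of the idea only, not pandas' actual internals; summarize, pieces, manual and applied are hypothetical names.

```python
import pandas as pd

d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

def summarize(group):
    # The "hard-coded" output structure: one labelled value per statistic
    return pd.Series({'All': len(group),
                      'Part': (group['Participated'] == 'yes').sum(),
                      'Pass': (group['Passed'] == 'yes').sum()})

grouped = d1.groupby('ExamenYear')[['Participated', 'Passed']]

# Step 1: call the function on each piece of the grouped object
pieces = {year: summarize(group) for year, group in grouped}

# Step 2: glue the per-group results together with pandas.concat
manual = pd.concat(pieces, axis=1).T
manual.index.name = 'ExamenYear'

# For comparison, the same computation through groupby.apply
applied = grouped.apply(summarize)
print(manual)
```

On this data both tables contain the same counts, which is what the description of apply suggests.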
Here is what I came up with:
def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})
When I run it:
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results:
            All  Part  Pass
ExamenYear
2007          3     2     2
2008          4     3     3
2009          3     3     2
This approach also allows me to combine frequencies with other statistics:
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes'),
                      'test': x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
            All  Part  Pass      test
ExamenYear
2007          3     2     2  0.358702
2008          4     3     3  1.004504
2009          3     3     2  0.521511
I hope someone else will find this useful.
Answer by Ida
You can use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example:
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed      no  yes
ExamenYear
2007         1    2
2008         1    3
2009         1    2
Use the margins=True option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated  no  yes  All
ExamenYear
2007           1    2    3
2008           1    3    4
2009           0    3    3
All            2    8   10
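The two crosstab results can also be combined into the single table asked for in the question. This is one possible way to do it, sketched with hypothetical variable names (ct_part, ct_pass, res); it is not part of the original answer.

```python
import pandas as pd

d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# One frequency table per variable, both indexed by ExamenYear
ct_part = pd.crosstab(d1['ExamenYear'], d1['Participated'])
ct_pass = pd.crosstab(d1['ExamenYear'], d1['Passed'])

# Row sums give the number of students per year; the 'yes' columns give the counts of interest
res = pd.DataFrame({'All': ct_part.sum(axis=1),
                    'OfWhichParticipated': ct_part['yes'],
                    'OfWhichPassed': ct_pass['yes']})
print(res)
```

Because crosstab fills missing combinations with 0, the 'no' count for 2009 does not turn the table into floats.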
Answer by Karalga
There is another approach that I like to use for similar problems. It uses groupby and unstack:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6", "x7", "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
(this is just the raw data from above)
d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"
pd.DataFrame(data=[d2,d3]).T
            Participated  Passed
ExamenYear
2007                   2       2
2008                   3       3
2009                   3       2
This solution is slightly more cumbersome than the one above using apply, but I feel it is easier to understand and extend.
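As an illustration of extending this approach, the groupby/value_counts/unstack steps can be wrapped in a small helper that loops over any list of yes/no columns. The helper name yes_counts and its parameters are hypothetical, and this is only a sketch of one possible extension.

```python
import pandas as pd

d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

def yes_counts(df, by, columns, value='yes'):
    """Count how often `value` occurs in each of `columns`, per group of `by`."""
    parts = []
    for col in columns:
        counts = df.groupby(by)[col].value_counts().unstack(fill_value=0)
        # reindex guards against a column in which `value` never occurs
        counts = counts.reindex(columns=[value], fill_value=0)
        parts.append(counts[value].rename(col))
    # Glue the per-column counts side by side
    return pd.concat(parts, axis=1)

res = yes_counts(d1, 'ExamenYear', ['Participated', 'Passed'])
print(res)
```

Adding another yes/no column to the result then only requires adding its name to the list.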

