Frequency tables in pandas (like plyr in R)

Question

Asked by user1043144

My problem is how to calculate frequencies on multiple variables in pandas. I want to go from this dataframe:

import pandas as pd

d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])

to the following result:

             Participated  OfWhichpassed
 ExamenYear                             
2007                   3              2
2008                   4              3
2009                   3              2

(1) One possibility I tried is to compute two dataframes and bind them:

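# Older pandas spelled these keywords rows=/cols=; current versions use index=/columns=.
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)
tx = pd.concat([t1, t2], axis=1)

# Both pivots have a 'yes' column, so this selects two columns.
Res1 = tx['yes']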

(2) The second possibility is to use an aggregation function:

import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})

Res2.columns = ['Participated', 'OfWhichpassed']

Both ways are awkward, to say the least. How is this done properly in pandas?

P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.

For reference: a few months ago I asked a similar question for R here, and plyr could help.

---- UPDATE ----

User DSM is right: there was a mistake in the desired result table.

(1) The code for option one is:

# Older pandas spelled these keywords rows=/cols=; current versions use index=/columns=.
t1 = d1.pivot_table(values='StudentID', index=['ExamenYear'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Participated'], aggfunc=len)
t3 = d1.pivot_table(values='StudentID', index=['ExamenYear'], columns=['Passed'], aggfunc=len)

# In current pandas t1 is a one-column DataFrame, so select that column explicitly.
Res1 = pd.DataFrame({'All': t1['StudentID'],
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})

It will produce the result:

             All  OfWhichParticipated  OfWhichPassed
ExamenYear                                         
2007          3                    2              2
2008          4                    3              3
2009          3                    3              2

(2) For option 2, thanks to user herrfz, I figured out how to use value_counts, and the code will be:

Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                 'Participated': lambda x: x.value_counts()['yes'],
                                 'Passed': lambda x: x.value_counts()['yes']})

Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']

which will produce the same result as Res1.

My question remains, though:

Using option 2, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting variable?

---- A NEW UPDATE ----

I have finally decided to use apply, which I understand is more flexible.

Answer 1

Accepted answer by herrfz

This:

d1.groupby('ExamenYear').agg({'Participated': len, 
                              'Passed': lambda x: sum(x == 'yes')})

doesn't look way more awkward than the R solution, IMHO.
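
On the asker's follow-up about custom output names and reusing one column for several statistics: pandas 0.25 and later offer "named aggregation", where each keyword passed to agg names an output column. The following is only a sketch of that idea, not part of the original answer; the helper columns participated and passed and the output name ShareParticipated are made up for illustration, and the frame is a trimmed copy of the question's data.

```python
import pandas as pd

# Trimmed copy of the question's data (only the columns used here).
d1 = pd.DataFrame({
    'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# Hypothetical helper columns: turn the yes/no strings into booleans once.
flags = d1.assign(
    participated=d1['Participated'].eq('yes'),
    passed=d1['Passed'].eq('yes'),
)

# Named aggregation: output_name=(input_column, aggregation).
# The 'participated' column is used twice, under two different output names.
res = flags.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    OfWhichParticipated=('participated', 'sum'),
    OfWhichPassed=('passed', 'sum'),
    ShareParticipated=('participated', 'mean'),
)
print(res)
```

Converting to booleans first keeps the aggregation step free of lambdas, so every output column is described by a plain (column, function name) pair.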

Answer 2

Answer by user1043144

I finally decided to use apply.

I am posting what I came up with hoping that it can be useful for others.

From what I understand from Wes McKinney's book "Python for Data Analysis":

apply is more flexible than agg and transform because you can define your own function.
The only requirement is that the function returns a pandas object or a scalar value.
The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
You need to "hard-code" the structure you want at the end.

Here is what I came up with:

def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})

When I run it:

d1.groupby('ExamenYear').apply(ZahlOccurence_0)

I get the correct results:

            All  Part  Pass
ExamenYear                 
2007          3     2     2
2008          4     3     3
2009          3     3     2

This approach would also allow me to combine frequencies with other stats:

import numpy as np
d1['testValue'] = np.random.randn(len(d1))

def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
        'Part': sum(x['Participated'] == 'yes'),
        'Pass' :  sum(x['Passed'] == 'yes'),
        'test' : x['testValue'].mean()})


d1.groupby('ExamenYear').apply(ZahlOccurence_1)


            All  Part  Pass      test
ExamenYear                           
2007          3     2     2  0.358702
2008          4     3     3  1.004504
2009          3     3     2  0.521511
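
The mechanics described earlier (the function is called on each piece, and the pieces are glued together with pandas.concat) can be spelled out by hand. This is only a rough sketch of the idea, not how pandas implements apply internally; count_yes is a hypothetical stand-in for ZahlOccurence_0, and the frame is a trimmed copy of the question's data.

```python
import pandas as pd

# Trimmed copy of the question's data (only the columns used here).
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

def count_yes(group):
    # Hypothetical helper: one row of counts for one group.
    return pd.Series({'All': len(group),
                      'Part': (group['Participated'] == 'yes').sum(),
                      'Pass': (group['Passed'] == 'yes').sum()})

# Split and apply: call the function on each piece of the grouped object.
pieces = {year: count_yes(group) for year, group in d1.groupby('ExamenYear')}

# Combine: glue the per-group Series together, one row per year.
manual = pd.concat(pieces, axis=1).T
manual.index.name = 'ExamenYear'

# The same result through apply, for comparison.
applied = d1.groupby('ExamenYear')[['Participated', 'Passed']].apply(count_yes)
print(manual)
```

Because the function returns a Series with fixed labels, those labels become the columns of the combined table, which is the "hard-coded" structure mentioned above.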

I hope someone else will find this useful

Answer 3

Answer by Ida

You may use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example:

> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed      no  yes
ExamenYear         
2007         1    2
2008         1    3
2009         1    2

Use the margins=True option if you also want to see the subtotal of each row and column.

> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated  no  yes  All
ExamenYear                
2007           1    2    3
2008           1    3    4
2009           0    3    3
All            2    8   10
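
The two crosstabs can also be combined into the three-column table the question asks for. This is a sketch under the assumption that only the 'yes' counts and the row totals are wanted; the names part, passed and summary are made up here, and the frame is a trimmed copy of the question's data.

```python
import pandas as pd

# Trimmed copy of the question's data (only the columns used here).
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# margins=True adds an 'All' column (row totals) and an 'All' row (column totals);
# the 'All' row is dropped so that only the years remain in the index.
part = pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True).drop(index='All')
passed = pd.crosstab(d1['ExamenYear'], d1['Passed'])

# Pick the row totals and the 'yes' counts; the Series align on ExamenYear.
summary = pd.DataFrame({'All': part['All'],
                        'OfWhichParticipated': part['yes'],
                        'OfWhichPassed': passed['yes']})
print(summary)
```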

Answer 4

Answer by Karalga

There is another approach that I like to use for similar problems; it uses groupby and unstack:

d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6",   "x7",     "x8", "x9"],
                   'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
                   'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
                  columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])

(this is just the raw data from above)

d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"

pd.DataFrame(data=[d2,d3]).T
            Participated  Passed
ExamenYear                      
2007                   2       2
2008                   3       3
2009                   3       2

This solution is slightly more cumbersome than the one above using apply, but I feel it is easier to understand and extend.
