pandas: apply multiple functions of multiple columns to a groupby object

Question

Asked by johnbaltis

I want to apply multiple functions of multiple columns to a groupby object which results in a new pandas.DataFrame.

I know how to do it in separate steps:

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))

Which results in user_df being:

However I suspect that there is a better way, like:

by_user.agg({'elapsed_days': lambda x: (x.elapsed_time * x.num_cores).sum() / 86400, 
             'running_days': lambda x: (x.running_time * x.num_cores).sum() / 86400})

However, this doesn't work, because AFAIK agg() works on pandas.Series.
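
A quick way to check this is the sketch below. It uses made-up sample data, and the helper name probe is an assumption, not part of the original question. With a dict, agg calls each function with one column of one group, so the function never gets to see num_cores:

```python
import pandas as pd

# Hypothetical sample data, only for illustration
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'num_cores': [7, 8, 9, 4]})

seen = []  # records the type of object agg hands to the function

def probe(values):
    # values is a single column of a single group, not the whole group
    seen.append(type(values).__name__)
    return values.sum()

totals = lasts.groupby('user').agg({'elapsed_time': probe})
print(set(seen))
print(totals)
```

The dict keys also have to be existing column names, which is why keys like 'elapsed_days' cannot work here.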

I did find this question and answer, but the solutions look rather ugly to me, and considering that the answer is nearly four years old, there might be a better way by now.

Answer 1

Accepted answer by jezrael

I think you can avoid agg or apply: first multiply with mul, then divide with div, and finally use groupby on the index and aggregate with sum:

import pandas as pd

lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

print (lasts)
   elapsed_time  num_cores  running_time user
0         40000          7         30000    a
1         50000          8         20000    s
2         60000          9         30000    d
3         90000          4         15000    d

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
                                             .div(86400)
                                             .groupby(level=0)
                                             .sum())
      elapsed_time  running_time
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852
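
Note that the result above keeps the original column names. If you want the elapsed_days / running_days names from the question, a rename at the end should do it. This is a minimal sketch on the same sample data; the variable name indexed is only for illustration:

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

indexed = lasts.set_index('user')

user_df = (indexed[['elapsed_time', 'running_time']]
           .mul(indexed['num_cores'], axis=0)   # core-seconds per row
           .div(86400)                          # seconds -> days
           .groupby(level=0)                    # group on the 'user' index
           .sum()
           .rename(columns={'elapsed_time': 'elapsed_days',
                            'running_time': 'running_days'}))
print(user_df)
```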

Answer 2

Answer by zthomas.nc

Another solid variation of the solution is to do what @MaxU did with this solution to a similar question and wrap the individual functions in a pandas Series, thus only requiring a reset_index() to return a DataFrame.

First, define the functions for transformations:

def ed(group):
    return (group.elapsed_time * group.num_cores).sum() / 86400

def rd(group):
    return (group.running_time * group.num_cores).sum() / 86400

Wrap them up in a Series using get_stats:

def get_stats(group):
    return pd.Series({'elapsed_days': ed(group),
                      'running_days':rd(group)})

Finally:

lasts.groupby('user').apply(get_stats).reset_index()
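
For reference, here is the whole thing as one runnable sketch on the sample data from the accepted answer. The explicit column selection before apply is my addition, not part of the original answer: some newer pandas releases warn when apply operates on the grouping column, and selecting the columns first should avoid that.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def ed(group):
    return (group.elapsed_time * group.num_cores).sum() / 86400

def rd(group):
    return (group.running_time * group.num_cores).sum() / 86400

def get_stats(group):
    # One Series per group; apply stacks them into one row per user
    return pd.Series({'elapsed_days': ed(group),
                      'running_days': rd(group)})

# Assumption: only these columns are needed inside get_stats
cols = ['elapsed_time', 'running_time', 'num_cores']
user_df = lasts.groupby('user')[cols].apply(get_stats).reset_index()
print(user_df)
```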

Answer 3

Answer by jrjc

To use the agg method on a groupby object with data from other columns of the same DataFrame, you could do the following:

Define your functions (lambda functions or not) that take a Series as input, and get the data from other column(s) using the df.loc[series.index, col] syntax. With this example:
```
ed = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400. 
rd = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400.
```
where lasts is the main DataFrame, and we access the data in the column num_cores thanks to the .loc method.
Create a dictionary with these functions and the names for the newly created columns. The keys are the names of the columns on which to apply each function, and each value is another dictionary where the key is the name of the function and the value is the function.
```
my_func = {"elapsed_time" : {"elapsed_day" : ed},
           "running_time" : {"running_days" : rd}}
```

Groupby and aggregate:

user_df = lasts.groupby("user").agg(my_func)
user_df
     elapsed_time running_time
      elapsed_day running_days
user                          
a        3.240741     2.430556
d       10.416667     3.819444
s        4.629630     1.851852

If you want to remove the old column names:

 user_df.columns = user_df.columns.droplevel(0)
 user_df
      elapsed_day  running_days
user                           
a        3.240741      2.430556
d       10.416667      3.819444
s        4.629630      1.851852
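
One caveat: renaming through a nested dictionary in agg was deprecated in pandas 0.20 and removed in 1.0. On newer versions, named aggregation (available since pandas 0.25) should give the same result without the extra column level. A minimal sketch, assuming the same sample data as the accepted answer; the function name core_days is only for illustration:

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def core_days(col):
    # col is one column of one group; fetch num_cores for the same rows
    return (col * lasts.loc[col.index, 'num_cores']).sum() / 86400.

# new_column_name=(source_column, function)
user_df = lasts.groupby('user').agg(
    elapsed_days=('elapsed_time', core_days),
    running_days=('running_time', core_days),
)
print(user_df)
```

The columns come out flat, so the droplevel step is not needed in this variant.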

HTH

Answer 4

Answer by jeremycg

In response to the bounty, we can make it more general by using partial application, with the standard library's functools.partial function.

import functools
import pandas as pd

#same data as other answer:
lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

#define the desired lambda as a function:
def myfunc(column, df, cores):
    return (column * df.loc[column.index, cores]).sum()/86400

#use the partial to define the function with a given column and df:
mynewfunc = functools.partial(myfunc, df = lasts, cores = 'num_cores')

#agg by the partial function
lasts.groupby('user').agg({'elapsed_time':mynewfunc, 'running_time':mynewfunc})

Which gives us:

      running_time  elapsed_time
user                            
a         2.430556      3.240741
d         3.819444     10.416667
s         1.851852      4.629630

This is not super useful for the example given, but may be more useful as a general example.
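
To illustrate the "more general" part, here is a rough sketch where the same base function is bound twice with different keyword arguments. The names weighted_sum and scale are assumptions for illustration and not part of the answer above:

```python
import functools
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def weighted_sum(column, df, weights, scale=1.0):
    # Multiply the group's column by another column of df, then rescale
    return (column * df.loc[column.index, weights]).sum() / scale

# Two partials built from the same function
core_days = functools.partial(weighted_sum, df=lasts,
                              weights='num_cores', scale=86400)
core_seconds = functools.partial(weighted_sum, df=lasts,
                                 weights='num_cores')

grouped = lasts.groupby('user')
days = grouped.agg({'elapsed_time': core_days, 'running_time': core_days})
seconds = grouped.agg({'elapsed_time': core_seconds,
                       'running_time': core_seconds})
print(days)
print(seconds)
```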

Answer 5

Answer by chthonicdaemon

Here is a solution which closely resembles the original idea expressed under "I suspect there is a better way".

I'll use the same testing data as the other answers:

lasts = pd.DataFrame({'user':['a','s','d','d'],
                      'elapsed_time':[40000,50000,60000,90000],
                      'running_time':[30000,20000,30000,15000],
                      'num_cores':[7,8,9,4]})

groupby.apply can accept a function which returns a DataFrame and will then automatically stitch the returned DataFrames together. There are two small catches in the code below. The first is noticing that the values passed to DataFrame are in fact single-element lists instead of just numbers.

def aggfunc(group):
    """ This function mirrors the OP's idea. Note the values below are lists """
    return pd.DataFrame({'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400], 
                         'running_days': [(group.running_time * group.num_cores).sum() / 86400]})

user_df = lasts.groupby('user').apply(aggfunc)

Result:

        elapsed_days  running_days
user                              
a    0      3.240741      2.430556
d    0     10.416667      3.819444
s    0      4.629630      1.851852

The second is that the returned dataframe has a hierarchical index (that column of zeros), which can be flattened as shown below:

user_df.index = user_df.index.levels[0]

Result:

      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852
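
An alternative way to flatten the index, as a sketch rather than part of the original answer, is to drop the inner level with droplevel. Selecting the columns before apply is also my addition, meant to avoid the warning some newer pandas releases emit when apply touches the grouping column:

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def aggfunc(group):
    # Single-element lists, so each group yields a one-row DataFrame
    return pd.DataFrame({
        'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400],
        'running_days': [(group.running_time * group.num_cores).sum() / 86400]})

cols = ['elapsed_time', 'running_time', 'num_cores']
user_df = lasts.groupby('user')[cols].apply(aggfunc)

# Drop the inner level (the column of zeros), keeping 'user' as the index
flat = user_df.droplevel(1)
print(flat)
```

Unlike assigning index.levels[0], this keeps each row attached to its own label, so it does not rely on the level values lining up with the row order.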

Answer 6

Answer by jberrio

This agg function might be what you're looking for.

I added an example dataset and applied the operation to a copy of lasts which I named lasts_.

import pandas as pd

lasts = pd.DataFrame({'user'        :['james','james','james','john','john'],
                      'elapsed_time':[ 200000, 400000, 300000,800000,900000],
                      'running_time':[ 100000, 100000, 200000,600000,700000],
                      'num_cores'   :[      4,      4,      4,     8,     8] })

# create a temporary df to add columns to, without modifying the original dataframe
lasts_ = lasts[['user']].copy()  # start from the 'user' column as a DataFrame so more columns can be added below
lasts_['elapsed_days'] = lasts['elapsed_time'] * lasts['num_cores'] / 86400
lasts_['running_days'] = lasts['running_time'] * lasts['num_cores'] / 86400

# aggregate
by_user = lasts_.groupby('user').agg({'elapsed_days': 'sum', 
                                      'running_days': 'sum' })

# by_user:
# user  elapsed_days        running_days
# james 41.66666666666667   18.51851851851852
# john  157.4074074074074   120.37037037037037

If you want to keep 'user' as a normal column instead of the index, use:

by_user = lasts_.groupby('user', as_index=False).agg({'elapsed_days': 'sum', 
                                                      'running_days': 'sum'})
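
The same idea can probably be written more compactly with assign, which returns a new DataFrame and leaves lasts untouched. This is a sketch on the example data above, not a different method:

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['james', 'james', 'james', 'john', 'john'],
                      'elapsed_time': [200000, 400000, 300000, 800000, 900000],
                      'running_time': [100000, 100000, 200000, 600000, 700000],
                      'num_cores': [4, 4, 4, 8, 8]})

by_user = (lasts
           # derive the per-row values first; assign does not modify lasts
           .assign(elapsed_days=lasts['elapsed_time'] * lasts['num_cores'] / 86400,
                   running_days=lasts['running_time'] * lasts['num_cores'] / 86400)
           .groupby('user')[['elapsed_days', 'running_days']]
           .sum())
print(by_user)
```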
