Applying a custom groupby aggregate function to output a binary outcome in pandas python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/26812763/
Asked by finstats
I have a dataset of trader transactions where the variable of interest is Buy/Sell, which is binary and takes on the value of 1 if the transaction was a buy and 0 if it was a sell. An example looks as follows:
Trader Buy/Sell
A 1
A 0
B 1
B 1
B 0
C 1
C 0
C 0
I would like to calculate the net Buy/Sell for each trader such that if the trader had more than 50% of trades as buys, he would have a Buy/Sell of 1; if he had less than 50% buys, he would have a Buy/Sell of 0; and if it were exactly 50%, he would have NA (and would be disregarded in future calculations).
So for trader A, the buy proportion is (number of buys)/(total number of trades) = 1/2 = 0.5, which gives NA.
For trader B it is 2/3 ≈ 0.67, which gives a 1.
For trader C it is 1/3 ≈ 0.33, which gives a 0.
The table should look like this:
Trader Buy/Sell
A NA
B 1
C 0
Ultimately I want to compute the total aggregated number of buys, which in this case is 1, and the aggregated total number of trades (disregarding NAs), which in this case is 2. I am not interested in the second table; I am only interested in the aggregated number of buys and the aggregated total number (count) of Buy/Sell.
How can I do this in Pandas?
Accepted answer by unutbu
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()

# 1 if the buy proportion is above 0.5, 0 if below, NaN if exactly 0.5
result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                               choicelist=[1, 0], default=np.nan)
print(result)
yields
Buy/Sell sum count
Trader
A NaN 1 2
B 1 2 3
C 0 1 3
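From this table, the aggregates the question ultimately asks for fall out directly, because pandas sum and count skip NaN by default. A minimal sketch, reusing the result frame built above:
total_buys = result['Buy/Sell'].sum()      # NaN row for trader A is skipped -> 1.0
total_trades = result['Buy/Sell'].count()  # count() ignores NaN -> 2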
My original answer used a custom aggregator, categorize:
def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
result = result.rename(columns={'categorize': 'Buy/Sell'})
While calling a custom function may be convenient, performance is often significantly slower than with the built-in aggregators (such as groupby/agg/mean). The built-in aggregators are Cythonized, while custom functions reduce performance to plain Python for-loop speeds.
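As a hedged illustration of the two code paths (this lambda is just a stand-in for any custom function, not part of the original answer):
fast = df.groupby('Trader')['Buy/Sell'].mean()                   # Cythonized fast path
slow = df.groupby('Trader')['Buy/Sell'].agg(lambda x: x.mean())  # plain Python call per group
assert fast.equals(slow)  # same values, very different speed when groups are many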
The difference in speed is particularly significant when the number of groups is large. For example, with a 10000-row DataFrame with 1000 groups,
import numpy as np
import pandas as pd

np.random.seed(2017)
N = 10000
df = pd.DataFrame({
    'Buy/Sell': np.random.randint(2, size=N),
    'Trader': np.random.randint(1000, size=N)})

def using_select(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count'])
    means = grouped['Buy/Sell'].mean()
    result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                                   choicelist=[1, 0], default=np.nan)
    return result

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

def using_custom_function(df):
    result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
    result = result.rename(columns={'categorize': 'Buy/Sell'})
    return result
using_select is over 50x faster than using_custom_function:
In [69]: %timeit using_custom_function(df)
10 loops, best of 3: 132 ms per loop
In [70]: %timeit using_select(df)
100 loops, best of 3: 2.46 ms per loop
In [71]: 132/2.46
Out[71]: 53.65853658536585
Answered by SGI
Pandas cut() provides an improvement on @unutbu's answer, getting the result in half the time.
def using_select(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count'])
    means = grouped['Buy/Sell'].mean()
    result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                                   choicelist=[1, 0], default=np.nan)
    return result

def using_cut(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count', 'mean'])
    # Bin the mean: [0, 0.5] -> label 0, (0.5, 1] -> label 1
    # (include_lowest=True makes the first bin closed at 0)
    result['Buy/Sell'] = pd.cut(result['mean'], [0, 0.5, 1], labels=[0, 1],
                                include_lowest=True)
    # cut() puts a mean of exactly 0.5 into the 0 bin, so override it with NaN
    result['Buy/Sell'] = np.where(result['mean'] == 0.5, np.nan, result['Buy/Sell'])
    return result
using_cut() runs in 5.21 ms per loop on average on my system, whereas using_select() runs in 10.4 ms per loop.
%timeit using_select(df)
10.4 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit using_cut(df)
5.21 ms ± 147 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
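As a quick sanity check (a sketch; this comparison is my addition, not part of the original answer), both functions should agree on the Buy/Sell column:
out_select = using_select(df)
out_cut = using_cut(df)
# cast both columns to float so the NaN/0/1 values compare cleanly
pd.testing.assert_series_equal(out_select['Buy/Sell'].astype(float),
                               out_cut['Buy/Sell'].astype(float))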

