pandas: Normalize DataFrame by group

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/26046208/

Normalize DataFrame by group

Tags: python, pandas

Asked by JoshAdel

Let's say that I have some data generated as follows:

import numpy as np

N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3

and then I create a categorization variable:

indx = np.random.randint(0,3,size=N).astype(np.int32)

and generate a DataFrame:

import pandas as pd
df = pd.DataFrame(np.hstack((data, indx[:,None])),
                  columns=['a%s' % k for k in range(m)] + ['indx'])

I can get the per-group mean values as:

df.groupby('indx').mean()

What I'm unsure how to do is then subtract each group's mean from the corresponding rows of the original data, per column, so that the data in each column is normalized by the within-group mean. Any suggestions would be appreciated.

Answered by TomAugspurger

In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())

should do it.

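A quick sanity check (a sketch, using the df from the question): after the transform, every column should have mean ≈ 0 and std ≈ 1 within each group.

normalized = df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
# every group's columns should now show mean ~0 and std ~1
print(normalized.groupby(df['indx']).agg(['mean', 'std']).round(2))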

Answered by w-m

If the data contains many groups (thousands or more), the accepted answer may take a very long time to compute.

Even though groupby.transform itself is fast, as are the already vectorized calls in the lambda function (.mean(), .std() and the subtraction), the call into a pure Python function for each group creates considerable overhead.

This can be avoided by using pure vectorized Pandas/Numpy calls and not writing any Python method, as shown in ErnestScribbler's answer.

We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform:

def normalize_by_group(df, by):
    groups = df.groupby(by)
    # computes group-wise mean/std, then broadcasts the result
    # back to the size of each group chunk
    mean = groups.transform(np.mean)
    std = groups.transform(np.std)  # np.std uses ddof=0, unlike pandas' .std()
    return (df[mean.columns] - mean) / std
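
A usage sketch with the df from the question; since np.std uses ddof=0, it is the population std of each group that comes out as exactly 1:

normalized = normalize_by_group(df, 'indx')
print(normalized.groupby(df['indx']).std(ddof=0).round(2))  # ~1.0 everywhere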

For benchmarking, I changed the data generation from the original question to allow for more groups:

def gen_data(N, num_groups):
    m = 3
    data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
    indx = np.random.randint(0,num_groups,size=N).astype(np.int32)

    df = pd.DataFrame(np.hstack((data, indx[:,None])), 
                      columns=['a%s' % k for k in range(m)] + [ 'indx'])
    return df

With only two groups (thus only two Python function calls), the lambda version is only about 1.8x slower than the numpy code:

In: df2g = gen_data(10000, 2)  # 3 cols, 10000 rows, 2 groups

In: %timeit normalize_by_group(df2g, "indx")
6.61 ms ± 72.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
12.3 ms ± 130 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Increase the number of groups to 1000, and the runtime issue becomes apparent. The lambda version is 370x slower than the numpy code:

In: df1000g = gen_data(10000, 1000)  # 3 cols, 10000 rows, 1000 groups

In: %timeit normalize_by_group(df1000g, "indx")
7.5 ms ± 87.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
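
As a side note, not part of the original answer: passing the aggregation names as strings lets pandas dispatch to its cythonized group operations, avoiding the NumPy round-trip entirely. A minimal sketch; keep in mind that 'std' uses ddof=1, whereas np.std above uses ddof=0:

def normalize_by_group_fast(df, by):
    groups = df.groupby(by)
    mean = groups.transform('mean')  # cythonized fast path
    std = groups.transform('std')    # ddof=1, unlike np.std
    return (df[mean.columns] - mean) / std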

Answered by ErnestScribbler

The accepted answer works and is elegant. Unfortunately, for large datasets I think that, performance-wise, using .transform() is much slower than the following, less elegant approach (illustrated with a single column 'a0'):

means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
df = df.merge(means_stds, on='indx')
df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']

To do it for multiple columns you'll have to figure out the merge. My suggestion would be to flatten the MultiIndex columns from the aggregation, as in this answer, and then merge and normalize each column separately:

means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
# flatten the MultiIndex columns, e.g. ('a0', 'mean') -> 'a0|mean'
means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
df = df.merge(means_stds, on='indx')
for col in ['a0', 'a1']:
    df[col + '_normalized'] = (df[col] - df[col + '|mean']) / df[col + '|std']
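
If the helper columns are not needed afterwards, they can be dropped; a small optional cleanup, not in the original answer:

df = df.drop(columns=[c for c in df.columns if c.endswith('|mean') or c.endswith('|std')])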

Answered by Mike

Although this is not the prettiest solution, you could do something like this:

# stash the group labels; the subtraction below would otherwise zero them out
indx = df['indx'].copy()
for indices in df.groupby('indx').groups.values():
    # subtract the in-group mean from every column, in place
    df.loc[indices] -= df.loc[indices].mean()
df['indx'] = indx
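
Note that this only centers each group; it does not divide by the std. A sketch that restricts the loop to the value columns and scales as well (the column list is an assumption based on the question's generated data):

value_cols = ['a0', 'a1', 'a2']  # assumed from the question's data
for indices in df.groupby('indx').groups.values():
    chunk = df.loc[indices, value_cols]
    df.loc[indices, value_cols] = (chunk - chunk.mean()) / chunk.std()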