pandas: How to use the split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45013983/

Date: 2020-09-14 03:58:00

How to use split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

Tags: python, pandas, normalization, split-apply-combine

Asked by volkerH

I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with date and experiment conditions as well as additional non-numerical conditions such as filenames.


I would like to


  • use the split-apply-combine paradigm
  • normalize within groups, using aggregate statistics of subgroups
  • use different normalizations (e.g. divide-by-control-mean, Z-score)
  • apply this to all numerical columns (observables)
  • finally, generate an augmented data table which has the same structure as the original, but with additional columns, e.g. for column Observable1 a column normalized_Observable1 should be added

A simplified data table that has this structure can be generated with this code snippet:


import numpy as np
import pandas as pd
df = pd.DataFrame({
   'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
   'date':  ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
   'observation1':  [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'observation2':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'observation3':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'rawsource': ["1.tif", "2.tif", "3.tif",  "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)

and would look like this:

   condition      date  observation1  observation2  observation3 rawsource
0       ctrl  20170131           1.2          3.10          2.00     1.tif
1        abc  20170131           2.2          2.20          1.20     2.tif
2       ctrl  20170131           1.3          2.10          1.20     3.tif
3        abc  20170131           1.1          1.20          2.01     4.tif
4        def  20170131           2.3          2.40          2.55     5.tif
5       ctlr  20170606           2.3          1.20          2.05     6.tif
6       ctlr  20170606           4.2          1.50          1.66     7.tif
7     asdasd  20170606           3.3          1.33          3.20     8.tif
8       afff  20170606           5.1          1.50          3.21     9.tif
9       afff  20170606           3.3          1.60          3.04    10.tif
10       gr1  20170404           3.4          1.40          8.01    11.tif
11       gr2  20170404           5.5          1.30          9.10    12.tif
12       gr2  20170404           9.9          0.90          7.06    13.tif
13      ctrl  20170404           3.2          0.78          8.10    14.tif
14      ctrl  20170404           1.1          1.20          7.90    15.tif
15      kjkj  20161212           3.3          4.00          5.12    16.tif
16      asht  20161212           1.2          5.00          5.23    17.tif
17      ctrl  20161212           5.4          6.00          5.15    18.tif

Now, for each experiment date I have different experimental conditions, but I always have a condition named ctrl. One of the normalizations I'd like to perform is to calculate (for each numerical column) the mean of the control experiment for that date and then divide all observables from that date by their corresponding mean.

I can quickly calculate some of the per-date, per-condition summary statistics using:

grsummary = df.groupby(["date", "condition"]).agg([min, max, np.nanmean, np.nanstd])
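As a side note, the string aggregators 'mean' and 'std' already skip NaNs in pandas, so the same summary step can be sketched with them instead of the numpy nan-functions. The following self-contained sketch uses a hypothetical three-row stand-in for the table above and shows how to pull one statistic out of the resulting MultiIndex columns:

```python
import pandas as pd

# Hypothetical stand-in for the question's table (same column layout)
df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl'],
    'date': ['20170131', '20170131', '20170131'],
    'observation1': [1.2, 2.2, 1.3],
})

# String aggregators skip NaN by default, like np.nanmean / np.nanstd
grsummary = df.groupby(['date', 'condition']).agg(['min', 'max', 'mean', 'std'])

# Columns are a MultiIndex of (observable, statistic),
# rows a MultiIndex of (date, condition)
mean_ctrl = grsummary.loc[('20170131', 'ctrl'), ('observation1', 'mean')]
```

Here `mean_ctrl` is the scalar ctrl mean of observation1 for that date.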

Then I would like to apply these summary statistics in a normalization for each experiment date:


grdate = df.groupby("date")

and apply the normalization in a fashion like this:


def normalize_by_ctrlmean(grp_frame, summarystats):
    # the following is only pseudo-code as I don't know how to do this
    grp_frame / summarystats(nanmean)

grdate.apply(normalize_by_ctrlmean, summarystats=grsummary)

The last step is only pseudo-code, and this is what I'm struggling with. I could do the normalization using nested for-loops over the dates, conditions, and column names of the numerical columns, but I'm new to the split-apply-combine paradigm and I think there must be a simpler solution. Any help greatly appreciated.

Answer by WhoIsHyman

Here's how you can do this using df.apply:




Split


Since you want to perform operations 'per date', you only need to split by date:


grdate = df.groupby("date", group_keys=False)  # group_keys=False keeps the original row index, so the apply() result below aligns with df (needed on newer pandas versions)



Apply & Combine


Next, define a transformation function that can be applied to each group, taking the group itself as an argument.


In your case, the function should compute the mean of the group's ctrl values and then divide all of the group's observations by this mean:

def norm_apply(group):

    # Select the 'ctrl' condition
    ctrl_selected = group[group['condition']=='ctrl']

    # Extract its numerical values
    ctrl_numeric = ctrl_selected.select_dtypes(include=[np.number])

    # Compute the means (column-wise)
    ctrl_means = np.nanmean(ctrl_numeric,axis=0) 

    # Extract numerical values for all conditions
    group_numeric = group.select_dtypes(include=[np.number])

    # Divide by the ctrl means
    divided = group_numeric / ctrl_means

    # Return result
    return divided

(You can do this as a silly one-liner, if you wish...)


norm_apply = lambda x : x.select_dtypes(include=[np.number]) / np.nanmean(x[x['condition']=='ctrl'].select_dtypes(include=[np.number]),axis=0)


Now you can simply apply this function to your grouped dataframe:

normed = grdate.apply(norm_apply)

This should give you the values you need, combined into the same shape/order as your original df:


normed.head()

>>   observation1  observation2  observation3
0          0.96      1.192308       1.25000
1          1.76      0.846154       0.75000
2          1.04      0.807692       0.75000
3          0.88      0.461538       1.25625
4          1.84      0.923077       1.59375
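The question also asks about other normalizations such as the Z-score; the same split-apply pattern covers those by swapping the transformation inside the function. A minimal self-contained sketch (the mini-table, the function name zscore_by_ctrl, and the population-std convention ddof=0 are all illustrative choices, not part of the original answer):

```python
import pandas as pd

# Hypothetical mini-table; two 'ctrl' rows per date so the std is defined
df = pd.DataFrame({
    'condition': ['ctrl', 'ctrl', 'abc', 'ctrl', 'ctrl', 'abc'],
    'date': ['20170131'] * 3 + ['20170606'] * 3,
    'observation1': [1.0, 3.0, 5.0, 4.0, 6.0, 10.0],
})

def zscore_by_ctrl(group):
    # Center and scale every numeric column by the ctrl subgroup's statistics
    ctrl = group.loc[group['condition'] == 'ctrl'].select_dtypes('number')
    return (group.select_dtypes('number') - ctrl.mean()) / ctrl.std(ddof=0)

# group_keys=False keeps the original row index so results align with df
zscored = df.groupby('date', group_keys=False).apply(zscore_by_ctrl)
```

The ctrl rows come out centered around 0 within each date, just as they come out around 1.0 with the divide-by-mean version.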



Merge into the Original DataFrame


One way of adding these results back to the original df would be like this:


# Add prefix to column names
normed = normed.add_prefix('normed_')

# Concatenate with initial data frame
final = pd.concat([df,normed],axis=1)
display(final.head())
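As an alternative sketch to the apply-then-concat route, pandas' groupby().transform() broadcasts a group statistic back onto every row, which sidesteps the index-alignment concerns entirely. The mini-table and the names obs_cols/masked/normed below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table for illustration
df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'abc'],
    'date': ['20170131', '20170131', '20170606', '20170606'],
    'observation1': [2.0, 3.0, 4.0, 6.0],
    'observation2': [1.0, 4.0, 2.0, 1.0],
})

obs_cols = ['observation1', 'observation2']

# Blank out the non-ctrl rows, then let transform('mean') (which skips NaN)
# broadcast each date's ctrl mean back onto every row of that date
masked = df[obs_cols].copy()
masked[df['condition'] != 'ctrl'] = np.nan
ctrl_means = masked.groupby(df['date']).transform('mean')

normed = (df[obs_cols] / ctrl_means).add_prefix('normed_')
final = pd.concat([df, normed], axis=1)
```

Because transform always returns a result indexed like the input, the concat step needs no reindexing.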


And finally, you can group by date and condition and look at the means:



final.groupby(['date','condition']).mean()

If everything worked correctly, the means for the ctrl condition should all be 1.0.





(Side note: Although Ian Thompson's answer also works, I believe this approach sticks more closely to the split-apply-combine ideology.)

Answer by Ian Thompson

I'm kind of confused by what you are wanting with the function. I don't have enough reputation to comment so I will give my best guess to try and answer your question.


Seeing that your function is called normalize_by_ctrlmean, I assume that you always want to divide by the mean of the ctrl group for each date, within each observation. To do this we'll have to tidy your data up a bit using the melt function.

df1 = df.melt(id_vars = ['condition',
                         'date',
                         'rawsource'],
              value_vars = ['observation1',
                            'observation2',
                            'observation3'],
              var_name = 'observations')

df1.head()

  condition      date rawsource  observations  value
0      ctrl  20170131     1.tif  observation1    1.2
1       abc  20170131     2.tif  observation1    2.2
2      ctrl  20170131     3.tif  observation1    1.3
3       abc  20170131     4.tif  observation1    1.1
4       def  20170131     5.tif  observation1    2.3

Next we will calculate the mean for the ctrl group:

ctrl_mean = (df1[df1.condition == 'ctrl']
             .groupby(['date', 'observations'], as_index=False)['value']
             .mean()
             .rename(columns={'value': 'ctrl_mean'}))

ctrl_mean

       date  observations  ctrl_mean
0  20161212  observation1       5.40
1  20161212  observation2       6.00
2  20161212  observation3       5.15
3  20170131  observation1       1.25
4  20170131  observation2       2.60
5  20170131  observation3       1.60
6  20170404  observation1       2.15
7  20170404  observation2       0.99
8  20170404  observation3       8.00

(Note that 20170606 does not appear: its control rows are misspelled ctlr in the example data, so that date has no ctrl group.)

Merge this dataframe with the melted dataframe.


df2 = df1.merge(ctrl_mean,
                how = 'inner',
                on = ['date',
                      'observations'])

df2.head()

  condition      date rawsource  observations  value  ctrl_mean
0      ctrl  20170131     1.tif  observation1    1.2       1.25
1       abc  20170131     2.tif  observation1    2.2       1.25
2      ctrl  20170131     3.tif  observation1    1.3       1.25
3       abc  20170131     4.tif  observation1    1.1       1.25
4       def  20170131     5.tif  observation1    2.3       1.25

And lastly, divide the value column by the ctrl_mean column and insert the result into the dataframe.

df2.insert(df2.shape[1],
           'normalize_by_ctrlmean',
           df2.loc[:, 'value'] / df2.loc[:, 'ctrl_mean'])

df2.head()

  condition      date rawsource  observations  value  ctrl_mean  normalize_by_ctrlmean
0      ctrl  20170131     1.tif  observation1    1.2       1.25                   0.96
1       abc  20170131     2.tif  observation1    2.2       1.25                   1.76
2      ctrl  20170131     3.tif  observation1    1.3       1.25                   1.04
3       abc  20170131     4.tif  observation1    1.1       1.25                   0.88
4       def  20170131     5.tif  observation1    2.3       1.25                   1.84
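The same result can also be sketched on the melted frame without the merge, via groupby().transform(). The mini-frame below is a hypothetical stand-in; one side effect worth noting is that dates lacking a ctrl row simply get NaN here, rather than being dropped by the inner join:

```python
import pandas as pd

# Hypothetical melted frame: one row per (date, observation) measurement
df1 = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'abc'],
    'date': ['20170131', '20170131', '20170131', '20170131'],
    'observations': ['observation1', 'observation1',
                     'observation2', 'observation2'],
    'value': [2.0, 3.0, 5.0, 1.0],
})

# Keep only ctrl values (others become NaN), then broadcast their mean
# over each (date, observations) group; transform('mean') skips the NaNs
ctrl_vals = df1['value'].where(df1['condition'] == 'ctrl')
ctrl_mean = ctrl_vals.groupby([df1['date'], df1['observations']]).transform('mean')

df1['normalize_by_ctrlmean'] = df1['value'] / ctrl_mean
```

Every ctrl row lands at exactly 1.0, and the other rows are expressed relative to their date's control mean.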

Hope this gets you closer to what you were needing.


EDIT


Per your comment, I'll show how to get back to a dataframe similar to the one you had, with one column per observation: first with the pivot_table function, and then with the groupby function.

pivot_table


(df2.pivot_table(index = ['date',  # columns to use as the index
                          'condition',
                          'rawsource'],
                 columns = 'observations',  # one column per value of 'observations'
                 values = ['value',  # these become the values in each column
                           'ctrl_mean',
                           'normalize_by_ctrlmean'])
    .swaplevel(axis = 1)   # swap the two column levels
    .sort_index(axis = 1)  # sort and "smoosh" them together
    .reset_index())        # so you can refer to specific columns


groupby


(df2.groupby(['date',  # group by these columns to make the index
              'condition',
              'rawsource',
              'observations'])
    .agg({'value': 'max',  # each group holds a single row, so 'max' just returns that value
          'ctrl_mean': 'max',
          'normalize_by_ctrlmean': 'max'})
    .unstack('observations')  # make columns out of the 'observations' level
    .swaplevel(axis = 1)      # same final steps as in the pivot_table example
    .sort_index(axis = 1)
    .reset_index())


In addition, you can remove the swaplevel and sort_index calls to keep the aggregate columns on the top level instead of the observations.
