Python: pandas GroupBy columns with NaN (missing) values

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/18429491/

Date: 2020-08-19 10:44:50 · Source: igfitidea

pandas GroupBy columns with NaN (missing) values

Tags: python, pandas, group-by, pandas-groupby, nan

Asked by Gyula Sámuel Karli

I have a DataFrame with many missing values in columns which I wish to groupby:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})

In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}

See that pandas has dropped the rows with NaN target values. (I want to include these rows!)

Since I need many such operations (many cols have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing too complicated pieces of code.

Any suggestions? Should I write a function for this or is there a simple solution?

Answered by Andy Hayden

This is mentioned in the Missing Data section of the docs:

NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example.

One workaround is to use a placeholder before doing the groupby (e.g. -1):

In [11]: df.fillna(-1)
Out[11]: 
   a   b
0  1   4
1  2  -1
2  3   6

In [12]: df.fillna(-1).groupby('b').sum()
Out[12]: 
    a
b    
-1  2
4   1
6   3

That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this GitHub issue, which uses the same placeholder hack).

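A minimal sketch of this placeholder round-trip (variable names mine), mapping the sentinel back to NaN in the result index after aggregating, assuming -1 never occurs as a real key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0]})

# Fill NaN keys with a sentinel, aggregate, then restore NaN in the index
out = df.fillna(-1).groupby('b')['a'].sum()
out.index = out.index.map(lambda k: np.nan if k == -1 else k)
```

The group that was NaN keeps its aggregate (here, the sum of a over the NaN rows) but reappears under a NaN label rather than the sentinel.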
Answered by M. Kiewisch

Ancient topic, but if someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.

in:
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
out:
    a
b   
4   1
6   3
nan 2

Answered by Kamaraju Kusumanchi

I am not able to add a comment to M. Kiewisch's answer since I do not have enough reputation points (I only have 41 but need more than 50 to comment).

Anyway, just want to point out that M. Kiewisch's solution does not work as-is and may need more tweaking. Consider, for example:

>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
   a    b
0  1  4.0
1  2  NaN
2  3  6.0
3  5  4.0
>>> df.groupby(['b']).sum()
     a
b
4.0  6
6.0  3
>>> df.astype(str).groupby(['b']).sum()
      a
b
4.0  15
6.0   3
nan   2

which shows that for group b=4.0, the corresponding value is 15 instead of 6. It is just concatenating 1 and 5 as strings instead of adding them as numbers.

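One way around the string-concatenation pitfall (a sketch, variable names mine) is to cast only the grouping key to string, leaving the value columns numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})

# Cast only the key column; 'a' stays numeric, so sums are arithmetic
out = df.groupby(df['b'].astype(str))['a'].sum()
```

The NaN rows survive under the key 'nan', while numeric aggregation still behaves as expected (b=4.0 sums to 6, not '15').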
Answered by Tuetschek

One small point to Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.

What worked for me was this:

df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)

(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)

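Note that np.isnan raises a TypeError on strings and other non-numeric values, so the lambda above only works for numeric columns; pd.isna is the more general test. A sketch under that assumption (data from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['4', np.nan, '6']})

# pd.isna handles object columns where np.isnan would raise TypeError
df['b'] = df['b'].apply(lambda x: -1 if pd.isna(x) else x)
```

After this, df.groupby('b') keeps the formerly-NaN rows under the -1 placeholder.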
Answered by Brian Preslopsky

I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:

Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.

dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])

# use .loc rather than chained indexing to avoid SettingWithCopyWarning
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None

When these differ, you can set the value back to None for the result of the aggregation function for that group.

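A runnable sketch of this size-vs-count trick (data and names mine), here with a NaN in the value column a rather than in the key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 5.0], 'b': [4, 4, 6, 4]})

g = df.groupby('b')['a'].agg(['sum', 'size', 'count'])
# size counts NaNs, count ignores them: a mismatch means the group had NaNs
g.loc[g['size'] != g['count'], 'sum'] = None
```

Group b=4 contains a NaN (size 3, count 2), so its sum is nulled out instead of silently dropping the missing value.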
Answered by Grant Langseth

All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.

A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but does get the job done:

def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]

Note that you can now simply do the following:

from collections import OrderedDict

data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]

col_names = ['col_a', 'col_b', 'value']

test_df = pd.DataFrame(data_block, index=col_names).T

grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))

This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.

Answered by cs95

pandas >= 1.1

From pandas 1.1 you'll have better control over this behavior: NA values are now allowed in the grouper using dropna=False:

# Example from the docs
df

   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# without NA (the default)
df.groupby('b').sum()
     a  c
b        
1.0  2  3
2.0  2  5

# with NA
df.groupby('b', dropna=False).sum()
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4
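A self-contained version of the docs example above (constructing the frame explicitly; requires pandas >= 1.1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1],
                   'b': [2.0, np.nan, 1.0, 2.0],
                   'c': [3, 4, 3, 2]})

# dropna=False keeps the NaN key as its own group in the result
out = df.groupby('b', dropna=False).sum()
```

No placeholder, no dtype casting: the NaN rows simply form their own group.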