Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/40299055/

Pandas: How to fill null values with mean of a groupby?

Tags: python, pandas, missing-data, imputation

Asked by sfactor

I have a dataset with some missing data that looks like this:

id    category     value
1     A            NaN
2     B            NaN
3     A            10.5
4     C            NaN
5     A            2.0
6     B            1.0

I need to fill in the nulls to use the data in a model. The first time a category occurs, its value is NULL. For categories such as A and B, which have more than one value, I want to replace the nulls with the average of that category. For category C, which occurs only once, I just want to fill in the average of the rest of the data.

I know that I can simply do the following for cases like C to get the average of all the rows, but I'm stuck on computing the category-wise means for A and B and replacing the nulls.

df['value'] = df['value'].fillna(df['value'].mean()) 

I need the final df to look like this:

id    category     value
1     A            6.25
2     B            1.0
3     A            10.5
4     C            4.15
5     A            2.0
6     B            1.0
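
The 4.15 expected for category C is consistent with averaging the column after A and B have been filled, rather than averaging only the three originally observed values. A small arithmetic sketch of the difference, with the numbers copied by hand from the tables above (the variable names are illustrative):

```python
# Values of the other five rows once A and B are filled with their category means
filled = [6.25, 1.0, 10.5, 2.0, 1.0]
# Values that were actually observed in the original data
observed = [10.5, 2.0, 1.0]

mean_after_fill = sum(filled) / len(filled)
mean_observed = sum(observed) / len(observed)

print(round(mean_after_fill, 2))  # 4.15
print(round(mean_observed, 2))    # 4.5
```

So the order of the two fills matters: filling by category first and by the overall mean second gives the 4.15 shown for C.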

Accepted answer by jezrael

I think you can use groupby and apply fillna with mean. You then still get NaN if some category has only NaN values, so use the mean of all values of the column to fill the remaining NaN:

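# Fill each category's NaN with that category's mean.
# group_keys=False keeps the original row index so the result aligns with df on recent pandas versions.
df['value'] = df.groupby('category', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
# Category C has only NaN, so it is still NaN here; fall back to the overall mean.
df['value'] = df['value'].fillna(df['value'].mean())
print(df)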
   id category  value
0   1        A   6.25
1   2        B   1.00
2   3        A  10.50
3   4        C   4.15
4   5        A   2.00
5   6        B   1.00
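
The intermediate state can be inspected on the question's sample data. This is a minimal sketch, assuming the frame is rebuilt by hand as `df`. `step1` and `step2` are illustrative names, and `group_keys=False` is passed as an assumption so that the result keeps the original row index on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'category': ['A', 'B', 'A', 'C', 'A', 'B'],
    'value': [np.nan, np.nan, 10.5, np.nan, 2.0, 1.0],
})

# Step 1: fill each group's NaN with that group's mean
step1 = df.groupby('category', group_keys=False)['value'].apply(
    lambda x: x.fillna(x.mean())
)
print(step1)  # the row for category C is still NaN: its group has no values

# Step 2: fill what is left with the mean of the already-filled column
step2 = step1.fillna(step1.mean())
print(step2)
```

Printing `step1` shows why the second `fillna` is needed: a group that contains only NaN has a NaN mean, so the first step cannot fill it.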

Answer by jpp

You can also use GroupBy + transform to fill NaN values with groupwise means. This method avoids the inefficient apply + lambda. For example:

df['value'] = df['value'].fillna(df.groupby('category')['value'].transform('mean'))
df['value'] = df['value'].fillna(df['value'].mean())
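
For completeness, here is a self-contained sketch of the same transform-based approach run on the question's sample data. The frame is rebuilt by hand, and `group_mean` is an illustrative name not used in the answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'category': ['A', 'B', 'A', 'C', 'A', 'B'],
    'value': [np.nan, np.nan, 10.5, np.nan, 2.0, 1.0],
})

# transform('mean') returns one value per row, aligned with df's index;
# for category C (all NaN) the group mean is itself NaN
group_mean = df.groupby('category')['value'].transform('mean')

# Fill from the group mean first, then from the overall mean of the filled column
df['value'] = df['value'].fillna(group_mean)
df['value'] = df['value'].fillna(df['value'].mean())
print(df)
```

Because `transform` keeps the original index, the result can be passed straight to `fillna` without any realignment.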