Python Pandas: fill missing values with the mean of each group

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original Stack Overflow address: http://stackoverflow.com/questions/19966018/

Time: 2020-08-19 15:06:03  Source: igfitidea

Pandas: filling missing values by mean in each group

python pandas pandas-groupby imputation fillna

Asked by BlueFeet

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem....

Suppose I have the following dataframe

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in each "NaN" with the mean value of its "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()

Thanks a bunch.

Answered by DSM

One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

Answered by Prajit Patil

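def groupMeanValue(values):
    # transform passes each group's column as a Series, so fill it directly
    return values.fillna(values.mean())

# dft is the filled 'value' column, aligned to df's index
dft = df.groupby("name")["value"].transform(groupMeanValue)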

Answered by Philipp Schwarz

The top-ranked answer only works for a pandas DataFrame with exactly two columns. If you have more columns, use this instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
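
As a minimal sketch of the multi-column case, the frame below is hypothetical: the column names `continent` and `Crude_Birth_rate` come from the snippet above, while the `country` column and all the numbers are made up for illustration. Selecting the column before `transform` means the lambda only ever sees the numeric column to impute.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "country": ["FR", "DE", "JP", "IN"],
    "Crude_Birth_rate": [11.0, np.nan, 7.0, np.nan],
})

# Select the target column first, then fill each group's NaN with that group's mean.
df["Crude_Birth_rate"] = df.groupby("continent")["Crude_Birth_rate"].transform(
    lambda x: x.fillna(x.mean())
)
print(df)
```

The extra `country` column is left untouched because it never enters the transform.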

Answered by piRSquared

I'd do it this way

df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')

Answered by André C. Andersen

@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: grouping by multiple columns while having multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could instead select the column at the end, but then the transformation runs on all columns, only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do so.
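
The two orderings can be compared directly. The following is only a sketch on the small example frame; the names `keys`, `fill`, `select_first` and `select_last` are introduced here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["X", "X", "X", "X", "X", "X", "Y", "Y", "Y"],
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "other_value": [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})
keys = ["category", "name"]


def fill(x):
    # Replace NaN in one group's column with that group's mean.
    return x.fillna(x.mean())


# Select first: the function runs on 'value' only.
select_first = df.groupby(keys)["value"].transform(fill)

# Select last: the function runs on every non-key column, and all but 'value' is discarded.
select_last = df.groupby(keys).transform(fill)["value"]

# Both orderings should yield the same filled column.
print(select_first.equals(select_last))
```

Both forms give the same column; the difference is only in how much work is done before the unused columns are dropped.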

A performance test on a dataset enlarged by doing ...

# Repeat the frame 10,000 times to get a larger dataset
df = pd.concat([df] * 10000)

... confirms that this increases the speed in proportion to how many columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022

On a final note you can generalize even further if you want to impute more than one column, but not all:

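# Select several columns with a list (double brackets); tuple-style selection no longer works in recent pandas
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))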

Answered by Vino Vincent

# transform('mean') keeps df's index, so fillna aligns the group means row by row
df.fillna(df.groupby('name').transform('mean'), inplace=True)

Answered by jpp

fillna + groupby + transform + mean

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
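
To make the index mapping concrete, here is a small sketch on the question's data. The variable names `per_group`, `per_row` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
})

# One entry per group, indexed by group label: not aligned with df's rows.
per_group = df.groupby("name")["value"].mean()

# One entry per original row, indexed like df: lines up with df["value"].
per_row = df.groupby("name")["value"].transform("mean")

# Because per_row shares df's index, fillna can pick the right mean for each row.
filled = df["value"].fillna(per_row)

print(per_group)
print(per_row)
print(filled)
```

`per_group` has three entries (A, B, C), while `per_row` has nine, one for each row of `df`, which is what lets `fillna` match each missing value to its own group's mean.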

Answered by Hardik Pachgade

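You can also use df.apply(lambda x: x.fillna(x.mean())), where df is your dataframe. Note that on its own this fills each column with its overall mean rather than the per-group mean.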
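
Applied to a whole frame, that expression works column by column, so it uses each column's overall mean. The sketch below contrasts it with a per-group fill on the question's data; restricting the frame to the numeric `value` column and the names `overall` and `grouped` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
})

# Column-wise apply: every NaN gets the overall mean of the column.
overall = df[["value"]].apply(lambda x: x.fillna(x.mean()))["value"]

# Group-wise fill: every NaN gets the mean of its own 'name' group.
grouped = df["value"].fillna(df.groupby("name")["value"].transform("mean"))

print(overall)
print(grouped)
```

Row 1 belongs to group A, so the group-wise version fills it with A's mean, whereas the column-wise version fills it with the mean of all six known values.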

Answered by Ashish Anand

Most of the answers above use "groupby" and "transform" to fill the missing values.

But I prefer using "groupby" with "apply" to fill the missing values, which is more intuitive to me.

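>>> # group_keys=False keeps the original index so the result aligns with df (needed on recent pandas)
>>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0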

Shortcut: Groupby + Apply/Lambda + Fillna + Mean

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
...                    'class': list('ppqqrrsss')})

>>> df
   value name class
0    1.0    A     p
1    NaN    A     p
2    NaN    B     q
3    2.0    B     q
4    3.0    B     r
5    NaN    B     r
6    NaN    C     s
7    4.0    C     s
8    3.0    C     s

>>> # group_keys=False keeps the original index so the result aligns with df (needed on recent pandas)
>>> df['value'] = df.groupby(['name', 'class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))

>>> df
   value name class
0    1.0    A     p
1    1.0    A     p
2    2.0    B     q
3    2.0    B     q
4    3.0    B     r
5    3.0    B     r
6    3.5    C     s
7    4.0    C     s
8    3.0    C     s