Pandas: filling missing values by mean in each group
License: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address, and attribute it to the original authors on StackOverflow.
Original address: http://stackoverflow.com/questions/19966018/
Asked by BlueFeet
This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
Answered by DSM
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
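A self-contained sketch of the same idea, using the question's data. Selecting the value column before calling transform is an assumption added here so that the call returns a single Series regardless of how many columns the frame has; the name filled is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# transform returns one entry per original row, on df's own index,
# so the result can be assigned straight back to the column.
filled = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
df["value"] = filled
print(df)
```

Each NaN is replaced by the mean of the non-missing values in its own name group; rows that already had a value are left unchanged.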
Answered by Prajit Patil
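def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group
# transform requires a function that also works column by column;
# this one indexes group['value'], so it is passed to apply instead.
dft = df.groupby("name", group_keys=False).apply(groupMeanValue)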
Answered by Philipp Schwarz
The top-ranked answer only works for a pandas DataFrame with exactly two columns. If you have more columns, use this instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
Answered by piRSquared
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
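A minimal sketch of this masked assignment on the question's frame, assuming the grouping column is called name as in the question. The names missing and group_mean are illustrative. The group mean is computed for every row, and .loc aligns it on the index and writes only the rows selected by the mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# Boolean mask of the rows that need a value.
missing = df["value"].isnull()

# Per-row group mean, indexed like df (NaN is skipped when averaging).
group_mean = df.groupby("name")["value"].transform("mean")

# Only the masked rows are overwritten; existing values are untouched.
df.loc[missing, "value"] = group_mean
print(df)
```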
Answered by André C. Andersen
@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)
... which gives:
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could select it at the end instead, but then the transformation runs on all columns, only to throw out all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do so.
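To make the two orderings concrete, here is a small sketch on the sample frame showing that selecting the column before or after transform gives the same values, so the difference is only in how much work is done. The helper name fill and the variables first and last are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["X", "X", "X", "X", "X", "X", "Y", "Y", "Y"],
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "other_value": [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})


def fill(x):
    """Fill missing entries with the mean of the group being processed."""
    return x.fillna(x.mean())


keys = ["category", "name"]

# Column selected first: fill runs on "value" only.
first = df.groupby(keys)["value"].transform(fill)

# Column selected last: fill runs on every non-key column,
# and everything except "value" is discarded afterwards.
last = df.groupby(keys).transform(fill)["value"]

print(first.equals(last))
```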
A performance test, after enlarging the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that the speedup is proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
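The comparison can also be tried with timeit. This is only a sketch under assumptions: make_big_frame is a hypothetical stand-in for the elided generate_data, built by repeating the sample frame, and the repeat and run counts are arbitrary. Absolute timings depend on the machine and the pandas version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd


def make_big_frame(repeats=1000):
    """Hypothetical stand-in for generate_data: the sample frame repeated."""
    small = pd.DataFrame({
        "category": ["X", "X", "X", "X", "X", "X", "Y", "Y", "Y"],
        "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
        "other_value": [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    })
    return pd.concat([small] * repeats, ignore_index=True)


def fill(x):
    """Fill missing entries with the mean of the group being processed."""
    return x.fillna(x.mean())


big = make_big_frame()
keys = ["category", "name"]

# Total time for 5 runs of each variant.
t_first = timeit.timeit(lambda: big.groupby(keys)["value"].transform(fill), number=5)
t_last = timeit.timeit(lambda: big.groupby(keys).transform(fill)["value"], number=5)

print(f"column selected first: {t_first:.4f}s")
print(f"column selected last:  {t_last:.4f}s")
```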
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
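A runnable sketch of the multi-column case on the sample frame. The name cols is illustrative. The column list is passed in double brackets because, as far as I know, recent pandas versions no longer accept a bare tuple of column names after groupby.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["X", "X", "X", "X", "X", "X", "Y", "Y", "Y"],
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "other_value": [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# Columns to impute; each is filled with its own per-group mean.
cols = ["value", "other_value"]
df[cols] = df.groupby(["category", "name"])[cols].transform(
    lambda x: x.fillna(x.mean())
)
print(df)
```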
Answered by Vino Vincent
df.fillna(df.groupby('name').transform('mean'), inplace=True)
Answered by jpp
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
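The intermediate result can be inspected to see that mapping. A short sketch on the question's data, with group_mean as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "value": [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# One mean per original row, on the same index as df,
# so fillna can line the two Series up row by row.
group_mean = df.groupby("name")["value"].transform("mean")
print(group_mean)

df["value"] = df["value"].fillna(group_mean)
print(df)
```

Because group_mean already carries the original index, no lambda is needed: fillna picks the matching entry for each missing row.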
Answered by Hardik Pachgade
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).
Answered by Ashish Anand
Most of the answers above use "groupby" and "transform" to fill the missing values.
But I prefer using "groupby" with "apply", which is more intuitive to me.
>>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
Shortcut: Groupby + Apply/Lambda + Fillna + Mean
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df
value name class
0 1.0 A p
1 NaN A p
2 NaN B q
3 2.0 B q
4 3.0 B r
5 NaN B r
6 NaN C s
7 4.0 C s
8 3.0 C s
>>> df['value'] = df.groupby(['name', 'class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s