Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/18265930/

Date: 2020-09-13 21:05:31  Source: igfitidea

pandas: Filling missing values within a group

Tags: python, pandas

Asked by Marius

I have some data from an experiment, and within each trial there are some single values, surrounded by NA's, that I want to fill out to the entire trial:

import numpy as np
import pandas as pd

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                np.nan, 'A1', np.nan, np.nan, np.nan]})
Out[177]: 
   cs_name  trial
0      NaN      1
1       A1      1
2      NaN      1
3      NaN      1
4      NaN      2
5      NaN      2
6       B2      2
7      NaN      2
8       A1      3
9      NaN      3
10     NaN      3
11     NaN      3

I'm able to fill these values across the whole trial by using both bfill() and ffill(), but I'm wondering if there is a better way to achieve this.

df['cs_name'] = df.groupby('trial')['cs_name'].ffill()
df['cs_name'] = df.groupby('trial')['cs_name'].bfill()

Expected output:

   cs_name  trial
0       A1      1
1       A1      1
2       A1      1
3       A1      1
4       B2      2
5       B2      2
6       B2      2
7       B2      2
8       A1      3
9       A1      3
10      A1      3
11      A1      3
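
For reference, the ffill()/bfill() approach from the question can be written as a single self-contained step. This is only a sketch: the variable name `filled` is made up for illustration, and chaining both fills inside one transform is an assumption about how one might tidy the code, not something stated in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                np.nan, 'A1', np.nan, np.nan, np.nan],
})

# Forward-fill, then backward-fill, separately inside each trial,
# so a value never leaks across a trial boundary.
filled = df.groupby('trial')['cs_name'].transform(lambda s: s.ffill().bfill())
```

Because the fill runs per group, the leading NaN rows of trial 2 pick up 'B2' from their own trial rather than 'A1' from trial 1.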

Answered by Andy Hayden

An alternative approach is to use first_valid_index and a transform:

In [11]: g = df.groupby('trial')

In [12]: g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Out[12]: 
0     A1
1     A1
2     A1
3     A1
4     B2
5     B2
6     B2
7     B2
8     A1
9     A1
10    A1
11    A1
Name: cs_name, dtype: object
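
To see what the lambda does for a single group, here is a small sketch on one hand-built Series (the index labels 4 to 7 mimic trial 2; the names `s`, `label`, `value` and `empty` are illustrative only):

```python
import numpy as np
import pandas as pd

# One trial's worth of data, keeping the original row labels 4..7.
s = pd.Series([np.nan, np.nan, 'B2', np.nan], index=[4, 5, 6, 7])

label = s.first_valid_index()  # label (not position) of the first non-NA entry
value = s.loc[label]           # look the value up by that label

# For a Series with no valid entries, first_valid_index() returns None.
empty = pd.Series([np.nan, np.nan]).first_valid_index()
```

Since transform broadcasts a scalar result to every row of the group, returning `value` fills the whole trial. The `None` case is why a group containing only NaN would make the `s.loc[...]` lookup fail.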

This ought to be more efficient than using ffill followed by a bfill...

And use this to change the cs_name column:

df['cs_name'] = g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])

Note: I think it would be a nice enhancement to have a method to grab the first non-null object in pandas. In numpy it's an open request, and I don't think there is currently such a method (I could be wrong!)...
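
On that note: as far as I know, current pandas versions provide GroupBy.first(), which skips missing values by default, and transform accepts the string 'first' to broadcast that value back over each group. The sketch below is not part of the original answer, and behavior may differ in older pandas releases; `first_per_trial` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                np.nan, 'A1', np.nan, np.nan, np.nan],
})

# One row per trial: the first non-null cs_name in that trial.
first_per_trial = df.groupby('trial')['cs_name'].first()

# Same value, broadcast back to the original row index.
df['cs_name'] = df.groupby('trial')['cs_name'].transform('first')
```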

Answered by Federico De Cillia

If you want to avoid the error that appears when some groups contain only NaN, you could do the following (note that I changed the df so that the group with trial=1 contains only NaN):

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,1,1], 
'cs_name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'B2', np.nan, 
'A3', np.nan, np.nan, np.nan, np.nan,np.nan]})

g = df.groupby('trial')

g['cs_name'].transform(lambda s: 'No values to aggregate' if 
    pd.isnull(s).all() else s.loc[s.first_valid_index()])

df['cs_name'] = g['cs_name'].transform(lambda s: 'No values to aggregate' if 
    pd.isnull(s).all() else s.loc[s.first_valid_index()])

This way, when every value in a particular group is NaN, that group is filled with 'No values to aggregate' (or whatever value you choose) instead of raising an error.

Hope this helps :)

Federico
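
The same idea can also be written with a named function instead of a repeated lambda. This is a minimal sketch: `first_valid_or_default` and its `default` parameter are hypothetical names introduced here, not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1],
    'cs_name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'B2', np.nan,
                'A3', np.nan, np.nan, np.nan, np.nan, np.nan],
})

def first_valid_or_default(s, default='No values to aggregate'):
    """Hypothetical helper: first non-NA value of s, or default if there is none."""
    idx = s.first_valid_index()  # None when the whole group is NaN
    return default if idx is None else s.loc[idx]

# transform broadcasts the scalar returned for each group to all of its rows.
df['cs_name'] = df.groupby('trial')['cs_name'].transform(first_valid_or_default)
```

Checking `idx is None` avoids a separate isnull().all() pass and keeps the placeholder text in one place.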
