Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/25234782/

Replace NaN or missing values with rolling mean or other interpolation

Tags: python, pandas, missing-data, moving-average

Asked by Alexis Eggermont

I have a pandas dataframe with monthly data for which I want to compute a 12-month moving average. However, the data for every January is missing (NaN), so I am using

pd.rolling_mean(data["variable"], 12, center=True)

but it just gives me all NaN values.

Is there a simple way to ignore the NaN values? I understand that in practice this would become an 11-month moving average.

The dataframe has other variables which have January data, so I don't want to just throw out the January columns and do an 11 month moving average.

Answered by JohnE

There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

import numpy as np
import pandas as pd
df=pd.DataFrame({ 'month' : [10,11,12,1,2,3],
                  'temp'  : [65,50,45,np.nan,40,43] }).set_index('month')

You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()

Those are improvements, but they still have the problem of overwriting existing values with rolling means. To avoid this, you could combine them with the update() method (see the pandas documentation for Series.update).

update = df['rollmean3'].copy()
update.update( df['temp'] )  # note: this is an inplace operation on the copy
df['update'] = update

There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.

df['ffill']   = df['temp'].ffill()         # previous month 
df['bfill']   = df['temp'].bfill()         # next month
df['interp']  = df['temp'].interpolate()   # mean of prev/next

In this case, interpolate() defaults to simple linear interpolation, but several other interpolation methods are also available. See the pandas documentation on interpolate for more info, or this Stack Overflow question: Interpolation on DataFrame in pandas
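
As a rough illustration of how the choice of method can matter, the sketch below compares the default linear method with method="time" on a small series whose dates are unevenly spaced. The dates and values are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced observations (not data from the question).
s = pd.Series(
    [10.0, np.nan, 50.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
)

# The default 'linear' method treats the points as equally spaced
# and ignores the index, so the gap is filled with the midpoint.
linear = s.interpolate()

# 'time' weights by the actual gap between timestamps
# (1 day out of a 4-day span here).
by_time = s.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```

With evenly spaced monthly data like the temperature example, the two methods should agree closely, so the distinction mostly matters for irregular timestamps.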

Here is the sample data with all the results:

       temp  rollmean12  rollmean3  update  ffill  bfill  interp
month                                                           
10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0

In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
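
One case where the two approaches can diverge is a gap longer than a single month. The sketch below uses a hypothetical variant of the sample data with two consecutive missing values, and uses fillna() as a stand-in for the update() step, since both keep observed values and only fill the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data with two missing months in a row.
temp = pd.Series([65, 50, 45, np.nan, np.nan, 43],
                 index=[10, 11, 12, 1, 2, 3], name="temp")

rollmean3 = temp.rolling(3, center=True, min_periods=1).mean()

# Keep observed values and fall back to the rolling mean only where temp is NaN
# (the same effect as the update() step shown earlier).
filled_by_rolling = temp.fillna(rollmean3)

# Linear interpolation draws a straight line across the whole gap.
filled_by_interp = temp.interpolate()

print(pd.DataFrame({"rolling": filled_by_rolling, "interp": filled_by_interp}))
```

Here the 3-month rolling mean simply repeats the nearest observed neighbour on each side of the gap, while interpolation produces a gradual ramp between December and March.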

Answered by user394430

The real key is setting min_periods=1. Also, as of pandas version 0.18, the proper way to call this is through a Rolling object rather than pd.rolling_mean. Therefore, your code should be

data["variable"].rolling(min_periods=1, center=True, window=12).mean()
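
To see why min_periods makes the difference, here is a minimal sketch on a made-up monthly series in which every January is missing. The series name, dates, and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: three years of data with every January missing.
idx = pd.date_range("2011-01-01", periods=36, freq="MS")
values = pd.Series(np.arange(36, dtype=float), index=idx, name="variable")
values[values.index.month == 1] = np.nan

# By default min_periods equals the window size, so a single NaN
# (or an incomplete window at the edges) makes the result NaN.
strict = values.rolling(window=12, center=True).mean()

# min_periods=1 averages whatever non-NaN values fall inside the window.
relaxed = values.rolling(window=12, center=True, min_periods=1).mean()

print(strict.isna().all(), relaxed.isna().any())
```

Because any 12 consecutive months include a January, the default call returns NaN everywhere, which matches the behaviour described in the question, while the min_periods=1 version has no missing results.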
