Missing values in time series in Python with pandas

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/49308530/

Time: 2020-09-14 05:20:03  Source: igfitidea

Missing values in Time Series in python

Tags: python, pandas, nan, imputation

Asked by Marco Miglionico

I have a time series dataframe. It is quite big and contains some missing values in two columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor or the average of the previous and following timestamps. Is there an easy way to do it? I have tried fancyimpute, but the dataset contains around 180000 examples and it gives a memory error.

Answered by Peter Leimbigler

Consider interpolate (see the pandas documentation). This example shows how to fill gaps of any size with a straight line:

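import numpy as np
import pandas as pd

# Float values so that NaN can be assigned without changing the column dtype
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'), 'value': np.arange(10.0)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df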
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    NaN
3 2013-01-01 03:00:00    NaN
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    NaN
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0

# Assign the result back; inplace=True on a selected column may not update df
df['value'] = df['value'].interpolate(method='linear')
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    2.0
3 2013-01-01 03:00:00    3.0
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    6.0
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0
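
Applied to the question's data, the same call can fill both columns at once. The sketch below is a minimal illustration on made-up numbers: the column names 'Humidity' and 'Pressure' come from the question, while the values, the hourly index, and the choice of limit_direction='both' are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings; only the column names come from the question.
index = pd.date_range(start='2013-01-01', periods=6, freq='60min')
weather = pd.DataFrame(
    {
        'Humidity': [40.0, np.nan, 44.0, 46.0, np.nan, np.nan],
        'Pressure': [np.nan, 1010.0, np.nan, np.nan, 1016.0, 1018.0],
    },
    index=index,
)

# Linear interpolation fills interior gaps column by column.
# limit_direction='both' also allows NaNs at the start of a column to be
# filled, not only those that come after a valid value.
filled = weather[['Humidity', 'Pressure']].interpolate(
    method='linear', limit_direction='both'
)
print(filled)
```

Interpolation works on one column at a time and needs no extra memory beyond the result, which may make it a better fit for a frame of around 180000 rows than a model-based imputer.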

Answered by Scott Boston

You could use rolling like this:

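import numpy as np
import pandas as pd

# Float dtype so that NaN can be assigned without changing the column dtype
frame = pd.DataFrame({'Humidity': np.arange(50, 64, dtype=float)})

frame.loc[[3, 7, 10, 11], 'Humidity'] = np.nan

# Fill each NaN with the mean of the trailing 4-row window (NaNs are skipped)
frame.Humidity.fillna(frame.Humidity.rolling(4, min_periods=1).mean())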

Output:

0     50.0
1     51.0
2     52.0
3     51.0
4     54.0
5     55.0
6     56.0
7     55.0
8     58.0
9     59.0
10    58.5
11    58.5
12    62.0
13    63.0
Name: Humidity, dtype: float64
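
Note that rolling(4) looks only at the current row and the three rows before it, so each gap above is filled from earlier values alone. If values on both sides of a gap should contribute, as the question suggests, one option is a centered window. A minimal sketch on the same data (the window size of 3 is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

humidity = pd.Series(np.arange(50, 64, dtype=float), name='Humidity')
humidity.loc[[3, 7, 10, 11]] = np.nan

# center=True makes the window cover one row before and one row after;
# NaNs inside the window are skipped when computing the mean
centered_mean = humidity.rolling(3, center=True, min_periods=1).mean()
filled = humidity.fillna(centered_mean)
print(filled)
```

Single-row gaps get the average of their two neighbours (rows 3 and 7 become 53.0 and 57.0). In the two-row gap each row sees only one valid neighbour, so rows 10 and 11 become 59.0 and 62.0.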

Answered by Yogesh

interpolate() and fillna():

Since this is a time series question, I will use output plots in the answer for explanation.

Suppose we have time series data as follows (x axis = number of days, y axis = quantity):

pdDataFrame.set_index('Dates')['QUANTITY'].plot(figsize = (16,6))

We can see that there are some NaN values in the time series: 19.4% of the total data is NaN. Now we want to impute these null/NaN values.
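
The share of missing values can be computed directly. A small sketch on a made-up series (the data here is hypothetical, not the data plotted in this answer):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the QUANTITY column
quantity = pd.Series([5.0, np.nan, 7.0, 8.0, np.nan, 10.0, 11.0, 12.0, np.nan, 14.0])

# isna() gives a boolean mask; its mean is the fraction of missing values
nan_percent = quantity.isna().mean() * 100
print(f"% of NaN = {nan_percent:.3f}% of total data")
```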

I will show the output of the interpolate and fillna methods for filling the NaN values in the data.

interpolate():

First we will use interpolate:

pdDataFrame.set_index('Dates')['QUANTITY'].interpolate(method='linear').plot(figsize = (16,6))

NOTE: method='time' is not used here; method='linear' ignores the index and treats the values as equally spaced.
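
For comparison, pandas also offers method='time', which weights the interpolation by the actual spacing of the timestamps and requires a DatetimeIndex. A minimal sketch with made-up, irregularly spaced timestamps shows how it differs from method='linear':

```python
import numpy as np
import pandas as pd

# Hypothetical irregular timestamps: the gap sits 1 day after the first
# observation and 3 days before the next one.
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05'])
series = pd.Series([0.0, np.nan, 8.0], index=dates)

linear = series.interpolate(method='linear')  # midpoint by row position
by_time = series.interpolate(method='time')   # weighted by elapsed time

print(linear.iloc[1], by_time.iloc[1])
```

With evenly spaced timestamps the two methods give the same result, so the distinction only matters when observations are irregular.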

fillna() with backfill method:

# fillna(method='backfill') is deprecated since pandas 2.1; bfill() is the equivalent call
pdDataFrame.set_index('Dates')['QUANTITY'].bfill().plot(figsize = (16,6))

fillna() with backfill method and limit=7:

limit: the maximum number of consecutive NaN values to forward/backward fill. In other words, if a gap has more than this number of consecutive NaNs, it will only be partially filled.

# Equivalent to the deprecated fillna(method='backfill', limit=7)
pdDataFrame.set_index('Dates')['QUANTITY'].bfill(limit=7).plot(figsize = (16,6))
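
To see what limit does in isolation, here is a minimal sketch on a made-up series with a gap of four consecutive NaNs. It uses bfill(), the dedicated pandas method for backward filling; the series and the limit of 2 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0])

# Backward fill copies the next valid value (6.0) into the gap, but only
# into the 2 NaNs closest to it; the rest of the gap stays NaN.
partly_filled = series.bfill(limit=2)
print(partly_filled)
```

The two rows that remain NaN would need a second pass, for example a forward fill or an interpolation.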

I find the fillna function more useful, but you can use either method to fill the NaN values in both columns.

For more details about these functions, refer to the following links:

  1. fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html#pandas.Series.fillna
  2. interpolate: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html

There is one more library, impyute, that you can check out. For more details about this library, see: https://pypi.org/project/impyute/

Answered by Ray

It looks like your data is hourly. How about just taking the average of the hour before and the hour after? Or changing the window size to 2, meaning the average of the two hours before and the two hours after?

Imputing using other variables can be expensive, and you should only consider those methods if the simple methods do not work well (e.g. they introduce too much noise).
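
The first suggestion, averaging the hour before and the hour after, can be sketched with shift(). This is one possible reading of the idea, on made-up hourly data; it only fills a gap when both neighbours are present, so longer gaps stay NaN and need another method.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start='2013-01-01', periods=7, freq='60min')
humidity = pd.Series([50.0, np.nan, 54.0, 55.0, np.nan, np.nan, 60.0], index=index)

# Mean of the previous and the next observation for every row;
# the sum is NaN whenever either neighbour is missing
neighbour_mean = (humidity.shift(1) + humidity.shift(-1)) / 2
filled = humidity.fillna(neighbour_mean)
print(filled)
```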