Original question: http://stackoverflow.com/questions/30530001/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Python pandas time series interpolation and regularization
Asked by riccamini
I am using Python pandas for the first time. I have traffic data recorded at 5-minute intervals, in CSV format:
...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...
There are several issues:
- some timestamps have missing data (recorded as -1)
- some entries are missing altogether (sometimes for 2 or 3 consecutive hours)
- the observation frequency is not exactly 5 minutes; it drifts by a few seconds once in a while
I would like to obtain a regular time series, with an entry exactly every 5 minutes and no missing values. I have successfully interpolated the time series to approximate the -1 values with this code:
ts = pd.Series(values, index=timestamps)  # pd.TimeSeries was removed from later pandas releases
ts.interpolate(method='cubic', downcast='infer')
How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.
Answered by unutbu
Change the -1s to NaNs:
ts[ts==-1] = np.nan
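If the data is read straight from the CSV file, the -1 placeholders can also be turned into NaN at load time. The following is only a sketch: the column names `timestamp` and `count` are made up for illustration, and the sample rows from the question are inlined through `io.StringIO` in place of a real file.

```python
import io

import pandas as pd

# Sample rows from the question, inlined instead of reading a real file.
csv_text = """2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
"""

# "timestamp" and "count" are hypothetical column names (the file has no header).
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["timestamp", "count"],
    parse_dates=["timestamp"],
    index_col="timestamp",
    na_values=["-1"],  # treat the -1 placeholder as missing
)
ts = df["count"]  # float Series indexed by timestamp, with NaN where -1 was
```

This only helps if -1 can never be a legitimate measurement in the column, which seems to be the case for traffic counts.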
Then resample the data to have a 5 minute frequency.
# resample() returns a Resampler in current pandas, so aggregate explicitly
ts = ts.resample('5min').mean()
Note that if two measurements fall within the same 5-minute period, taking the mean of each period averages them together. (Older pandas versions did this by default when calling resample; newer versions require an explicit aggregation such as .mean().)
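A small illustration of that binning behaviour, using made-up timestamps and values rather than the data from the question: two readings land in the same 5-minute bin and get averaged, while a bin with no readings comes out as NaN.

```python
import pandas as pd

# Made-up readings: the first two fall into the same 08:30 bin,
# nothing falls into the 08:35 bin, and the last one is in the 08:40 bin.
idx = pd.to_datetime([
    "2015-01-04 08:30:10",
    "2015-01-04 08:34:50",
    "2015-01-04 08:41:00",
])
s = pd.Series([100.0, 200.0, 400.0], index=idx)

binned = s.resample("5min").mean()
print(binned)
# 08:30 -> 150.0 (mean of 100 and 200), 08:35 -> NaN, 08:40 -> 400.0
```

The empty bins are what the interpolation step fills in afterwards.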
Finally, you could linearly interpolate the time series according to the time:
ts = ts.interpolate(method='time')
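To see what method='time' does differently from the default linear method, here is a minimal sketch on a deliberately irregular index (the timestamps and values are invented for the example). On an index that has already been resampled to a regular grid, the two methods should give the same result.

```python
import numpy as np
import pandas as pd

# Invented data: the missing point sits 1 minute after the first reading
# and 9 minutes before the last one.
idx = pd.to_datetime([
    "2015-01-04 08:00:00",
    "2015-01-04 08:01:00",
    "2015-01-04 08:10:00",
])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

by_position = s.interpolate(method="linear")  # ignores the index, treats points as equally spaced
by_time = s.interpolate(method="time")        # weights by the elapsed time between points

print(by_position.iloc[1])  # 5.0
print(by_time.iloc[1])      # 1.0
```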
Since your data already has roughly a 5-minute frequency, you might need to resample at a higher frequency (a shorter interval, such as 1 minute) so that cubic or spline interpolation can smooth out the curve:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
                             '2015-01-04 08:34:05',
                             '2015-01-04 08:39:05',
                             '2015-01-04 08:44:05',
                             '2015-01-04 08:49:05'])
# float dtype so the -1 placeholders can be replaced by NaN
ts = pd.Series(values, index=timestamps, dtype=float)
ts[ts == -1] = np.nan
# upsample to a 1-minute grid; the new rows are NaN until interpolated
ts = ts.resample('1min').mean()

# spline interpolation requires SciPy
ts.interpolate(method='spline', order=3).plot(label='spline')
ts.interpolate(method='time').plot(label='time')
plt.legend(loc='best')
plt.show()
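One thing to be aware of: resample labels each value with the start of its bin, so a reading taken at 08:29:05 ends up at 08:25:00. If that shift matters, a possible alternative (not part of the answer above, just a sketch) is to interpolate on the union of the original timestamps and an exact 5-minute grid, then keep only the grid points. The variable names `grid`, `combined` and `regular` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same sample as above, with the -1 placeholder already written as NaN.
values = [271238, 329285, np.nan, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
                             '2015-01-04 08:34:05',
                             '2015-01-04 08:39:05',
                             '2015-01-04 08:44:05',
                             '2015-01-04 08:49:05'])
ts = pd.Series(values, index=timestamps)

# Exact 5-minute grid lying inside the observed time span.
grid = pd.date_range(ts.index.min().ceil('5min'),
                     ts.index.max().floor('5min'),
                     freq='5min')

# Add the grid timestamps as NaN rows, interpolate by time,
# then keep only the rows that sit on the grid.
combined = ts.reindex(ts.index.union(grid))
regular = combined.interpolate(method='time').reindex(grid)

print(regular)
```

Here the value at 08:30:00 is read off the straight line between the 08:29:05 and 08:34:05 readings instead of being copied from a neighbouring bin. The trade-off is that nothing gets averaged, so noisy or duplicated readings are not smoothed the way they are with resample.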



