Original question: http://stackoverflow.com/questions/25234941/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python regularise irregular time series with linear interpolation
Asked by Diane
I have a time series in pandas that looks like this:
Values
1992-08-27 07:46:48 28.0
1992-08-27 08:00:48 28.2
1992-08-27 08:33:48 28.4
1992-08-27 08:43:48 28.8
1992-08-27 08:48:48 29.0
1992-08-27 08:51:48 29.2
1992-08-27 08:53:48 29.6
1992-08-27 08:56:48 29.8
1992-08-27 09:03:48 30.0
I would like to resample it to a regular time series with 15-minute time steps, where the values are linearly interpolated. Basically, I would like to get:
Values
1992-08-27 08:00:00 28.2
1992-08-27 08:15:00 28.3
1992-08-27 08:30:00 28.4
1992-08-27 08:45:00 28.8
1992-08-27 09:00:00 29.9
However, using the resample method (df.resample('15Min')) from pandas, I get:
Values
1992-08-27 08:00:00 28.20
1992-08-27 08:15:00 NaN
1992-08-27 08:30:00 28.60
1992-08-27 08:45:00 29.40
1992-08-27 09:00:00 30.00
I have tried the resample method with different 'how' and 'fill_method' parameters but never got exactly the results I wanted. Am I using the wrong method?
I figure this is a fairly simple query, but I have searched the web for a while and couldn't find an answer.
Thanks in advance for any help I can get.
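For context, the unwanted output in the question is consistent with resample aggregating each 15-minute bin by its mean rather than interpolating. The sketch below reproduces that, assuming the data is loaded into a DataFrame with a DatetimeIndex; the `timestamp` column header is a name introduced here for illustration only.

```python
import io

import pandas as pd

# The question's data; the "timestamp" header is an assumption added for parsing.
raw = io.StringIO("""\
timestamp,Values
1992-08-27 07:46:48,28.0
1992-08-27 08:00:48,28.2
1992-08-27 08:33:48,28.4
1992-08-27 08:43:48,28.8
1992-08-27 08:48:48,29.0
1992-08-27 08:51:48,29.2
1992-08-27 08:53:48,29.6
1992-08-27 08:56:48,29.8
1992-08-27 09:03:48,30.0
""")
df = pd.read_csv(raw, index_col="timestamp", parse_dates=True)

# Each 15-minute bin is reduced to the mean of the samples that fall inside it.
# The 08:15 bin contains no samples, so it comes out as NaN.
binned = df.resample("15min").mean()
print(binned)
```

The 08:30 bin averages 28.4 and 28.8, and the 08:45 bin averages the four samples between 08:48:48 and 08:56:48, which is why those rows differ from the interpolated values the question asks for.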
Accepted answer by chrisb
It takes a bit of work, but try this out. The basic idea is to find the two timestamps closest to each resample point and interpolate between them. np.searchsorted is used to find the dates closest to the resample point.
import numpy as np
import pandas as pd

# empty frame with the desired index (the first, partial 07:45 bin is dropped)
rs = pd.DataFrame(index=df.resample('15min').mean().iloc[1:].index)
# positions of the closest timestamp at or after each resample point
idx_after = np.searchsorted(df.index.values, rs.index.values)
# values and timestamps before/after each resample point
rs['after'] = df.loc[df.index[idx_after], 'Values'].values
rs['before'] = df.loc[df.index[idx_after - 1], 'Values'].values
rs['after_time'] = df.index[idx_after]
rs['before_time'] = df.index[idx_after - 1]
# calculate the new weighted value: each neighbour's weight is the share of
# the span that lies on the other side of the resample point
rs['span'] = (rs['after_time'] - rs['before_time'])
rs['before_weight'] = (rs['after_time'] - rs.index) / rs['span']
# I got errors here unless I turn the index to a series
rs['after_weight'] = (pd.Series(data=rs.index, index=rs.index) - rs['before_time']) / rs['span']
rs['Values'] = rs.eval('before * before_weight + after * after_weight')
After all that, hopefully the right answer:
In [161]: rs['Values']
Out[161]:
1992-08-27 08:00:00 28.188571
1992-08-27 08:15:00 28.286061
1992-08-27 08:30:00 28.376970
1992-08-27 08:45:00 28.848000
1992-08-27 09:00:00 29.891429
Freq: 15T, Name: Values, dtype: float64
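The same before/after weighting can also be sketched more compactly with np.interp, which performs piecewise-linear interpolation between the two neighbouring samples. This is not part of the original answer. It assumes the data is in a DataFrame with a DatetimeIndex, and the `timestamp` header and the `to_seconds` helper are names introduced here for illustration.

```python
import io

import numpy as np
import pandas as pd

# The question's data; the "timestamp" header is an assumption added for parsing.
raw = io.StringIO("""\
timestamp,Values
1992-08-27 07:46:48,28.0
1992-08-27 08:00:48,28.2
1992-08-27 08:33:48,28.4
1992-08-27 08:43:48,28.8
1992-08-27 08:48:48,29.0
1992-08-27 08:51:48,29.2
1992-08-27 08:53:48,29.6
1992-08-27 08:56:48,29.8
1992-08-27 09:03:48,30.0
""")
df = pd.read_csv(raw, index_col="timestamp", parse_dates=True)

# Regular 15-minute grid that lies inside the observed time range.
grid = pd.date_range("1992-08-27 08:00:00", "1992-08-27 09:00:00", freq="15min")


def to_seconds(index):
    """Hypothetical helper: seconds elapsed since the first sample."""
    return (index - df.index[0]).total_seconds()


# np.interp needs numeric, increasing x-values, so timestamps become seconds.
values = np.interp(to_seconds(grid), to_seconds(df.index), df["Values"].to_numpy())
regular = pd.Series(values, index=grid, name="Values")
print(regular)
```

Note that np.interp clamps to the first or last value for points outside the observed range, so the grid should stay within the data.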
Answer by mstringer
You can do this with traces. First, create a TimeSeries with your irregular measurements like you would a dictionary:
from datetime import datetime, timedelta
import traces

ts = traces.TimeSeries([
(datetime(1992, 8, 27, 7, 46, 48), 28.0),
(datetime(1992, 8, 27, 8, 0, 48), 28.2),
...
(datetime(1992, 8, 27, 9, 3, 48), 30.0),
])
Then regularize using the sample method:
ts.sample(
sampling_period=timedelta(minutes=15),
start=datetime(1992, 8, 27, 8),
end=datetime(1992, 8, 27, 9),
interpolate='linear',
)
This produces a regularized series. In the plot that accompanies the original answer, the gray dots are the original data and the orange series is the regularized version with linear interpolation.
The interpolated values are:
1992-08-27 08:00:00 28.189
1992-08-27 08:15:00 28.286
1992-08-27 08:30:00 28.377
1992-08-27 08:45:00 28.848
1992-08-27 09:00:00 29.891
Answer by Alberto Garcia-Raboso
The same result that @mstringer gets can be achieved purely in pandas. The trick is to first resample by second, using interpolation to fill in the intermediate values (.resample('s').interpolate()), and then downsample to 15-minute periods (.resample('15T').asfreq()).
import io
import pandas as pd
data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0
1992-08-27 08:00:48,28.2
1992-08-27 08:33:48,28.4
1992-08-27 08:43:48,28.8
1992-08-27 08:48:48,29.0
1992-08-27 08:51:48,29.2
1992-08-27 08:53:48,29.6
1992-08-27 08:56:48,29.8
1992-08-27 09:03:48,30.0
''')
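# the squeeze=True keyword was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)
# '15min' is the current spelling of the older '15T' alias
res = s.resample('s').interpolate().resample('15min').asfreq().dropna()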
print(res)
Output:
1992-08-27 08:00:00 28.188571
1992-08-27 08:15:00 28.286061
1992-08-27 08:30:00 28.376970
1992-08-27 08:45:00 28.848000
1992-08-27 09:00:00 29.891429
Freq: 15T, Name: Values, dtype: float64
Answer by BE-Bob
I recently had to resample acceleration data that was non-uniformly sampled. It was generally sampled at the correct frequency, but had intermittent delays that accumulated.
I found this question and combined mstringer's and Alberto Garcia-Raboso's answers using pure pandas and numpy. This method creates a new index at the desired frequency and then interpolates, without the intermediate step of interpolating at a higher frequency.
# from Alberto Garcia-Raboso above
import io
import pandas as pd
data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0
1992-08-27 08:00:48,28.2
1992-08-27 08:33:48,28.4
1992-08-27 08:43:48,28.8
1992-08-27 08:48:48,29.0
1992-08-27 08:51:48,29.2
1992-08-27 08:53:48,29.6
1992-08-27 08:56:48,29.8
1992-08-27 09:03:48,30.0
''')
# the squeeze=True keyword was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)
Code to do the interpolation:
import numpy as np
# create the new index and a new series full of NaNs
# (pd.date_range replaces the pd.DatetimeIndex(start=...) form, which later pandas versions no longer accept)
new_index = pd.date_range(start='1992-08-27 08:00:00', freq='15min', periods=5)
new_series = pd.Series(np.nan, index=new_index)
# concat the old and new series, remove duplicates (if any), and sort by time
comb_series = pd.concat([s, new_series])
comb_series = comb_series[~comb_series.index.duplicated(keep='first')].sort_index()
# interpolate to fill the NaNs, weighting by the time between samples
comb_series = comb_series.interpolate(method='time')
Output:
>>> print(comb_series[new_index])
1992-08-27 08:00:00 28.188571
1992-08-27 08:15:00 28.286061
1992-08-27 08:30:00 28.376970
1992-08-27 08:45:00 28.848000
1992-08-27 09:00:00 29.891429
Freq: 15T, dtype: float64
As before, you can use any interpolation method that scipy supports, and this technique works with DataFrames as well (that is what I originally used it for). Finally, note that interpolate defaults to the 'linear' method, which ignores the time information in the index and treats the values as equally spaced, so it gives wrong results for non-uniformly spaced data.
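The difference between the two methods can be seen on a tiny example. This is a minimal sketch with made-up timestamps and values (not data from the question): one gap sits 1 second after the first sample and 9 seconds before the last one.

```python
import numpy as np
import pandas as pd

# Made-up, unevenly spaced timestamps: the missing point is much closer
# in time to the first sample than to the last one.
index = pd.to_datetime([
    "2000-01-01 00:00:00",
    "2000-01-01 00:00:01",
    "2000-01-01 00:00:10",
])
s = pd.Series([0.0, np.nan, 10.0], index=index)

# 'linear' treats the rows as equally spaced, so the gap is filled
# with the midpoint of its neighbours.
by_position = s.interpolate(method="linear")

# 'time' weights by elapsed time, so the gap is filled with a value
# one tenth of the way from 0.0 to 10.0.
by_time = s.interpolate(method="time")

print(by_position.iloc[1])  # the midpoint, 5.0
print(by_time.iloc[1])      # about 1.0
```

For a DatetimeIndex, method='index' behaves the same way as method='time' here, since both use the index values as the x-axis.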


