Original question: http://stackoverflow.com/questions/15771472/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: rolling mean by time interval
Asked by Anov
I'm new to Pandas.... I've got a bunch of polling data; I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand from this question, the rolling_* functions compute the window based on a specified number of values, and not a specific datetime range.
Is there a different function that implements this functionality? Or am I stuck writing my own?
EDIT:
Sample input data:
polls_subset.tail(20)
Out[185]:
favorable unfavorable other
enddate
2012-10-25 0.48 0.49 0.03
2012-10-25 0.51 0.48 0.02
2012-10-27 0.51 0.47 0.02
2012-10-26 0.56 0.40 0.04
2012-10-28 0.48 0.49 0.04
2012-10-28 0.46 0.46 0.09
2012-10-28 0.48 0.49 0.03
2012-10-28 0.49 0.48 0.03
2012-10-30 0.53 0.45 0.02
2012-11-01 0.49 0.49 0.03
2012-11-01 0.47 0.47 0.05
2012-11-01 0.51 0.45 0.04
2012-11-03 0.49 0.45 0.06
2012-11-04 0.53 0.39 0.00
2012-11-04 0.47 0.44 0.08
2012-11-04 0.49 0.48 0.03
2012-11-04 0.52 0.46 0.01
2012-11-04 0.50 0.47 0.03
2012-11-05 0.51 0.46 0.02
2012-11-07 0.51 0.41 0.00
Output would have only one row for each date.
EDIT x2: fixed typo
Accepted answer by Martin
In the meantime, a time-window capability was added: rolling() now accepts an offset string such as '2s' as the window when the index is datetime-like. See this link.
In [1]: df = DataFrame({'B': range(5)})
In [2]: df.index = [Timestamp('20130101 09:00:00'),
...: Timestamp('20130101 09:00:02'),
...: Timestamp('20130101 09:00:03'),
...: Timestamp('20130101 09:00:05'),
...: Timestamp('20130101 09:00:06')]
In [3]: df
Out[3]:
B
2013-01-01 09:00:00 0
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
In [4]: df.rolling(2, min_periods=1).sum()
Out[4]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 5.0
2013-01-01 09:00:06 7.0
In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 7.0
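Applied to the polling data in the question, the same offset-based window could look like the sketch below. This is only an illustration: `polls` is a hypothetical five-row subset of the "favorable" column, and collapsing to one row per date with groupby(...).last() is an assumption about the desired output rather than part of the original answer.

```python
import pandas as pd

# Hypothetical subset of the question's data (only the "favorable" column).
polls = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.56, 0.51, 0.48]},
    index=pd.to_datetime(
        ["2012-10-25", "2012-10-25", "2012-10-26", "2012-10-27", "2012-10-28"]
    ),
)

# Offset windows need a sorted (monotonic) datetime index.
polls = polls.sort_index()

# Three-day window ending at each row's date; every poll counts equally.
rolled = polls.rolling("3D", min_periods=1).mean()

# Keep the last row per date so the output has one row for each date.
daily = rolled.groupby(level=0).last()
print(daily)
```

Because every poll inside the window is averaged directly, a day with several polls contributes more than a day with one poll.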
Answer by Zelazny7
What about something like this:
First resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1:
pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)
favorable unfavorable other
enddate
2012-10-25 0.495000 0.485000 0.025000
2012-10-26 0.527500 0.442500 0.032500
2012-10-27 0.521667 0.451667 0.028333
2012-10-28 0.515833 0.450000 0.035833
2012-10-29 0.488333 0.476667 0.038333
2012-10-30 0.495000 0.470000 0.038333
2012-10-31 0.512500 0.460000 0.029167
2012-11-01 0.516667 0.456667 0.026667
2012-11-02 0.503333 0.463333 0.033333
2012-11-03 0.490000 0.463333 0.046667
2012-11-04 0.494000 0.456000 0.043333
2012-11-05 0.500667 0.452667 0.036667
2012-11-06 0.507333 0.456000 0.023333
2012-11-07 0.510000 0.443333 0.013333
UPDATE: As Ben points out in the comments, the syntax changed with pandas 0.18.0. With the new syntax, the equivalent call (daily mean, forward fill, then a three-row rolling mean) would be:
df.resample("1D").mean().ffill().rolling(window=3, min_periods=1).mean()
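The resample-then-roll steps can be traced on a small frame. The sketch below uses a hypothetical five-row subset of the "favorable" column with one missing day (2012-10-28); the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical subset of the polling data; 2012-10-28 has no poll.
polls = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.56, 0.51, 0.48]},
    index=pd.to_datetime(
        ["2012-10-25", "2012-10-25", "2012-10-26", "2012-10-27", "2012-10-29"]
    ),
).sort_index()

# Step 1: one row per calendar day; duplicate days are averaged.
daily = polls.resample("1D").mean()

# Step 2: fill days without polls with the previous day's value.
filled = daily.ffill()

# Step 3: rolling mean over three daily rows.
smoothed = filled.rolling(window=3, min_periods=1).mean()
print(smoothed)
```

Each day carries equal weight here, regardless of how many polls were taken on it.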
Answer by user2689410
I just had the same question but with irregularly spaced datapoints. Resample is not really an option here. So I created my own function. Maybe it will be useful for others too:
from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
# Targets Python 2 and an older pandas API (pd.rolling_mean, pd.datetools, iterkv, .ix).
def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
        If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
        If int is passed, window is the number of observations used for calculating
        the statistic, as defined by the function pd.rolling_mean()
        If a string is passed, it must be a frequency string, e.g. '90S'. This is
        internally converted into a DateOffset object, representing the window size.
    min_periods : int
        Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
            dslice = col[x-pd.datetools.to_offset(window).delta+timedelta(0,0,1):x]
        else:
            dslice = col[x-pd.datetools.to_offset(window).delta/2+timedelta(0,0,1):
                         x+pd.datetools.to_offset(window).delta/2]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        idx = Series(data.index.to_pydatetime(), index=data.index)
        # f reads the current column through the closure variable col
        for colname, col in data.iterkv():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:,0]
    return dfout

# Example
idx = [datetime(2011, 2, 7, 0, 0),
datetime(2011, 2, 7, 0, 1),
datetime(2011, 2, 7, 0, 1, 30),
datetime(2011, 2, 7, 0, 2),
datetime(2011, 2, 7, 0, 4),
datetime(2011, 2, 7, 0, 5),
datetime(2011, 2, 7, 0, 5, 10),
datetime(2011, 2, 7, 0, 6),
datetime(2011, 2, 7, 0, 8),
datetime(2011, 2, 7, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
rm = rolling_mean(s, window='2min')
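On pandas versions that support offset windows (see the accepted answer), the same irregularly spaced example can probably be handled without a custom function. The sketch below is an assumption-laden equivalent, not the author's code: it relies on the built-in window covering the two minutes ending at each timestamp, with the end included and the start excluded, which is what the microsecond adjustment above achieves. It does not cover the center=True case.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Same irregular timestamps as in the example above.
idx = pd.DatetimeIndex([
    datetime(2011, 2, 7, 0, 0),
    datetime(2011, 2, 7, 0, 1),
    datetime(2011, 2, 7, 0, 1, 30),
    datetime(2011, 2, 7, 0, 2),
    datetime(2011, 2, 7, 0, 4),
    datetime(2011, 2, 7, 0, 5),
    datetime(2011, 2, 7, 0, 5, 10),
    datetime(2011, 2, 7, 0, 6),
    datetime(2011, 2, 7, 0, 8),
    datetime(2011, 2, 7, 0, 9),
])
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Two-minute window ending at each observation.
rm = s.rolling("2min").mean()
print(rm)
```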
Answer by Mark Horvath
user2689410's code was exactly what I needed. Here is my version (credit to user2689410), which is faster because it calculates the mean for whole rows of the DataFrame at once.
I hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where you find multiple suffixes, the type can be either.
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
# Targets Python 2 and an older pandas API (pd.rolling_mean, pd.datetools, basestring).
def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410 at http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data_df_ser : DataFrame or Series
        If a DataFrame is passed, the rolling mean is computed for all columns.
    window_i_s : int or string
        If int is passed, window_i_s is the number of observations used for calculating
        the statistic, as defined by the function pd.rolling_mean()
        If a string is passed, it must be a frequency string, e.g. '90S'. This is
        internally converted into a DateOffset object, representing the window size.
    min_periods_i : int
        Minimum number of observations in the window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """

    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta+timedelta(0,0,1):
                ts
            ]
        else:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta/2+timedelta(0,0,1):
                ts+pd.datetools.to_offset(window_i_s).delta/2
            ]
        if (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
                (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            return dslice_df_ser.mean()*np.nan  # keeps number format and whether Series or DataFrame
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s, min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)
    return mean_df_ser
Answer by InterwebIsGreat
I found that user2689410's code broke when I tried it with window='1M', because the month-end offset has no delta attribute and raised this error:
AttributeError: 'MonthEnd' object has no attribute 'delta'
I added the option to pass a relative time delta directly, so you can do similar things for user-defined periods.
Thanks for the pointers; here's my attempt. I hope it's of use.
# Targets Python 2 and an older pandas API (pd.rolling_mean, pd.datetools, basestring, .ix).
def rolling_mean(data, window, min_periods=1, center=False):
    """ Function that computes a rolling mean

    Reference:
    http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data : DataFrame or Series
        If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int, string, Timedelta or Relativedelta
        int - number of observations used for calculating the statistic,
              as defined by the function pd.rolling_mean()
        string - must be a frequency string, e.g. '90S'. This is
                 internally converted into a DateOffset object, and then
                 Timedelta representing the window size.
        Timedelta / Relativedelta - Can directly pass a timedelta.
    min_periods : int
        Minimum number of observations in window required to have a value.
    center : bool
        Point around which to 'center' the slicing.

    Returns
    -------
    Series or DataFrame, if more than one column
    """
    def f(x, time_increment):
        """Function to apply that actually computes the rolling mean
        :param x:
        :return:
        """
        if not center:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            start_date = x - time_increment + timedelta(0, 0, 1)
            end_date = x
        else:
            start_date = x - time_increment/2 + timedelta(0, 0, 1)
            end_date = x + time_increment/2
        # Select the date index from the
        dslice = col[start_date:end_date]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        time_delta = pd.datetools.to_offset(window).delta
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')
    elif isinstance(window, (timedelta, relativedelta)):
        time_delta = window
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout

And the example with a 3 day time window to calculate the mean:
from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta
idx = [datetime(2011, 2, 7, 0, 0),
datetime(2011, 2, 7, 0, 1),
datetime(2011, 2, 8, 0, 1, 30),
datetime(2011, 2, 9, 0, 2),
datetime(2011, 2, 10, 0, 4),
datetime(2011, 2, 11, 0, 5),
datetime(2011, 2, 12, 0, 5, 10),
datetime(2011, 2, 12, 0, 6),
datetime(2011, 2, 13, 0, 8),
datetime(2011, 2, 14, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
# Now try by passing the 3 days as a relative time delta directly.
rm = rolling_mean(s, window=relativedelta(days=3))
>>> rm
Out[2]:
2011-02-07 00:00:00 0.0
2011-02-07 00:01:00 0.5
2011-02-08 00:01:30 1.0
2011-02-09 00:02:00 1.5
2011-02-10 00:04:00 3.0
2011-02-11 00:05:00 4.0
2011-02-12 00:05:10 5.0
2011-02-12 00:06:00 5.5
2011-02-13 00:08:00 6.5
2011-02-14 00:09:00 7.5
Name: 0, dtype: float64
Answer by JohnE
This example seems to call for a weighted mean as suggested in @andyhayden's comment. For example, there are two polls on 10/25 and one each on 10/26 and 10/27. If you just resample and then take the mean, this effectively gives twice as much weighting to the polls on 10/26 and 10/27 compared to the ones on 10/25.
To give equal weight to each poll rather than equal weight to each day, you could do something like the following.
>>> wt = df.resample('D').count()
favorable unfavorable other
enddate
2012-10-25 2 2 2
2012-10-26 1 1 1
2012-10-27 1 1 1
>>> df2 = df.resample('D').mean()
favorable unfavorable other
enddate
2012-10-25 0.495 0.485 0.025
2012-10-26 0.560 0.400 0.040
2012-10-27 0.510 0.470 0.020
That gives you the raw ingredients for doing a poll-based mean instead of a day-based mean. As before, the polls are averaged on 10/25, but the weight for 10/25 is also stored and is double the weight on 10/26 or 10/27 to reflect that two polls were taken on 10/25.
>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()
>>> df3 = df3 / wt3
favorable unfavorable other
enddate
2012-10-25 0.495000 0.485000 0.025000
2012-10-26 0.516667 0.456667 0.030000
2012-10-27 0.515000 0.460000 0.027500
2012-10-28 0.496667 0.465000 0.041667
2012-10-29 0.484000 0.478000 0.042000
2012-10-30 0.488000 0.474000 0.042000
2012-10-31 0.530000 0.450000 0.020000
2012-11-01 0.500000 0.465000 0.035000
2012-11-02 0.490000 0.470000 0.040000
2012-11-03 0.490000 0.465000 0.045000
2012-11-04 0.500000 0.448333 0.035000
2012-11-05 0.501429 0.450000 0.032857
2012-11-06 0.503333 0.450000 0.028333
2012-11-07 0.510000 0.435000 0.010000
Note that the rolling mean for 10/27 is now 0.515000 (poll-weighted) rather than 0.521667 (day-weighted).
Also note that there have been changes to the APIs for resample and rolling as of version 0.18.0.
rolling (what's new in pandas 0.18.0)
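The poll-weighted calculation above can be sketched end to end on a few rows. This is a minimal variation rather than the exact code from the answer: it sums the raw poll values per day instead of multiplying the daily mean by the count, which avoids NaN on days without polls. The frame `polls` is a hypothetical four-row subset of the "favorable" column.

```python
import pandas as pd

# Hypothetical subset: two polls on 10/25, one each on 10/26 and 10/27.
polls = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.51, 0.56]},
    index=pd.to_datetime(
        ["2012-10-25", "2012-10-25", "2012-10-27", "2012-10-26"]
    ),
).sort_index()

# Per-day totals and per-day poll counts (the weights).
day_sum = polls.resample("D").sum()
day_count = polls.resample("D").count()

# Rolling three-day total divided by rolling three-day poll count.
weighted = (
    day_sum.rolling(3, min_periods=1).sum()
    / day_count.rolling(3, min_periods=1).sum()
)
print(weighted)
```

The values for these three days agree with the first three rows of the poll-weighted table above.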
Answer by Vlox
To keep it basic, I used a loop and something like this to get you started (my index values are datetimes):
import pandas as pd
import datetime as dt
#populate your dataframe: "df"
#...
df[df.index<(df.index[0]+dt.timedelta(hours=1))] #gives you a slice. you can then take .sum() .mean(), whatever
You can then run functions on that slice. Adding a loop that moves the start of the window past the first value in your dataframe's index rolls the window (you could also use a > rule for the start, for example).
Note that this may be less efficient for very large data or very small increments, as the slicing becomes more expensive (it works well enough for me with hundreds of thousands of rows and several columns, for hourly windows across a few weeks).
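The loop described above might be sketched as follows. The data, the column name `value`, and the choice of a forward-looking one-hour window starting at each timestamp are all assumptions made for illustration.

```python
import datetime as dt

import pandas as pd

# Hypothetical data: six observations, 30 minutes apart.
idx = pd.date_range("2020-01-01 00:00", periods=6, freq="30min")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

window = dt.timedelta(hours=1)
means = {}
for start in df.index:
    # Boolean mask for rows in the half-open interval [start, start + window).
    in_window = (df.index >= start) & (df.index < start + window)
    means[start] = df.loc[in_window, "value"].mean()

rolled = pd.Series(means)
print(rolled)
```

Each iteration scans the whole index, so the cost grows with the square of the number of rows.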
Answer by evgps
Check that your index is really datetime, not str. This can be helpful:
data.index = pd.to_datetime(data['Index']).values
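A quick way to see the difference is sketched below, assuming a hypothetical frame whose index holds date strings. Here the index itself is converted, rather than a column named 'Index' as in the line above.

```python
import pandas as pd

# Hypothetical frame whose index holds date strings, not datetimes.
data = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=["2012-10-25", "2012-10-26", "2012-10-27"],
)
was_datetime = isinstance(data.index, pd.DatetimeIndex)
print(was_datetime)

# Convert the index so that offset-based windows such as "2D" are accepted.
data.index = pd.to_datetime(data.index)
rolled = data.rolling("2D").mean()
print(rolled)
```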

