Original question: http://stackoverflow.com/questions/51705583/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas resample timeseries data to 15 mins and 45 mins - using multi-index or column
Asked by LucieCBurgess
I have some timeseries data as a Pandas dataframe which starts off with observations at 15 mins and 45 mins past the hour (time intervals of 30 mins), then changes frequency to every minute. I want to resample the data so that it has a regular frequency of every 30 minutes, at 15 past and 45 past the hour, for the whole dataframe.
I thought of two ways of achieving this.
1. Simply filter the dataframe for all observations at 15min and 45min, using the time-series data as a column in the dataframe.
2. Re-set the index so the time-series data is part of a multi-index (the 0th level of the index is the weather station, the 1st level is the time of the observation) and use the Pandas date-time timeseries functionality such as resample().
The original dataframe, weather, looks like this:
                    parsed_time          Pressure  Temp  Hum
Station    (index)
Bow        1        2018-04-15 14:15:00  1012      20.0  87
           2        2018-04-15 14:45:00  1013      20.0  87
           3        2018-04-15 15:15:00  1012      21.0  87
           4        2018-04-15 15:45:00  1014      22.0  86
           5        2018-04-15 16:00:00  1015      22.0  86
           6        2018-04-15 16:01:00  1012      25.0  86
           7        2018-04-15 16:02:00  1012      25.0  86
Stratford  8        2018-04-15 14:15:00  1011      18.0  87
           9        2018-04-15 14:45:00  1011      18.0  87
           10       2018-04-15 15:15:00  1012      18.0  87
           11       2018-04-15 15:45:00  1014      19.0  86
           12       2018-04-15 16:00:00  1014      19.0  86
           13       2018-04-15 16:01:00  1015      19.0  86
           14       2018-04-15 16:02:00  1016      20.0  86
           15       2018-04-15 16:04:00  1016      20.0  86
With method 1, I have the problem that my boolean select operations don't seem to work as expected. For example:
weather_test = weather[weather['parsed_time'].dt.minute == (15 & 45)]
gives parsed_time values like this:
2018-04-15 14:13:00
2018-04-15 15:13:00
2018-04-15 14:13:00
2018-04-15 15:13:00
weather_test = weather[weather['parsed_time'].dt.minute == (15 | 45)]
results in parsed_time values like this:
2018-04-15 14:47:00
2018-04-15 14:47:00
2018-04-15 14:47:00
2018-04-15 14:47:00
I can't find anything in the docs to explain this behaviour. What I want is pressure, temp, humidity by station at the following times:
2018-04-15 14:45:00
2018-04-15 15:15:00
2018-04-15 15:45:00
2018-04-15 16:15:00
and so on.
With method 2, I thought of resampling the data so that observations for which I have minute-by-minute data are replaced by the mean of the previous 30 minutes. This functionality only seems to work if the parsed_time column is part of the index, so I used the following code to set the parsed_time as part of a multi-index:
weather.set_index('parsed_time', append=True, inplace=True)
weather.index.set_names('station', level=0, inplace=True)
weather = weather.reset_index(level=1, drop=True)
to end up with a dataframe that looks like this:
                                Pressure  Temp  Hum
Station    parsed_time
Bow        2018-04-15 14:15:00  1012      20.0  87
           2018-04-15 14:45:00  1013      20.0  87
           2018-04-15 15:15:00  1012      21.0  87
           2018-04-15 15:45:00  1014      22.0  86
           2018-04-15 16:00:00  1015      22.0  86
           2018-04-15 16:01:00  1012      25.0  86
           2018-04-15 16:02:00  1012      25.0  86
Stratford  2018-04-15 14:15:00  1011      18.0  87
           2018-04-15 14:45:00  1011      18.0  87
           2018-04-15 15:15:00  1012      18.0  87
           2018-04-15 15:45:00  1014      19.0  86
           2018-04-15 16:00:00  1014      19.0  86
           2018-04-15 16:01:00  1015      19.0  86
           2018-04-15 16:02:00  1016      20.0  86
           2018-04-15 16:04:00  1016      20.0  86
Note that the sampling of observations varies from every 30 minutes at :15 past and :45 past to every minute (e.g. :01, :02, :14, etc), and it also varies by station - not all stations have every observation.
I tried this:
weather_test = weather.resample('30min', level=1).mean()
but this resamples without an offset and also gets rid of the station level in the multi-index.
The desired result is this:
                                Pressure  Temp  Hum
Station    parsed_time
Bow        2018-04-15 14:15:00  1012      20.0  87
           2018-04-15 14:45:00  1013      20.0  87
           2018-04-15 15:15:00  1012      21.0  87
           2018-04-15 15:45:00  1014      22.0  86
           2018-04-15 16:15:00  1013      24.0  86
Stratford  2018-04-15 14:15:00  1011      18.0  87
           2018-04-15 14:45:00  1011      18.0  87
           2018-04-15 15:15:00  1012      18.0  87
           2018-04-15 15:45:00  1014      19.0  86
           2018-04-15 16:15:00  1015      19.5  86
where the minute-by-minute observations have been resampled as the mean over a 30-minute interval at :15 and :45 past the hour.
Keeping the station as a level in the multi-index is essential. I can't use the time index as an index on its own as the values repeat for each station (and are not unique).
All help appreciated as I have been going round in circles with this one for a while now. Thanks!
I have looked at quite a few previous posts, including:
Boolean filter using a timestamp value on a dataframe in Python
How do I round datetime column to nearest quarter hour
and: Resampling a pandas dataframe with multi-index containing timeseries, which seems a bit complicated for something that should be quite simple ...
and the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
Thanks!
Accepted answer by chuni0r
Starting from your second-to-last dataframe (after using weather.reset_index('Station', inplace=True)):
                     Station    Pressure  Temp  Hum
parsed_time
2018-04-15 14:15:00  Bow        1012.0    20.0  87.0
2018-04-15 14:45:00  Bow        1013.0    20.0  87.0
2018-04-15 15:15:00  Bow        1012.0    21.0  87.0
2018-04-15 15:45:00  Bow        1014.0    22.0  86.0
2018-04-15 16:00:00  Bow        1015.0    22.0  86.0
2018-04-15 16:01:00  Bow        1012.0    25.0  86.0
2018-04-15 16:02:00  Bow        1012.0    25.0  86.0
2018-04-15 14:15:00  Stratford  1011.0    18.0  87.0
2018-04-15 14:45:00  Stratford  1011.0    18.0  87.0
2018-04-15 15:15:00  Stratford  1012.0    18.0  87.0
2018-04-15 15:45:00  Stratford  1014.0    19.0  86.0
2018-04-15 16:00:00  Stratford  1014.0    19.0  86.0
2018-04-15 16:01:00  Stratford  1015.0    19.0  86.0
2018-04-15 16:02:00  Stratford  1016.0    20.0  86.0
2018-04-15 16:04:00  Stratford  1016.0    20.0  86.0
you could use a combination of groupby and resample:
res = weather.groupby('Station').resample('30min').mean().reset_index('Station')
By default, resample chooses the bin intervals [16:00, 16:30) and [16:30, 17:00). As you already noticed, the time index is resampled without an offset, but you can add it back afterwards using DateOffset:
res.index = res.index + pd.DateOffset(minutes=15)
which yields:
                     Station    Pressure  Temp  Hum
parsed_time
2018-04-15 14:15:00  Bow        1012.00   20.0  87.0
2018-04-15 14:45:00  Bow        1013.00   20.0  87.0
2018-04-15 15:15:00  Bow        1012.00   21.0  87.0
2018-04-15 15:45:00  Bow        1014.00   22.0  86.0
2018-04-15 16:15:00  Bow        1013.00   24.0  86.0
2018-04-15 14:15:00  Stratford  1011.00   18.0  87.0
2018-04-15 14:45:00  Stratford  1011.00   18.0  87.0
2018-04-15 15:15:00  Stratford  1012.00   18.0  87.0
2018-04-15 15:45:00  Stratford  1014.00   19.0  86.0
2018-04-15 16:15:00  Stratford  1015.25   19.5  86.0
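The two steps above can be checked end to end on the sample data. The following is a minimal sketch rather than the answerer's exact code: it rebuilds the example frame by hand, selects the numeric columns explicitly before resampling (an assumption made here so the grouping column is not averaged), and moves Station back into the index at the end, since the question asks for a multi-index.

```python
import pandas as pd

# Rebuild the sample observations from the question (all on 2018-04-15).
rows = [
    ("Bow", "14:15", 1012, 20.0, 87),
    ("Bow", "14:45", 1013, 20.0, 87),
    ("Bow", "15:15", 1012, 21.0, 87),
    ("Bow", "15:45", 1014, 22.0, 86),
    ("Bow", "16:00", 1015, 22.0, 86),
    ("Bow", "16:01", 1012, 25.0, 86),
    ("Bow", "16:02", 1012, 25.0, 86),
    ("Stratford", "14:15", 1011, 18.0, 87),
    ("Stratford", "14:45", 1011, 18.0, 87),
    ("Stratford", "15:15", 1012, 18.0, 87),
    ("Stratford", "15:45", 1014, 19.0, 86),
    ("Stratford", "16:00", 1014, 19.0, 86),
    ("Stratford", "16:01", 1015, 19.0, 86),
    ("Stratford", "16:02", 1016, 20.0, 86),
    ("Stratford", "16:04", 1016, 20.0, 86),
]
weather = pd.DataFrame(
    rows, columns=["Station", "parsed_time", "Pressure", "Temp", "Hum"]
)
weather["parsed_time"] = pd.to_datetime("2018-04-15 " + weather["parsed_time"])
weather = weather.set_index("parsed_time")

# Resample each station separately into the default bins
# [hh:00, hh:30) and [hh:30, hh+1:00), taking the mean of each bin.
res = (
    weather.groupby("Station")[["Pressure", "Temp", "Hum"]]
    .resample("30min")
    .mean()
)

# Shift only the time labels by 15 minutes, then restore the
# (Station, parsed_time) multi-index.
res = res.reset_index("Station")
res.index = res.index + pd.Timedelta(minutes=15)
res = res.set_index("Station", append=True).swaplevel().sort_index()

print(res)
```

Note that only the labels move: the bins themselves still run from the hour and half-hour marks, so a reading at 14:31 would be reported under 14:45.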
Alternatively, on older pandas versions you could specify the offset directly in the resample method (loffset was deprecated in pandas 1.1 and removed in 2.0):
weather.groupby('Station').resample('30min', loffset=pd.Timedelta('15min')).mean()
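On pandas releases where loffset is unavailable, one possible alternative is to move the bin edges themselves to :15 and :45 and label each bin by its right edge, so that each row is the mean of the preceding 30 minutes. This is a sketch under the assumption of pandas 1.1 or later (where resample accepts an offset argument), and it is not the answerer's method: the binning differs from the one above, although it produces the same numbers on the sample data. The small frame below reuses only the Bow rows for brevity.

```python
import pandas as pd

# Bow observations from the question, indexed by time.
times = ["14:15", "14:45", "15:15", "15:45", "16:00", "16:01", "16:02"]
weather = pd.DataFrame(
    {
        "Station": "Bow",
        "Pressure": [1012, 1013, 1012, 1014, 1015, 1012, 1012],
        "Temp": [20.0, 20.0, 21.0, 22.0, 22.0, 25.0, 25.0],
    },
    index=pd.DatetimeIndex(
        pd.to_datetime(["2018-04-15 " + t for t in times]), name="parsed_time"
    ),
)

# offset="15min" puts the bin edges at :15 and :45;
# closed="right" and label="right" make each bin (edge - 30min, edge]
# and name it after that right edge.
res = (
    weather.groupby("Station")[["Pressure", "Temp"]]
    .resample("30min", offset="15min", closed="right", label="right")
    .mean()
)

print(res)
```

With this binning a reading at 14:31 would be reported under 14:45, the same as with the label shift, but a reading at 14:10 would fall under 14:15 here and also under 14:15 there only because it lies in the [14:00, 14:30) bin; the two approaches can disagree for readings between :15 and :30 or between :45 and :00.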
Answer by PMende
If you start off without any index (except for a row index), you can perform the following:
# Create a rounded timestamp
df['parsed_time_rounded'] = (df['parsed_time'] - pd.Timedelta('15min')).dt.round('30min') + pd.Timedelta('15min')
# Group by the station, and the rounded timestamp instead of the raw timestamp
df.groupby(['Station', 'parsed_time_rounded']).mean()
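To see what the rounding step does, it can be applied to a few timestamps in isolation. This is a small sketch with made-up values. A timestamp exactly halfway between two targets (such as 16:00) is a tie, and which side it lands on depends on pandas' rounding rule, so ties are deliberately left out here.

```python
import pandas as pd

# A handful of made-up observation times, none of them exactly on a tie.
parsed_time = pd.Series(
    pd.to_datetime(
        [
            "2018-04-15 14:15:00",
            "2018-04-15 15:52:00",
            "2018-04-15 16:01:00",
            "2018-04-15 16:29:00",
            "2018-04-15 16:31:00",
        ]
    )
)

# Shift back by 15 minutes, round to the nearest half hour, shift forward again.
# The net effect is rounding to the nearest :15 or :45 mark.
shift = pd.Timedelta("15min")
rounded = (parsed_time - shift).dt.round("30min") + shift

print(pd.DataFrame({"parsed_time": parsed_time, "rounded": rounded}))
```

Because this rounds to the nearest mark rather than assigning each reading to the bin it falls in, the groups it produces are not the same as those from resample.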
Answer by BossaNova
I don't have your data so I cannot examine this directly, but try the following syntax for the option you refer to as option 1:
weather_test = weather[(weather['parsed_time'].dt.minute == 15) | (weather['parsed_time'].dt.minute == 45)]
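The original attempts returned minutes 13 and 47 because & and | between two plain Python integers are bitwise operators: 15 & 45 is evaluated to 13 and 15 | 45 to 47 before pandas ever sees the comparison. As a sketch of an equivalent, slightly shorter filter (an alternative to the expression above, not part of the original answer), Series.isin can test both minute values at once. The frame below is made-up sample data.

```python
import pandas as pd

# Bitwise AND / OR on plain integers: 15 = 0b001111, 45 = 0b101101.
print(15 & 45)  # 0b001101 -> 13
print(15 | 45)  # 0b101111 -> 47

# Made-up observations mixing half-hourly and minute-by-minute readings.
weather = pd.DataFrame(
    {
        "parsed_time": pd.to_datetime(
            [
                "2018-04-15 14:15:00",
                "2018-04-15 14:45:00",
                "2018-04-15 15:13:00",
                "2018-04-15 15:15:00",
                "2018-04-15 15:45:00",
                "2018-04-15 15:47:00",
                "2018-04-15 16:00:00",
            ]
        ),
        "Temp": [20.0, 20.0, 21.0, 21.0, 22.0, 22.0, 22.0],
    }
)

# Keep only the rows whose minute component is 15 or 45.
minute = weather["parsed_time"].dt.minute
weather_test = weather[minute.isin([15, 45])]

print(weather_test)
```

Filtering this way simply discards the readings taken at other minutes; it does not average them, so it only covers method 1 from the question.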