Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/51705583/

Time: 2020-09-14 05:53:43  Source: igfitidea

Pandas resample timeseries data to 15 mins and 45 mins - using multi-index or column

Tags: python, pandas, dataframe, time-series, multi-index

Asked by LucieCBurgess

I have some timeseries data as a Pandas dataframe which starts off with observations at 15 mins past the hour and 45 mins past (time intervals of 30 mins) then changes frequency to every minute. I want to resample the data so that it has a regular frequency of every 30 minutes, at 15 past and 45 past the hours for the whole dataframe.

I thought of two ways of achieving this.
1. Simply filter the dataframe for all observations at 15min and 45min, using the time-series data as a column in the dataframe.
2. Re-set the index so the time-series data is part of a multi-index (the 0th level of the index is the weather station, the 1st level is the time of the observation) and use the Pandas date-time timeseries functionality such as resample().

The original dataframe, weather, looks like this:

                  parsed_time           Pressure  Temp    Hum
Station   (index)   
Bow       1        2018-04-15 14:15:00   1012     20.0    87
          2        2018-04-15 14:45:00   1013     20.0    87
          3        2018-04-15 15:15:00   1012     21.0    87
          4        2018-04-15 15:45:00   1014     22.0    86
          5        2018-04-15 16:00:00   1015     22.0    86
          6        2018-04-15 16:01:00   1012     25.0    86
          7        2018-04-15 16:02:00   1012     25.0    86
Stratford 8        2018-04-15 14:15:00   1011     18.0    87
          9        2018-04-15 14:45:00   1011     18.0    87
          10       2018-04-15 15:15:00   1012     18.0    87
          11       2018-04-15 15:45:00   1014     19.0    86
          12       2018-04-15 16:00:00   1014     19.0    86
          13       2018-04-15 16:01:00   1015     19.0    86
          14       2018-04-15 16:02:00   1016     20.0    86
          15       2018-04-15 16:04:00   1016     20.0    86

With method 1, I have the problem that my boolean select operations don't seem to work as expected. For example

weather_test = weather[weather['parsed_time'].dt.minute == (15 & 45)]

gives parsed_time values like this:

2018-04-15 14:13:00
2018-04-15 15:13:00

weather_test = weather[weather['parsed_time'].dt.minute == (15 | 45)]

results in parsed_time values like this:

2018-04-15 14:47:00
2018-04-15 14:47:00

I can't find anything in the docs to explain this behaviour. What I want is pressure, temp, humidity by station at the following times:

2018-04-15 14:45:00    
2018-04-15 15:15:00  
2018-04-15 15:45:00
2018-04-15 16:15:00

and so on.

With method 2, I thought of resampling the data so that observations for which I have minute-by-minute data are replaced by the mean of the previous 30 minutes. This functionality only seems to work if the parsed_time column is part of the index, so I used the following code to set the parsed_time as part of a multi-index:

weather.set_index('parsed_time', append=True, inplace=True)
weather.index.set_names('station', level=0, inplace=True)
weather = weather.reset_index(level=1, drop=True)

to end up with a dataframe that looks like this:

                                  Pressure   Temp    Hum
Station    parsed_time
Bow        2018-04-15 14:15:00    1012       20.0    87
           2018-04-15 14:45:00    1013       20.0    87
           2018-04-15 15:15:00    1012       21.0    87
           2018-04-15 15:45:00    1014       22.0    86
           2018-04-15 16:00:00    1015       22.0    86
           2018-04-15 16:01:00    1012       25.0    86
           2018-04-15 16:02:00    1012       25.0    86
Stratford  2018-04-15 14:15:00    1011       18.0    87
           2018-04-15 14:45:00    1011       18.0    87
           2018-04-15 15:15:00    1012       18.0    87
           2018-04-15 15:45:00    1014       19.0    86
           2018-04-15 16:00:00    1014       19.0    86
           2018-04-15 16:01:00    1015       19.0    86
           2018-04-15 16:02:00    1016       20.0    86
           2018-04-15 16:04:00    1016       20.0    86

Note that the sampling of observations varies from every 30 minutes at :15 past and :45 past to every minute (e.g. :01, :02, :14, etc), and it also varies by station - not all stations have every observation.

I tried this:

weather_test = weather.resample('30min', level=1).mean()

but this resamples without an offset and also gets rid of the station level in the multi-index.

The desired result is this:

                                  Pressure   Temp    Hum
Station    parsed_time
Bow        2018-04-15 14:15:00    1012       20.0    87
           2018-04-15 14:45:00    1013       20.0    87
           2018-04-15 15:15:00    1012       21.0    87
           2018-04-15 15:45:00    1014       22.0    86
           2018-04-15 16:15:00    1013       24.0    86
Stratford  2018-04-15 14:15:00    1011       18.0    87
           2018-04-15 14:45:00    1011       18.0    87
           2018-04-15 15:15:00    1012       18.0    87
           2018-04-15 15:45:00    1014       19.0    86
           2018-04-15 16:15:00    1015       19.5    86

where the minute-by-minute observations have been resampled as the mean over a 30-minute interval at :15 and :45 past the hour.

Keeping the station as a level in the multi-index is essential. I can't use the time index as an index on its own as the values repeat for each station (and are not unique).

All help appreciated as I have been going round in circles with this one for a while now. Thanks!

I have looked at quite a few previous posts, including "Boolean filter using a timestamp value on a dataframe in Python", "How do I round datetime column to nearest quarter hour", and "Resampling a pandas dataframe with multi-index containing timeseries", which seems a bit complicated for something that should be quite simple ...

and the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html

Thanks!

Accepted answer by chuni0r

Starting from your second-to-last dataframe (after using weather.reset_index('Station', inplace=True)):

                           Station  Pressure  Temp   Hum
parsed_time                                         
2018-04-15 14:15:00        Bow    1012.0  20.0  87.0
2018-04-15 14:45:00        Bow    1013.0  20.0  87.0
2018-04-15 15:15:00        Bow    1012.0  21.0  87.0
2018-04-15 15:45:00        Bow    1014.0  22.0  86.0
2018-04-15 16:00:00        Bow    1015.0  22.0  86.0
2018-04-15 16:01:00        Bow    1012.0  25.0  86.0
2018-04-15 16:02:00        Bow    1012.0  25.0  86.0
2018-04-15 14:15:00  Stratford    1011.0  18.0  87.0
2018-04-15 14:45:00  Stratford    1011.0  18.0  87.0
2018-04-15 15:15:00  Stratford    1012.0  18.0  87.0
2018-04-15 15:45:00  Stratford    1014.0  19.0  86.0
2018-04-15 16:00:00  Stratford    1014.0  19.0  86.0
2018-04-15 16:01:00  Stratford    1015.0  19.0  86.0
2018-04-15 16:02:00  Stratford    1016.0  20.0  86.0
2018-04-15 16:04:00  Stratford    1016.0  20.0  86.0

you could use a combination of groupby and resample:

res = weather.groupby('Station').resample('30min').mean().reset_index('Station')

By default, resample chooses the bin intervals [16:00, 16:30) and [16:30, 17:00). As you already noticed, the time index is resampled without an offset, but you can add it back afterwards using DateOffset:

res.index = res.index + pd.DateOffset(minutes=15)

which yields:

                           Station  Pressure  Temp   Hum
parsed_time                                         
2018-04-15 14:15:00        Bow   1012.00  20.0  87.0
2018-04-15 14:45:00        Bow   1013.00  20.0  87.0
2018-04-15 15:15:00        Bow   1012.00  21.0  87.0
2018-04-15 15:45:00        Bow   1014.00  22.0  86.0
2018-04-15 16:15:00        Bow   1013.00  24.0  86.0
2018-04-15 14:15:00  Stratford   1011.00  18.0  87.0
2018-04-15 14:45:00  Stratford   1011.00  18.0  87.0
2018-04-15 15:15:00  Stratford   1012.00  18.0  87.0
2018-04-15 15:45:00  Stratford   1014.00  19.0  86.0
2018-04-15 16:15:00  Stratford   1015.25  19.5  86.0

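Alternatively, you could specify the offset directly in the resample method. Note that the loffset argument used here was deprecated in pandas 1.1 and removed in pandas 2.0, so on newer versions shift the index after resampling as shown above: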

weather.groupby('Station').resample('30min', loffset=pd.Timedelta('15min')).mean()
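
The approach in this answer can be sketched end to end without relying on loffset. This is only an illustration under some assumptions, not the answerer's exact code: it bins with pd.Grouper rather than groupby(...).resample(...) (so empty bins are omitted instead of filled with NaN), it uses just the last few rows of the question's sample data, and it keeps Station as an index level, as the question asks.

```python
import pandas as pd

# A small subset of the sample data from the question.
rows = [
    ("Bow", "2018-04-15 15:45:00", 1014, 22.0, 86),
    ("Bow", "2018-04-15 16:00:00", 1015, 22.0, 86),
    ("Bow", "2018-04-15 16:01:00", 1012, 25.0, 86),
    ("Bow", "2018-04-15 16:02:00", 1012, 25.0, 86),
    ("Stratford", "2018-04-15 15:45:00", 1014, 19.0, 86),
    ("Stratford", "2018-04-15 16:00:00", 1014, 19.0, 86),
    ("Stratford", "2018-04-15 16:01:00", 1015, 19.0, 86),
    ("Stratford", "2018-04-15 16:02:00", 1016, 20.0, 86),
    ("Stratford", "2018-04-15 16:04:00", 1016, 20.0, 86),
]
weather = pd.DataFrame(
    rows, columns=["Station", "parsed_time", "Pressure", "Temp", "Hum"]
)
weather["parsed_time"] = pd.to_datetime(weather["parsed_time"])

# Average within each station over 30-minute bins that start on the
# hour and half hour, e.g. [16:00, 16:30).
res = weather.groupby(
    ["Station", pd.Grouper(key="parsed_time", freq="30min")]
).mean()

# Relabel each bin by its midpoint (:15 or :45), keeping Station in the index.
res = res.reset_index()
res["parsed_time"] = res["parsed_time"] + pd.Timedelta(minutes=15)
res = res.set_index(["Station", "parsed_time"])

print(res)
```

The 15-minute shift only changes the labels; the bins themselves still run from :00 to :30 and from :30 to :00.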

Answer by PMende

If you start off without any index (except for a row index), you can perform the following:

# Create a rounded timestamp
df['parsed_time_rounded'] = (df['parsed_time'] - pd.Timedelta('15min')).dt.round('30min') + pd.Timedelta('15min')
# Group by the station, and the rounded timestamp instead of the raw timestamp
df.groupby(['Station', 'parsed_time_rounded']).mean()
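
To see what the rounding expression above does, here is a small sketch on a handful of timestamps taken from the question's data. The variable names (times, shift, rounded, labels) are illustrative only.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2018-04-15 14:45:00",
    "2018-04-15 15:15:00",
    "2018-04-15 16:00:00",
    "2018-04-15 16:01:00",
    "2018-04-15 16:04:00",
]))

shift = pd.Timedelta("15min")

# Shift back by 15 minutes so that the :15/:45 slots land on :00/:30,
# round to the nearest 30 minutes, then shift forward again.
rounded = (times - shift).dt.round("30min") + shift

labels = rounded.dt.strftime("%H:%M").tolist()
print(labels)
```

Observations already at :15 or :45 are left unchanged, while the minute-by-minute readings just after 16:00 are all assigned to the 16:15 slot. A timestamp that falls exactly halfway between two slots is a tie, so it is worth checking on your own data how such values are assigned.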

Answer by BossaNova

I don't have your data so I cannot examine this directly, but try the following syntax for the option you refer to as option 1:

weather_test = weather[(weather['parsed_time'].dt.minute == 15) | (weather['parsed_time'].dt.minute == 45)]
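
The behaviour in the question follows from how Python evaluates (15 & 45) and (15 | 45): both are bitwise operations on plain integers and are computed before the comparison, giving 13 and 47. That matches the :13 and :47 rows the question reports. The sketch below shows this, along with isin as an equivalent and shorter way to write the filter above. The sample timestamps are made up for illustration.

```python
import pandas as pd

# Bitwise AND / OR on plain integers, evaluated before any comparison.
and_result = 15 & 45   # 0b001111 & 0b101101 == 0b001101
or_result = 15 | 45    # 0b001111 | 0b101101 == 0b101111
print(and_result, or_result)

# Hypothetical timestamps, including the :13 and :47 minutes that the
# original expressions accidentally selected.
times = pd.Series(pd.to_datetime([
    "2018-04-15 14:13:00",
    "2018-04-15 14:15:00",
    "2018-04-15 14:45:00",
    "2018-04-15 14:47:00",
    "2018-04-15 16:01:00",
]))

# Keep only observations taken at 15 or 45 minutes past the hour.
mask = times.dt.minute.isin([15, 45])
selected = times[mask]
print(selected)
```

With a dataframe, the same mask would be built from weather['parsed_time'].dt.minute and used as weather[mask].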